mirror of
https://github.com/donnemartin/system-design-primer.git
synced 2025-09-17 09:30:39 +03:00
poriting to noat.cards
This commit is contained in:
@@ -1,6 +1,6 @@
|
||||
# Design Mint.com
|
||||
|
||||
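**Note: This document links directly to relevant areas found in the [system design topics](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#系统设计主题索引) to avoid duplication. Refer to the linked content for general talking points, tradeoffs, and alternatives.**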
|
||||
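**Note: This document links directly to relevant areas found in the [system design topics](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#系统设计主题索引) to avoid duplication. Refer to the linked content for general talking points, tradeoffs, and alternatives.**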
|
||||
|
||||
## Step 1: Outline use cases and constraints
|
||||
|
||||
@@ -80,7 +80,7 @@
|
||||
|
||||
> Outline a high level design with all important components.
|
||||
|
||||

|
||||

|
||||
|
||||
## Step 3: Design core components
|
||||
|
||||
@@ -88,9 +88,9 @@
|
||||
|
||||
### Use case: User connects to a financial account
|
||||
|
||||
We could store info on the 10 million users in a [relational database](https://github.com/donnemartin/system-design-primer#relational-database-management-system-rdbms). We should discuss the [use cases and tradeoffs between choosing SQL or NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql).
|
||||
We could store info on the 10 million users in a [relational database](https://github.com/donnemartin/system-design-primer#relational-database-management-system-rdbms) . We should discuss the [use cases and tradeoffs between choosing SQL or NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql) .
|
||||
|
||||
* The **Client** sends a request to the **Web Server**, running as a [reverse proxy](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
|
||||
* The **Client** sends a request to the **Web Server**, running as a [reverse proxy](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
|
||||
* The **Web Server** forwards the request to the **Accounts API** server
|
||||
* The **Accounts API** server updates the **SQL Database** `accounts` table with the newly entered account info
|
||||
|
||||
@@ -106,13 +106,13 @@ account_url varchar(255) NOT NULL
|
||||
account_login varchar(32) NOT NULL
|
||||
account_password_hash char(64) NOT NULL
|
||||
user_id int NOT NULL
|
||||
PRIMARY KEY(id)
|
||||
FOREIGN KEY(user_id) REFERENCES users(id)
|
||||
PRIMARY KEY(id)
|
||||
FOREIGN KEY(user_id) REFERENCES users(id)
|
||||
```
|
||||
|
||||
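We'll create an [index](https://github.com/donnemartin/system-design-primer#use-good-indices) on `id`, `user_id`, and `created_at` to speed up lookups (log-time instead of scanning the entire table) and to keep the data in memory. Reading 1 MB sequentially from memory takes about 250 microseconds, while reading from SSD takes 4x and from disk takes 80x longer.<sup><a href=https://github.com/donnemartin/system-design-primer#latency-numbers-every-programmer-should-know>1</a></sup>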
|
||||
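We'll create an [index](https://github.com/donnemartin/system-design-primer#use-good-indices) on `id`, `user_id`, and `created_at` to speed up lookups (log-time instead of scanning the entire table) and to keep the data in memory. Reading 1 MB sequentially from memory takes about 250 microseconds, while reading from SSD takes 4x and from disk takes 80x longer.<sup><a href=https://github.com/donnemartin/system-design-primer#latency-numbers-every-programmer-should-know>1</a></sup>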
|
||||
|
||||
We'll use a public [**REST API**](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest):
|
||||
We'll use a public [**REST API**](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest) :
|
||||
|
||||
```
|
||||
$ curl -X POST --data '{ "user_id": "foo", "account_url": "bar", \
|
||||
@@ -120,7 +120,7 @@ $ curl -X POST --data '{ "user_id": "foo", "account_url": "bar", \
|
||||
https://mint.com/api/v1/account
|
||||
```
|
||||
|
||||
For internal communications, we could use [Remote Procedure Calls](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc).
|
||||
For internal communications, we could use [Remote Procedure Calls](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc) .
|
||||
|
||||
Next, the service extracts transactions from the account.
|
||||
|
||||
@@ -136,8 +136,8 @@ $ curl -X POST --data '{ "user_id": "foo", "account_url": "bar", \
|
||||
|
||||
* The **Client** sends a request to the **Web Server**
|
||||
* The **Web Server** forwards the request to the **Accounts API** server
|
||||
* The **Accounts API** server places a job on a **Queue** such as [Amazon SQS](https://aws.amazon.com/sqs/) or [RabbitMQ](https://www.rabbitmq.com/)
|
||||
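* Extracting transactions could take a while, so we'd probably want to do this [asynchronously with a queue](https://github.com/donnemartin/system-design-primer#asynchronism), although this introduces additional complexity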
|
||||
* The **Accounts API** server places a job on a **Queue** such as [Amazon SQS](https://aws.amazon.com/sqs/) or [RabbitMQ](https://www.rabbitmq.com/)
|
||||
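* Extracting transactions could take a while, so we'd probably want to do this [asynchronously with a queue](https://github.com/donnemartin/system-design-primer#asynchronism) , although this introduces additional complexity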
|
||||
* The **Transaction Extraction Service** does the following:
|
||||
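* Pulls from the **Queue** and extracts transactions for the given user from the financial institution, storing the results as raw log files in the **Object Store**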
|
||||
* Uses the **Category Service** to categorize each transaction
|
||||
@@ -156,25 +156,25 @@ created_at datetime NOT NULL
|
||||
seller varchar(32) NOT NULL
|
||||
amount decimal NOT NULL
|
||||
user_id int NOT NULL
|
||||
PRIMARY KEY(id)
|
||||
FOREIGN KEY(user_id) REFERENCES users(id)
|
||||
PRIMARY KEY(id)
|
||||
FOREIGN KEY(user_id) REFERENCES users(id)
|
||||
```
|
||||
|
||||
We'll create an [index](https://github.com/donnemartin/system-design-primer#use-good-indices) on `id`, `user_id`, and `created_at`.
|
||||
We'll create an [index](https://github.com/donnemartin/system-design-primer#use-good-indices) on `id`, `user_id`, and `created_at` .
|
||||
|
||||
The `monthly_spending` table could have the following structure:
|
||||
|
||||
```
|
||||
id int NOT NULL AUTO_INCREMENT
|
||||
month_year date NOT NULL
|
||||
category varchar(32)
|
||||
category varchar(32)
|
||||
amount decimal NOT NULL
|
||||
user_id int NOT NULL
|
||||
PRIMARY KEY(id)
|
||||
FOREIGN KEY(user_id) REFERENCES users(id)
|
||||
PRIMARY KEY(id)
|
||||
FOREIGN KEY(user_id) REFERENCES users(id)
|
||||
```
|
||||
|
||||
We'll create an [index](https://github.com/donnemartin/system-design-primer#use-good-indices) on `id` and `user_id`.
|
||||
We'll create an [index](https://github.com/donnemartin/system-design-primer#use-good-indices) on `id` and `user_id` .
|
||||
|
||||
#### Category service
|
||||
|
||||
@@ -183,7 +183,7 @@ FOREIGN KEY(user_id) REFERENCES users(id)
|
||||
**Clarify with your interviewer how much code you are expected to write**.
|
||||
|
||||
```python
|
||||
class DefaultCategories(Enum):
|
||||
class DefaultCategories(Enum) :
|
||||
|
||||
HOUSING = 0
|
||||
FOOD = 1
|
||||
@@ -200,19 +200,19 @@ seller_category_map['Target'] = DefaultCategories.SHOPPING
|
||||
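For sellers not initially seeded in the map, we could use a crowdsourcing effort by evaluating the manual category overrides our users provide. We could use a heap to quickly look up the top manual override per seller in O(1) time.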
|
||||
|
||||
```python
|
||||
class Categorizer(object):
|
||||
class Categorizer(object) :
|
||||
|
||||
def __init__(self, seller_category_map, seller_category_crowd_overrides_map):
|
||||
def __init__(self, seller_category_map, seller_category_crowd_overrides_map) :
|
||||
self.seller_category_map = seller_category_map
|
||||
self.seller_category_crowd_overrides_map = \
|
||||
seller_category_crowd_overrides_map
|
||||
|
||||
def categorize(self, transaction):
|
||||
def categorize(self, transaction) :
|
||||
if transaction.seller in self.seller_category_map:
|
||||
return self.seller_category_map[transaction.seller]
|
||||
elif transaction.seller in self.seller_category_crowd_overrides_map:
|
||||
self.seller_category_map[transaction.seller] = \
|
||||
self.seller_category_crowd_overrides_map[transaction.seller].peek_min()
|
||||
self.seller_category_crowd_overrides_map[transaction.seller].peek_min()
|
||||
return self.seller_category_map[transaction.seller]
|
||||
return None
|
||||
```
|
||||
@@ -220,9 +220,9 @@ class Categorizer(object):
|
||||
Transaction implementation:
|
||||
|
||||
```python
|
||||
class Transaction(object):
|
||||
class Transaction(object) :
|
||||
|
||||
def __init__(self, created_at, seller, amount):
|
||||
def __init__(self, created_at, seller, amount) :
|
||||
self.created_at = created_at
|
||||
self.seller = seller
|
||||
self.amount = amount
|
||||
@@ -234,13 +234,13 @@ class Transaction(object):
|
||||
we could store the override in `TABLE budget_overrides`.
|
||||
|
||||
```python
|
||||
class Budget(object):
|
||||
class Budget(object) :
|
||||
|
||||
def __init__(self, income):
|
||||
def __init__(self, income) :
|
||||
self.income = income
|
||||
self.categories_to_budget_map = self.create_budget_template()
|
||||
self.categories_to_budget_map = self.create_budget_template()
|
||||
|
||||
def create_budget_template(self):
|
||||
def create_budget_template(self) :
|
||||
return {
|
||||
DefaultCategories.HOUSING: self.income * .4,
|
||||
DefaultCategories.FOOD: self.income * .2,
|
||||
@@ -249,7 +249,7 @@ class Budget(object):
|
||||
...
|
||||
}
|
||||
|
||||
def override_category_budget(self, category, amount):
|
||||
def override_category_budget(self, category, amount) :
|
||||
self.categories_to_budget_map[category] = amount
|
||||
```
|
||||
|
||||
@@ -275,26 +275,26 @@ user_id timestamp seller amount
|
||||
**MapReduce** implementation:
|
||||
|
||||
```python
|
||||
class SpendingByCategory(MRJob):
|
||||
class SpendingByCategory(MRJob) :
|
||||
|
||||
def __init__(self, categorizer):
|
||||
def __init__(self, categorizer) :
|
||||
self.categorizer = categorizer
|
||||
self.current_year_month = self.calc_current_year_month()
|
||||
self.current_year_month = self.calc_current_year_month()
|
||||
...
|
||||
|
||||
def calc_current_year_month(self):
|
||||
def calc_current_year_month(self) :
|
||||
"""Return the current year and month."""
|
||||
...
|
||||
|
||||
def extract_year_month(self, timestamp):
|
||||
def extract_year_month(self, timestamp) :
|
||||
"""Return the year and month portions of the timestamp."""
|
||||
...
|
||||
|
||||
def handle_budget_notifications(self, key, total):
|
||||
def handle_budget_notifications(self, key, total) :
|
||||
"""Call notification API if nearing or exceeded budget."""
|
||||
...
|
||||
|
||||
def mapper(self, _, line):
|
||||
def mapper(self, _, line) :
|
||||
"""Parse each log line, extract and transform relevant lines.
|
||||
|
||||
Argument line will be of the form:
|
||||
@@ -303,31 +303,31 @@ class SpendingByCategory(MRJob):
|
||||
|
||||
Using the categorizer to convert seller to category, emit key value pairs of the form:
|
||||
|
||||
(user_id, 2016-01, shopping), 25
|
||||
(user_id, 2016-01, shopping), 100
|
||||
(user_id, 2016-01, gas), 50
|
||||
(user_id, 2016-01, shopping) , 25
|
||||
(user_id, 2016-01, shopping) , 100
|
||||
(user_id, 2016-01, gas) , 50
|
||||
"""
|
||||
user_id, timestamp, seller, amount = line.split('\t')
|
||||
category = self.categorizer.categorize(Transaction(timestamp, seller, amount))
|
||||
period = self.extract_year_month(timestamp)
|
||||
user_id, timestamp, seller, amount = line.split('\t')
|
||||
category = self.categorizer.categorize(Transaction(timestamp, seller, amount))
|
||||
period = self.extract_year_month(timestamp)
|
||||
if period == self.current_year_month:
|
||||
yield (user_id, period, category), amount
|
||||
yield (user_id, period, category) , amount
|
||||
|
||||
def reducer(self, key, values):
|
||||
def reducer(self, key, values) :
|
||||
"""Sum values for each key.
|
||||
|
||||
(user_id, 2016-01, shopping), 125
|
||||
(user_id, 2016-01, gas), 50
|
||||
(user_id, 2016-01, shopping) , 125
|
||||
(user_id, 2016-01, gas) , 50
|
||||
"""
|
||||
total = sum(values)
|
||||
yield key, total
|
||||
total = sum(values)
|
||||
yield key, total
|
||||
```
|
||||
|
||||
## Step 4: Scale the design
|
||||
|
||||
> Identify and address bottlenecks, given the constraints.
|
||||
|
||||

|
||||

|
||||
|
||||
**Important: Do not simply jump right into the final design from the initial design!**
|
||||
|
||||
@@ -337,20 +337,20 @@ class SpendingByCategory(MRJob):
|
||||
|
||||
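We'll introduce some components to complete the design and to address scalability issues. Internal load balancers are not shown to reduce clutter.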
|
||||
|
||||
**To avoid repeating discussions**, refer to the relevant sections of the [system design topics](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#系统设计主题的索引) for main talking points, tradeoffs, and alternatives.
|
||||
**To avoid repeating discussions**, refer to the relevant sections of the [system design topics](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#系统设计主题的索引) for main talking points, tradeoffs, and alternatives.
|
||||
|
||||
* [DNS](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#域名系统)
|
||||
* [Load balancer](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#负载均衡器)
|
||||
* [Horizontal scaling](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#水平扩展)
|
||||
* [Reverse proxy (web server)](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#反向代理web-服务器)
|
||||
* [API server (application layer)](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#应用层)
|
||||
* [Cache](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#缓存)
|
||||
* [Relational database management system (RDBMS)](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#关系型数据库管理系统rdbms)
|
||||
* [SQL write master-slave failover](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#故障切换)
|
||||
* [Master-slave replication](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#主从复制)
|
||||
* [Asynchronism](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#异步)
|
||||
* [Consistency patterns](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#一致性模式)
|
||||
* [Availability patterns](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#可用性模式)
|
||||
* [DNS](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#域名系统)
|
||||
* [Load balancer](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#负载均衡器)
|
||||
* [Horizontal scaling](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#水平扩展)
|
||||
* [Reverse proxy (web server)](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#反向代理web-服务器)
|
||||
* [API server (application layer)](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#应用层)
|
||||
* [Cache](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#缓存)
|
||||
* [Relational database management system (RDBMS) ](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#关系型数据库管理系统rdbms)
|
||||
* [SQL write master-slave failover](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#故障切换)
|
||||
* [Master-slave replication](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#主从复制)
|
||||
* [Asynchronism](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#异步)
|
||||
* [Consistency patterns](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#一致性模式)
|
||||
* [Availability patterns](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#可用性模式)
|
||||
|
||||
We'll add an additional use case: **User** accesses summaries and transactions.
|
||||
|
||||
@@ -366,7 +366,7 @@ class SpendingByCategory(MRJob):
|
||||
* If the url is in the **SQL Database**, fetches the contents
|
||||
* Updates the **Memory Cache** with the contents
|
||||
|
||||
Refer to [When to update the cache](https://github.com/donnemartin/system-design-primer#when-to-update-the-cache) for tradeoffs and alternatives. The approach above describes [cache-aside](https://github.com/donnemartin/system-design-primer#cache-aside).
|
||||
Refer to [When to update the cache](https://github.com/donnemartin/system-design-primer#when-to-update-the-cache) for tradeoffs and alternatives. The approach above describes [cache-aside](https://github.com/donnemartin/system-design-primer#cache-aside) .
|
||||
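The cache-aside read path described above can be sketched in a few lines. This is only an illustration: `CacheAsideReader` is a hypothetical name, and plain dictionaries stand in for the **Memory Cache** and the **SQL Database**.

```python
class CacheAsideReader:
    """Illustrative cache-aside reader; the class and key format are assumptions."""

    def __init__(self, cache, db):
        self.cache = cache  # stand-in for the Memory Cache
        self.db = db        # stand-in for the SQL Database

    def get(self, key):
        # 1. Try the cache first.
        if key in self.cache:
            return self.cache[key]
        # 2. On a miss, fall back to the database.
        value = self.db.get(key)
        # 3. Populate the cache so the next read is served from memory.
        if value is not None:
            self.cache[key] = value
        return value


cache = {}
db = {"user:1:2016-01": {"shopping": 125, "gas": 50}}
reader = CacheAsideReader(cache, db)
first = reader.get("user:1:2016-01")   # miss: read from db, then cached
second = reader.get("user:1:2016-01")  # hit: served from the cache
```

A real deployment would also need an expiry or invalidation policy, which this sketch leaves out.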
|
||||
Instead of keeping the `monthly_spending` aggregate table in the **SQL Database**, we could use a data warehousing solution such as Amazon Redshift or Google BigQuery.
|
||||
|
||||
@@ -376,10 +376,10 @@ class SpendingByCategory(MRJob):
|
||||
|
||||
200 *average* transaction writes per second (higher at peak) might be tough for a single **SQL Write Master-Slave**. We might need to employ additional SQL scaling patterns:
|
||||
|
||||
* [Federation](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#联合)
|
||||
* [Sharding](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#分片)
|
||||
* [Denormalization](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#非规范化)
|
||||
* [SQL tuning](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#sql-调优)
|
||||
* [Federation](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#联合)
|
||||
* [Sharding](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#分片)
|
||||
* [Denormalization](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#非规范化)
|
||||
* [SQL tuning](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#sql-调优)
|
||||
|
||||
We should also consider moving some data to a **NoSQL Database**.
|
||||
|
||||
@@ -389,50 +389,50 @@ class SpendingByCategory(MRJob):
|
||||
|
||||
#### NoSQL
|
||||
|
||||
* [Key-value store](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#键-值存储)
|
||||
* [Document store](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#文档类型存储)
|
||||
* [Wide column store](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#列型存储)
|
||||
* [Graph database](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#图数据库)
|
||||
* [SQL vs NoSQL](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#sql-还是-nosql)
|
||||
* [Key-value store](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#键-值存储)
|
||||
* [Document store](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#文档类型存储)
|
||||
* [Wide column store](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#列型存储)
|
||||
* [Graph database](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#图数据库)
|
||||
* [SQL vs NoSQL](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#sql-还是-nosql)
|
||||
|
||||
### Caching
|
||||
|
||||
* Where to cache
|
||||
* [Client caching](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#客户端缓存)
|
||||
* [CDN caching](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#cdn-缓存)
|
||||
* [Web server caching](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#web-服务器缓存)
|
||||
* [Database caching](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#数据库缓存)
|
||||
* [Application caching](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#应用缓存)
|
||||
* [Client caching](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#客户端缓存)
|
||||
* [CDN caching](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#cdn-缓存)
|
||||
* [Web server caching](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#web-服务器缓存)
|
||||
* [Database caching](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#数据库缓存)
|
||||
* [Application caching](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#应用缓存)
|
||||
* What to cache
|
||||
* [Caching at the database query level](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#数据库查询级别的缓存)
|
||||
* [Caching at the object level](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#对象级别的缓存)
|
||||
* [Caching at the database query level](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#数据库查询级别的缓存)
|
||||
* [Caching at the object level](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#对象级别的缓存)
|
||||
* When to update the cache
|
||||
* [Cache-aside](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#缓存模式)
|
||||
* [Write-through](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#直写模式)
|
||||
* [Write-behind (write-back)](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#回写模式)
|
||||
* [Refresh ahead](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#刷新)
|
||||
* [Cache-aside](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#缓存模式)
|
||||
* [Write-through](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#直写模式)
|
||||
* [Write-behind (write-back)](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#回写模式)
|
||||
* [Refresh ahead](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#刷新)
|
||||
|
||||
### Asynchronism and microservices
|
||||
|
||||
* [Message queues](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#消息队列)
|
||||
* [Task queues](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#任务队列)
|
||||
* [Back pressure](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#背压)
|
||||
* [Microservices](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#微服务)
|
||||
* [Message queues](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#消息队列)
|
||||
* [Task queues](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#任务队列)
|
||||
* [Back pressure](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#背压)
|
||||
* [Microservices](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#微服务)
|
||||
|
||||
### Communications
|
||||
|
||||
* Discuss tradeoffs:
|
||||
* External communication with clients - [HTTP APIs following REST](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#表述性状态转移rest)
|
||||
* Internal communications - [RPC](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#远程过程调用协议rpc)
|
||||
* [Service discovery](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#服务发现)
|
||||
* External communication with clients - [HTTP APIs following REST](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#表述性状态转移rest)
|
||||
* Internal communications - [RPC](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#远程过程调用协议rpc)
|
||||
* [Service discovery](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#服务发现)
|
||||
|
||||
### Security
|
||||
|
||||
Refer to the [security section](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#安全).
|
||||
Refer to the [security section](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#安全) .
|
||||
|
||||
### Latency numbers
|
||||
|
||||
See [Latency numbers every programmer should know](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#每个程序员都应该知道的延迟数).
|
||||
See [Latency numbers every programmer should know](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#每个程序员都应该知道的延迟数) .
|
||||
|
||||
### Ongoing
|
||||
|
||||
|
@@ -80,7 +80,7 @@ Handy conversion guide:
|
||||
|
||||
> Outline a high level design with all important components.
|
||||
|
||||

|
||||

|
||||
|
||||
## Step 3: Design core components
|
||||
|
||||
@@ -88,9 +88,9 @@ Handy conversion guide:
|
||||
|
||||
### Use case: User connects to a financial account
|
||||
|
||||
We could store info on the 10 million users in a [relational database](https://github.com/donnemartin/system-design-primer#relational-database-management-system-rdbms). We should discuss the [use cases and tradeoffs between choosing SQL or NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql).
|
||||
We could store info on the 10 million users in a [relational database](https://github.com/donnemartin/system-design-primer#relational-database-management-system-rdbms) . We should discuss the [use cases and tradeoffs between choosing SQL or NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql) .
|
||||
|
||||
* The **Client** sends a request to the **Web Server**, running as a [reverse proxy](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
|
||||
* The **Client** sends a request to the **Web Server**, running as a [reverse proxy](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
|
||||
* The **Web Server** forwards the request to the **Accounts API** server
|
||||
* The **Accounts API** server updates the **SQL Database** `accounts` table with the newly entered account info
|
||||
|
||||
@@ -106,13 +106,13 @@ account_url varchar(255) NOT NULL
|
||||
account_login varchar(32) NOT NULL
|
||||
account_password_hash char(64) NOT NULL
|
||||
user_id int NOT NULL
|
||||
PRIMARY KEY(id)
|
||||
FOREIGN KEY(user_id) REFERENCES users(id)
|
||||
PRIMARY KEY(id)
|
||||
FOREIGN KEY(user_id) REFERENCES users(id)
|
||||
```
|
||||
|
||||
We'll create an [index](https://github.com/donnemartin/system-design-primer#use-good-indices) on `id`, `user_id`, and `created_at` to speed up lookups (log-time instead of scanning the entire table) and to keep the data in memory. Reading 1 MB sequentially from memory takes about 250 microseconds, while reading from SSD takes 4x and from disk takes 80x longer.<sup><a href=https://github.com/donnemartin/system-design-primer#latency-numbers-every-programmer-should-know>1</a></sup>
|
||||
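As a rough illustration of the indexing step, the sketch below builds the `accounts` table in an in-memory SQLite database and asks the planner how it would run a lookup by `user_id`. SQLite, the column types, the `created_at` column definition, and the index names are assumptions made for the example; the design itself does not prescribe a particular RDBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY);
CREATE TABLE accounts (
    id INTEGER PRIMARY KEY AUTOINCREMENT,  -- the primary key is indexed implicitly
    created_at TEXT NOT NULL,
    account_url VARCHAR(255) NOT NULL,
    account_login VARCHAR(32) NOT NULL,
    account_password_hash CHAR(64) NOT NULL,
    user_id INTEGER NOT NULL,
    FOREIGN KEY(user_id) REFERENCES users(id)
);
-- Hypothetical index names for the two secondary lookups.
CREATE INDEX idx_accounts_user_id ON accounts(user_id);
CREATE INDEX idx_accounts_created_at ON accounts(created_at);
""")

# Ask SQLite how it plans to execute a lookup by user_id.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM accounts WHERE user_id = ?", (42,)
).fetchall()
plan_details = [row[3] for row in plan]  # fourth column holds the plan text
```

The plan text should mention `idx_accounts_user_id`, meaning the lookup is an index search rather than a full table scan.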
|
||||
We'll use a public [**REST API**](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest):
|
||||
We'll use a public [**REST API**](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest) :
|
||||
|
||||
```
|
||||
$ curl -X POST --data '{ "user_id": "foo", "account_url": "bar", \
|
||||
@@ -120,7 +120,7 @@ $ curl -X POST --data '{ "user_id": "foo", "account_url": "bar", \
|
||||
https://mint.com/api/v1/account
|
||||
```
|
||||
|
||||
For internal communications, we could use [Remote Procedure Calls](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc).
|
||||
For internal communications, we could use [Remote Procedure Calls](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc) .
|
||||
|
||||
Next, the service extracts transactions from the account.
|
||||
|
||||
@@ -136,8 +136,8 @@ Data flow:
|
||||
|
||||
* The **Client** sends a request to the **Web Server**
|
||||
* The **Web Server** forwards the request to the **Accounts API** server
|
||||
* The **Accounts API** server places a job on a **Queue** such as [Amazon SQS](https://aws.amazon.com/sqs/) or [RabbitMQ](https://www.rabbitmq.com/)
|
||||
* Extracting transactions could take a while, so we'd probably want to do this [asynchronously with a queue](https://github.com/donnemartin/system-design-primer#asynchronism), although this introduces additional complexity
|
||||
* The **Accounts API** server places a job on a **Queue** such as [Amazon SQS](https://aws.amazon.com/sqs/) or [RabbitMQ](https://www.rabbitmq.com/)
|
||||
* Extracting transactions could take a while, so we'd probably want to do this [asynchronously with a queue](https://github.com/donnemartin/system-design-primer#asynchronism) , although this introduces additional complexity
|
||||
* The **Transaction Extraction Service** does the following:
|
||||
* Pulls from the **Queue** and extracts transactions for the given account from the financial institution, storing the results as raw log files in the **Object Store**
|
||||
* Uses the **Category Service** to categorize each transaction
|
||||
@@ -156,8 +156,8 @@ created_at datetime NOT NULL
|
||||
seller varchar(32) NOT NULL
|
||||
amount decimal NOT NULL
|
||||
user_id int NOT NULL
|
||||
PRIMARY KEY(id)
|
||||
FOREIGN KEY(user_id) REFERENCES users(id)
|
||||
PRIMARY KEY(id)
|
||||
FOREIGN KEY(user_id) REFERENCES users(id)
|
||||
```
|
||||
|
||||
We'll create an [index](https://github.com/donnemartin/system-design-primer#use-good-indices) on `id`, `user_id`, and `created_at`.
|
||||
@@ -167,11 +167,11 @@ The `monthly_spending` table could have the following structure:
|
||||
```
|
||||
id int NOT NULL AUTO_INCREMENT
|
||||
month_year date NOT NULL
|
||||
category varchar(32)
|
||||
category varchar(32)
|
||||
amount decimal NOT NULL
|
||||
user_id int NOT NULL
|
||||
PRIMARY KEY(id)
|
||||
FOREIGN KEY(user_id) REFERENCES users(id)
|
||||
PRIMARY KEY(id)
|
||||
FOREIGN KEY(user_id) REFERENCES users(id)
|
||||
```
|
||||
|
||||
We'll create an [index](https://github.com/donnemartin/system-design-primer#use-good-indices) on `id` and `user_id`.
|
||||
@@ -183,7 +183,7 @@ For the **Category Service**, we can seed a seller-to-category dictionary with t
|
||||
**Clarify with your interviewer how much code you are expected to write**.
|
||||
|
||||
```python
|
||||
class DefaultCategories(Enum):
|
||||
class DefaultCategories(Enum) :
|
||||
|
||||
HOUSING = 0
|
||||
FOOD = 1
|
||||
@@ -200,19 +200,19 @@ seller_category_map['Target'] = DefaultCategories.SHOPPING
|
||||
For sellers not initially seeded in the map, we could use a crowdsourcing effort by evaluating the manual category overrides our users provide. We could use a heap to quickly look up the top manual override per seller in O(1) time.
|
||||
|
||||
```python
|
||||
class Categorizer(object):
|
||||
class Categorizer(object) :
|
||||
|
||||
def __init__(self, seller_category_map, seller_category_crowd_overrides_map):
|
||||
def __init__(self, seller_category_map, seller_category_crowd_overrides_map) :
|
||||
self.seller_category_map = seller_category_map
|
||||
self.seller_category_crowd_overrides_map = \
|
||||
seller_category_crowd_overrides_map
|
||||
|
||||
def categorize(self, transaction):
|
||||
def categorize(self, transaction) :
|
||||
if transaction.seller in self.seller_category_map:
|
||||
return self.seller_category_map[transaction.seller]
|
||||
elif transaction.seller in self.seller_category_crowd_overrides_map:
|
||||
self.seller_category_map[transaction.seller] = \
|
||||
self.seller_category_crowd_overrides_map[transaction.seller].peek_min()
|
||||
self.seller_category_crowd_overrides_map[transaction.seller].peek_min()
|
||||
return self.seller_category_map[transaction.seller]
|
||||
return None
|
||||
```
|
||||
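The `peek_min()` call in the `Categorizer` assumes a heap-like container per seller. One possible shape for it, using Python's `heapq`, is sketched below. `CrowdOverrides` and the vote counts are hypothetical. Because `heapq` is a min-heap, the vote count is negated so that the most-voted category sits at the root and can be read in O(1).

```python
import heapq


class CrowdOverrides:
    """Hypothetical per-seller heap of crowd-sourced category votes."""

    def __init__(self, votes_by_category):
        # Negate the counts: the smallest tuple is then the most-voted category.
        self._heap = [(-votes, category)
                      for category, votes in votes_by_category.items()]
        heapq.heapify(self._heap)  # O(n) build

    def peek_min(self):
        """Return the top category without removing it (O(1))."""
        return self._heap[0][1]


overrides = CrowdOverrides({"SHOPPING": 120, "FOOD": 7, "GAS": 1})
top = overrides.peek_min()
```

Changing an existing category's vote count is not covered here; it would need a re-heapify or a lazy-deletion scheme.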
@@ -220,9 +220,9 @@ class Categorizer(object):
|
||||
Transaction implementation:
|
||||
|
||||
```python
|
||||
class Transaction(object):
|
||||
class Transaction(object) :
|
||||
|
||||
def __init__(self, created_at, seller, amount):
|
||||
def __init__(self, created_at, seller, amount) :
|
||||
self.created_at = created_at
|
||||
self.seller = seller
|
||||
self.amount = amount
|
||||
@@ -233,13 +233,13 @@ class Transaction(object):
|
||||
To start, we could use a generic budget template that allocates category amounts based on income tiers. Using this approach, we would not have to store the 100 million budget items identified in the constraints, only those that the user overrides. If a user overrides a budget category, we could store the override in the `TABLE budget_overrides`.
|
||||
|
||||
```python
|
||||
class Budget(object):
|
||||
class Budget(object) :
|
||||
|
||||
def __init__(self, income):
|
||||
def __init__(self, income) :
|
||||
self.income = income
|
||||
self.categories_to_budget_map = self.create_budget_template()
|
||||
self.categories_to_budget_map = self.create_budget_template()
|
||||
|
||||
def create_budget_template(self):
|
||||
def create_budget_template(self) :
|
||||
return {
|
||||
DefaultCategories.HOUSING: self.income * .4,
|
||||
DefaultCategories.FOOD: self.income * .2,
|
||||
@@ -248,7 +248,7 @@ class Budget(object):
|
||||
...
|
||||
}
|
||||
|
||||
def override_category_budget(self, category, amount):
|
||||
def override_category_budget(self, category, amount) :
|
||||
self.categories_to_budget_map[category] = amount
|
||||
```
|
||||
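To make the storage argument concrete, here is a small sketch of how a user's effective budget could be computed on demand from the template plus that user's stored overrides. `effective_budget` and `TEMPLATE_SHARES` are hypothetical names, and only the two template shares visible in the text (housing 40%, food 20%) are included.

```python
from enum import Enum


class DefaultCategories(Enum):
    HOUSING = 0
    FOOD = 1


# Share of income per category; limited to the entries shown in the text.
TEMPLATE_SHARES = {
    DefaultCategories.HOUSING: 0.4,
    DefaultCategories.FOOD: 0.2,
}


def effective_budget(income, overrides):
    """Merge the income-based template with a user's stored overrides."""
    budget = {category: income * share
              for category, share in TEMPLATE_SHARES.items()}
    budget.update(overrides)  # rows loaded from budget_overrides take precedence
    return budget


result = effective_budget(5000, {DefaultCategories.FOOD: 800})
```

Only the override for `FOOD` would need to be persisted; the `HOUSING` amount is derived from income each time.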
|
||||
@@ -274,26 +274,26 @@ user_id timestamp seller amount
|
||||
**MapReduce** implementation:
|
||||
|
||||
```python
|
||||
class SpendingByCategory(MRJob):
|
||||
class SpendingByCategory(MRJob) :
|
||||
|
||||
def __init__(self, categorizer):
|
||||
def __init__(self, categorizer) :
|
||||
self.categorizer = categorizer
|
||||
self.current_year_month = self.calc_current_year_month()
|
||||
self.current_year_month = self.calc_current_year_month()
|
||||
...
|
||||
|
||||
def calc_current_year_month(self):
|
||||
def calc_current_year_month(self) :
|
||||
"""Return the current year and month."""
|
||||
...
|
||||
|
||||
def extract_year_month(self, timestamp):
|
||||
def extract_year_month(self, timestamp) :
|
||||
"""Return the year and month portions of the timestamp."""
|
||||
...
|
||||
|
||||
def handle_budget_notifications(self, key, total):
|
||||
def handle_budget_notifications(self, key, total) :
|
||||
"""Call notification API if nearing or exceeded budget."""
|
||||
...
|
||||
|
||||
def mapper(self, _, line):
|
||||
def mapper(self, _, line) :
|
||||
"""Parse each log line, extract and transform relevant lines.
|
||||
|
||||
Argument line will be of the form:
|
||||
@@ -303,31 +303,31 @@ class SpendingByCategory(MRJob):
|
||||
Using the categorizer to convert seller to category,
|
||||
emit key value pairs of the form:
|
||||
|
||||
(user_id, 2016-01, shopping), 25
|
||||
(user_id, 2016-01, shopping), 100
|
||||
(user_id, 2016-01, gas), 50
|
||||
(user_id, 2016-01, shopping) , 25
|
||||
(user_id, 2016-01, shopping) , 100
|
||||
(user_id, 2016-01, gas) , 50
|
||||
"""
|
||||
user_id, timestamp, seller, amount = line.split('\t')
|
||||
category = self.categorizer.categorize(Transaction(timestamp, seller, amount))
|
||||
period = self.extract_year_month(timestamp)
|
||||
user_id, timestamp, seller, amount = line.split('\t')
|
||||
category = self.categorizer.categorize(Transaction(timestamp, seller, amount))
|
||||
period = self.extract_year_month(timestamp)
|
||||
if period == self.current_year_month:
|
||||
yield (user_id, period, category), amount
|
||||
yield (user_id, period, category) , amount
|
||||
|
||||
def reducer(self, key, values):
|
||||
def reducer(self, key, values) :
|
||||
"""Sum values for each key.
|
||||
|
||||
(user_id, 2016-01, shopping), 125
|
||||
(user_id, 2016-01, gas), 50
|
||||
(user_id, 2016-01, shopping) , 125
|
||||
(user_id, 2016-01, gas) , 50
|
||||
"""
|
||||
total = sum(values)
|
||||
yield key, sum(values)
|
||||
total = sum(values)
|
||||
yield key, sum(values)
|
||||
```
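
The map and reduce steps above can also be tried without a MapReduce cluster. The following is a minimal sketch in plain Python, not the job itself: it assumes tab-separated lines of `user_id`, `timestamp`, `seller`, `amount`, assumes timestamps start with `YYYY-MM`, and uses a hypothetical `SELLER_CATEGORIES` dict in place of the categorizer.

```python
from collections import defaultdict

# Hypothetical seller-to-category lookup standing in for the categorizer.
SELLER_CATEGORIES = {"Target": "shopping", "Amazon": "shopping", "Exxon": "gas"}


def map_line(line, current_year_month):
    """Emit ((user_id, period, category), amount) for lines in the current month."""
    user_id, timestamp, seller, amount = line.rstrip("\n").split("\t")
    period = timestamp[:7]  # assumes timestamps begin with YYYY-MM
    if period == current_year_month:
        category = SELLER_CATEGORIES.get(seller, "unknown")
        yield (user_id, period, category), float(amount)


def reduce_pairs(pairs):
    """Sum the amounts for each key."""
    totals = defaultdict(float)
    for key, amount in pairs:
        totals[key] += amount
    return dict(totals)


def spending_by_category(lines, current_year_month):
    """Run the map step over every line, then reduce the emitted pairs."""
    pairs = (pair for line in lines for pair in map_line(line, current_year_month))
    return reduce_pairs(pairs)
```

Grouping by the full `(user_id, period, category)` key is what lets the reduce step produce one total per user, month, and category.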

## Step 4: Scale the design

> Identify and address bottlenecks, given the constraints.

![Imgur](http://i.imgur.com/V5q57vU.png)

**Important: Do not simply jump right into the final design from the initial design!**

@@ -339,19 +339,19 @@ We'll introduce some components to complete the design and to address scalabilit

*To avoid repeating discussions*, refer to the following [system design topics](https://github.com/donnemartin/system-design-primer#index-of-system-design-topics) for main talking points, tradeoffs, and alternatives:

* [DNS](https://github.com/donnemartin/system-design-primer#domain-name-system)
* [CDN](https://github.com/donnemartin/system-design-primer#content-delivery-network)
* [Load balancer](https://github.com/donnemartin/system-design-primer#load-balancer)
* [Horizontal scaling](https://github.com/donnemartin/system-design-primer#horizontal-scaling)
* [Web server (reverse proxy)](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
* [API server (application layer)](https://github.com/donnemartin/system-design-primer#application-layer)
* [Cache](https://github.com/donnemartin/system-design-primer#cache)
* [Relational database management system (RDBMS)](https://github.com/donnemartin/system-design-primer#relational-database-management-system-rdbms)
* [SQL write master-slave failover](https://github.com/donnemartin/system-design-primer#fail-over)
* [Master-slave replication](https://github.com/donnemartin/system-design-primer#master-slave-replication)
* [Asynchronism](https://github.com/donnemartin/system-design-primer#asynchronism)
* [Consistency patterns](https://github.com/donnemartin/system-design-primer#consistency-patterns)
* [Availability patterns](https://github.com/donnemartin/system-design-primer#availability-patterns)

We'll add an additional use case: **User** accesses summaries and transactions.

@@ -367,20 +367,20 @@ User sessions, aggregate stats by category, and recent transactions could be pla
* If the url is in the **SQL Database**, fetches the contents
* Updates the **Memory Cache** with the contents

Refer to [When to update the cache](https://github.com/donnemartin/system-design-primer#when-to-update-the-cache) for tradeoffs and alternatives. The approach above describes [cache-aside](https://github.com/donnemartin/system-design-primer#cache-aside).
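
The cache-aside read path described above can be sketched as follows. This is an illustration under assumptions, not the design's actual code: `CacheAsideReader` is a hypothetical name, and plain dicts stand in for the **Memory Cache** and the **SQL Database**.

```python
class CacheAsideReader:
    """Read through a cache, falling back to the database on a miss."""

    def __init__(self, cache, database):
        # Both arguments are dict-like stand-ins (assumptions for this sketch).
        self.cache = cache
        self.database = database

    def get(self, key):
        # 1. Check the cache first.
        value = self.cache.get(key)
        if value is not None:
            return value
        # 2. On a miss, fetch from the database.
        value = self.database.get(key)
        # 3. Populate the cache so the next read is served from memory.
        if value is not None:
            self.cache[key] = value
        return value
```

Only data that is actually requested ends up in the cache, which suits sessions, per-category aggregates, and recent transactions that are read far more often than they change.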

Instead of keeping the `monthly_spending` aggregate table in the **SQL Database**, we could create a separate **Analytics Database** using a data warehousing solution such as Amazon Redshift or Google BigQuery.

We might only want to store a month of `transactions` data in the database, while storing the rest in a data warehouse or in an **Object Store**. An **Object Store** such as Amazon S3 can comfortably handle the constraint of 250 GB of new content per month.

To address the 200 *average* read requests per second (higher at peak), traffic for popular content should be handled by the **Memory Cache** instead of the database. The **Memory Cache** is also useful for handling the unevenly distributed traffic and traffic spikes. The **SQL Read Replicas** should be able to handle the cache misses, as long as the replicas are not bogged down with replicating writes.

2,000 *average* transaction writes per second (higher at peak) might be tough for a single **SQL Write Master-Slave**. We might need to employ additional SQL scaling patterns:

* [Federation](https://github.com/donnemartin/system-design-primer#federation)
* [Sharding](https://github.com/donnemartin/system-design-primer#sharding)
* [Denormalization](https://github.com/donnemartin/system-design-primer#denormalization)
* [SQL Tuning](https://github.com/donnemartin/system-design-primer#sql-tuning)

We should also consider moving some data to a **NoSQL Database**.

@@ -390,50 +390,50 @@ We should also consider moving some data to a **NoSQL Database**.

#### NoSQL

* [Key-value store](https://github.com/donnemartin/system-design-primer#key-value-store)
* [Document store](https://github.com/donnemartin/system-design-primer#document-store)
* [Wide column store](https://github.com/donnemartin/system-design-primer#wide-column-store)
* [Graph database](https://github.com/donnemartin/system-design-primer#graph-database)
* [SQL vs NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql)

### Caching

* Where to cache
    * [Client caching](https://github.com/donnemartin/system-design-primer#client-caching)
    * [CDN caching](https://github.com/donnemartin/system-design-primer#cdn-caching)
    * [Web server caching](https://github.com/donnemartin/system-design-primer#web-server-caching)
    * [Database caching](https://github.com/donnemartin/system-design-primer#database-caching)
    * [Application caching](https://github.com/donnemartin/system-design-primer#application-caching)
* What to cache
    * [Caching at the database query level](https://github.com/donnemartin/system-design-primer#caching-at-the-database-query-level)
    * [Caching at the object level](https://github.com/donnemartin/system-design-primer#caching-at-the-object-level)
* When to update the cache
    * [Cache-aside](https://github.com/donnemartin/system-design-primer#cache-aside)
    * [Write-through](https://github.com/donnemartin/system-design-primer#write-through)
    * [Write-behind (write-back)](https://github.com/donnemartin/system-design-primer#write-behind-write-back)
    * [Refresh ahead](https://github.com/donnemartin/system-design-primer#refresh-ahead)

### Asynchronism and microservices

* [Message queues](https://github.com/donnemartin/system-design-primer#message-queues)
* [Task queues](https://github.com/donnemartin/system-design-primer#task-queues)
* [Back pressure](https://github.com/donnemartin/system-design-primer#back-pressure)
* [Microservices](https://github.com/donnemartin/system-design-primer#microservices)

### Communications

* Discuss tradeoffs:
    * External communication with clients - [HTTP APIs following REST](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest)
    * Internal communications - [RPC](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc)
* [Service discovery](https://github.com/donnemartin/system-design-primer#service-discovery)

### Security

Refer to the [security section](https://github.com/donnemartin/system-design-primer#security).

### Latency numbers

See [Latency numbers every programmer should know](https://github.com/donnemartin/system-design-primer#latency-numbers-every-programmer-should-know).

### Ongoing

@@ -3,55 +3,55 @@
from mrjob.job import MRJob
from mrjob.step import MRStep


class SpendingByCategory(MRJob):

    def __init__(self, categorizer):
        self.categorizer = categorizer
        ...

    def current_year_month(self):
        """Return the current year and month."""
        ...

    def extract_year_month(self, timestamp):
        """Return the year and month portions of the timestamp."""
        ...

    def handle_budget_notifications(self, key, total):
        """Call notification API if nearing or exceeded budget."""
        ...

    def mapper(self, _, line):
        """Parse each log line, extract and transform relevant lines.

        Emit key value pairs of the form:

        (2016-01, shopping), 25
        (2016-01, shopping), 100
        (2016-01, gas), 50
        """
        timestamp, category, amount = line.split('\t')
        period = self.extract_year_month(timestamp)
        if period == self.current_year_month():
            # amount is parsed as text; convert it so the reducer can sum it
            yield (period, category), float(amount)

    def reducer(self, key, values):
        """Sum values for each key.

        (2016-01, shopping), 125
        (2016-01, gas), 50
        """
        # values is a one-shot iterator, so sum it once and reuse the total
        total = sum(values)
        self.handle_budget_notifications(key, total)
        yield key, total

    def steps(self):
        """Run the map and reduce steps."""
        # MRStep replaces the self.mr() helper, which newer mrjob releases removed
        return [
            MRStep(mapper=self.mapper,
                   reducer=self.reducer)
        ]


if __name__ == '__main__':
    SpendingByCategory.run()
@@ -3,7 +3,7 @@
from enum import Enum


class DefaultCategories(Enum):

    HOUSING = 0
    FOOD = 1
@@ -17,34 +17,34 @@ seller_category_map['Exxon'] = DefaultCategories.GAS
seller_category_map['Target'] = DefaultCategories.SHOPPING


class Categorizer(object):

    def __init__(self, seller_category_map, seller_category_overrides_map):
        self.seller_category_map = seller_category_map
        self.seller_category_overrides_map = seller_category_overrides_map

    def categorize(self, transaction):
        if transaction.seller in self.seller_category_map:
            return self.seller_category_map[transaction.seller]
        if transaction.seller in self.seller_category_overrides_map:
            # Promote the preferred manual override into the main map
            self.seller_category_map[transaction.seller] = \
                self.seller_category_overrides_map[transaction.seller].peek_min()
            return self.seller_category_map[transaction.seller]
        return None


class Transaction(object):

    def __init__(self, timestamp, seller, amount):
        self.timestamp = timestamp
        self.seller = seller
        self.amount = amount


class Budget(object):

    def __init__(self, template_categories_to_budget_map):
        self.categories_to_budget_map = template_categories_to_budget_map

    def override_category_budget(self, category, amount):
        self.categories_to_budget_map[category] = amount
@@ -1,6 +1,6 @@
# Design Pastebin.com (or Bit.ly)

**Note: To avoid duplication, this document links directly to the relevant areas of the [system design topics](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#系统设计主题的索引). Refer to the linked content for general talking points, tradeoffs, and alternatives.**

**Design Bit.ly** - is a similar question, except pastebin requires storing the paste contents instead of the original unshortened url.

@@ -61,7 +61,7 @@
    * `paste_path` - 255 bytes
    * total = ~1.27 KB
* 12.7 GB of new paste content per month
    * 1.27 KB per paste * 10 million pastes per month
    * ~450 GB of new paste content in 3 years
    * 360 million shortlinks in 3 years
    * Assume most are new pastes instead of updates to existing ones
@@ -79,7 +79,7 @@

> Outline a high level design with all important components.

![Imgur](http://i.imgur.com/BKsBnmG.png)

## Step 3: Design core components

@@ -87,13 +87,13 @@

### Use case: User enters a block of text and gets a randomly generated link

We could use a [relational database](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#关系型数据库管理系统rdbms) as a large hash table, mapping the generated url to a file server and path containing the paste file.

Instead of managing a file server, we could use a managed **Object Store** such as Amazon S3 or a [NoSQL document store](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#文档类型存储).

As an alternative to a relational database acting as a large hash table, we could use a [NoSQL key-value store](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#键-值存储). We should discuss the [tradeoffs between choosing SQL or NoSQL](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#sql-还是-nosql). The following discussion uses the relational database approach.

* The **Client** sends a create paste request to the **Web Server**, running as a [reverse proxy](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#反向代理web-服务器)
* The **Web Server** forwards the request to the **Write API** server
* The **Write API** server does the following:
    * Generates a unique url
@@ -113,10 +113,10 @@ shortlink char(7) NOT NULL
expiration_length_in_minutes int NOT NULL
created_at datetime NOT NULL
paste_path varchar(255) NOT NULL
PRIMARY KEY(shortlink)
```

We'll create an [index](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#使用正确的索引) on `shortlink` and `created_at` to speed up lookups (avoiding the long queries caused by full table scans) and to keep the data in memory. Reading 1 MB sequentially from memory takes about 250 microseconds, while reading from SSD takes 4x and from disk takes 80x longer.<sup><a href=https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#每个程序员都应该知道的延迟数>1</a></sup>

To generate the unique url, we could:

@@ -128,15 +128,15 @@ PRIMARY KEY(shortlink)
    * Base 62 encodes to `[a-zA-Z0-9]`, which works well for urls
    * There is only one hash result for the original input, and Base 62 is deterministic (no randomness involved)
    * Base 64 is another popular encoding but causes issues for urls because of the additional `+` and `/` characters
    * The following [Base 62 pseudocode](http://stackoverflow.com/questions/742013/how-to-code-a-url-shortener) runs in O(k) time where k is the number of digits = 7:

```python
def base_encode(num, base=62):
    """Return the digits of num in the given base, most significant first."""
    digits = []
    while num > 0:
        num, remainder = divmod(num, base)
        digits.append(remainder)
    digits.reverse()
    return digits
```

@@ -146,7 +146,7 @@ def base_encode(num, base=62):
url = base_encode(md5(ip_address+timestamp))[:URL_LENGTH]
```
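
The hash-then-encode step above can be made concrete with a short runnable sketch. The names `BASE62_ALPHABET`, `base62_encode`, and `generate_shortlink` are hypothetical, the digit ordering of the alphabet is an assumption (any fixed ordering of `[a-zA-Z0-9]` would work), and the inputs are assumed to be strings.

```python
import hashlib

# Assumed digit ordering; any fixed permutation of [0-9a-zA-Z] would do.
BASE62_ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
URL_LENGTH = 7


def base62_encode(num):
    """Encode a non-negative integer as a Base 62 string."""
    if num == 0:
        return BASE62_ALPHABET[0]
    chars = []
    while num > 0:
        num, remainder = divmod(num, 62)
        chars.append(BASE62_ALPHABET[remainder])
    return "".join(reversed(chars))


def generate_shortlink(ip_address, timestamp):
    """Hash ip_address + timestamp with MD5, Base 62 encode it, keep 7 characters."""
    digest = hashlib.md5((ip_address + timestamp).encode("utf-8")).hexdigest()
    return base62_encode(int(digest, 16))[:URL_LENGTH]
```

Because the encoded hash is truncated to 7 characters, two different inputs can still map to the same shortlink, so a uniqueness check against stored shortlinks remains necessary.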

We'll use a public [**REST API**](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#表述性状态转移rest):

```shell
$ curl -X POST --data '{"expiration_length_in_minutes":"60", "paste_contents":"Hello World!"}' https://pastebin.com/api/v1/paste
@@ -160,7 +160,7 @@ Response:
}
```

For internal communications, we could use [Remote Procedure Calls](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#远程过程调用协议rpc).

### Use case: User enters a paste's url and views the contents

@@ -192,36 +192,36 @@ Response:
Since realtime analytics are not a requirement, we could simply **MapReduce** the **Web Server** logs to generate hit counts.

```python
class HitCounts(MRJob):

    def extract_url(self, line):
        """Extract the generated url from the log line."""
        ...

    def extract_year_month(self, line):
        """Return the year and month portions of the timestamp."""
        ...

    def mapper(self, _, line):
        """Parse each log line, extract and transform relevant lines.

        Emit key value pairs of the form:

        (2016-01, url0), 1
        (2016-01, url0), 1
        (2016-01, url1), 1
        """
        url = self.extract_url(line)
        period = self.extract_year_month(line)
        yield (period, url), 1

    def reducer(self, key, values):
        """Sum values for each key.

        (2016-01, url0), 2
        (2016-01, url1), 1
        """
        yield key, sum(values)
```
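
For small logs, the same counting logic can be checked locally before running a full job. This is a minimal sketch, assuming each log line is a tab-separated `timestamp` and `url` with the timestamp starting in `YYYY-MM` form; the real **Web Server** log format is not specified here, and `count_hits` is a hypothetical helper.

```python
from collections import Counter


def count_hits(log_lines):
    """Count hits per (year-month, url), mirroring the mapper and reducer."""
    counts = Counter()
    for line in log_lines:
        # Assumed format: "<timestamp>\t<url>"
        timestamp, url = line.rstrip("\n").split("\t")
        period = timestamp[:7]  # assumes timestamps begin with YYYY-MM
        counts[(period, url)] += 1
    return dict(counts)
```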

### Use case: Service deletes expired pastes
@@ -233,43 +233,43 @@ class HitCounts(MRJob):

> Identify and address bottlenecks, given the constraints.

![Imgur](http://i.imgur.com/4edXG0T.png)

**Important: Do not simply jump right into the final design from the initial design!**

State that you would iterate as follows: 1) **Benchmark/Load Test**, 2) **Profile** for bottlenecks, 3) address bottlenecks while evaluating alternatives and trade-offs, and 4) repeat. See [Design a system that scales to millions of users on AWS](../scaling_aws/README.md) as a sample of how to iteratively scale the initial design.

It's important to discuss what bottlenecks you might encounter with the initial design and how you might address each of them. For example, what issues are addressed by adding a **Load Balancer** with multiple **Web Servers**? By a **CDN**? By **Master-Slave Replicas**? What are the alternatives, and what are the trade-offs for each?

We'll introduce some components to complete the design and to address scalability issues. Internal load balancers are not shown, to reduce clutter.

**To avoid repeating discussions**, refer to the following [system design topics](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#系统设计主题的索引) for main talking points, tradeoffs, and alternatives:

* [DNS](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#域名系统)
* [CDN](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#内容分发网络cdn)
* [Load balancer](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#负载均衡器)
* [Horizontal scaling](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#水平扩展)
* [Reverse proxy (web server)](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#反向代理web-服务器)
* [Application layer](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#应用层)
* [Cache](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#缓存)
* [Relational database management system (RDBMS)](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#关系型数据库管理系统rdbms)
* [SQL write master-slave failover](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#故障切换)
* [Master-slave replication](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#主从复制)
* [Consistency patterns](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#一致性模式)
* [Availability patterns](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#可用性模式)

The **Analytics Database** could use a data warehousing solution such as Amazon Redshift or Google BigQuery.

An **Object Store** such as Amazon S3 can comfortably handle the constraint of 12.7 GB of new content per month.

To address the 40 *average* read requests per second (higher at peak), traffic for popular content should be handled by the **Memory Cache** instead of the database. The **Memory Cache** is also useful for handling unevenly distributed traffic and traffic spikes. The **SQL Read Replicas** should be able to handle the cache misses, as long as the replicas are not bogged down with replicating writes.

4 *average* paste writes per second (higher at peak) should be doable for a single **SQL Write Master-Slave**. Otherwise, we'll need to employ additional SQL scaling patterns:

* [Federation](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#联合)
* [Sharding](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#分片)
* [Denormalization](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#非规范化)
* [SQL Tuning](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#SQL调优)

We should also consider moving some data to a **NoSQL Database**.

@@ -279,50 +279,50 @@ class HitCounts(MRJob):

### NoSQL

* [Key-value store](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#键-值存储)
* [Document store](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#文档类型存储)
* [Wide column store](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#列型存储)
* [Graph database](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#图数据库)
* [SQL vs NoSQL](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#sql-还是-nosql)

### Caching

* Where to cache
    * [Client caching](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#客户端缓存)
    * [CDN caching](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#cdn-缓存)
    * [Web server caching](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#web-服务器缓存)
    * [Database caching](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#数据库缓存)
    * [Application caching](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#应用缓存)
* What to cache
    * [Caching at the database query level](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#数据库查询级别的缓存)
    * [Caching at the object level](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#对象级别的缓存)
* When to update the cache
    * [Cache-aside](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#缓存模式)
    * [Write-through](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#直写模式)
    * [Write-behind (write-back)](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#回写模式)
    * [Refresh ahead](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#刷新)

### Asynchronism and microservices
|
||||
|
||||
* [Message queues](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#消息队列)
|
||||
* [Task queues](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#任务队列)
|
||||
* [Back pressure](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#背压)
|
||||
* [Microservices](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#微服务)
|
||||
* [Message queues](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#消息队列)
|
||||
* [Task queues](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#任务队列)
|
||||
* [Back pressure](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#背压)
|
||||
* [Microservices](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#微服务)
|
||||
|
||||
### Communications
|
||||
|
||||
* Discuss tradeoffs:
|
||||
* External communication with clients - [HTTP APIs following REST](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#表述性状态转移rest)
|
||||
* Internal communications - [RPC](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#远程过程调用协议rpc)
|
||||
* [Service discovery](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#服务发现)
|
||||
* External communication with clients - [HTTP APIs following REST](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#表述性状态转移rest)
|
||||
* Internal communications - [RPC](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#远程过程调用协议rpc)
|
||||
* [Service discovery](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#服务发现)
|
||||
|
||||
### Security
|
||||
|
||||
Refer to the [security section](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#安全).
|
||||
Refer to the [security section](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#安全) .
|
||||
|
||||
### Latency numbers
|
||||
|
||||
See [Latency numbers every programmer should know](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#每个程序员都应该知道的延迟数).
|
||||
See [Latency numbers every programmer should know](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#每个程序员都应该知道的延迟数) .
|
||||
|
||||
### Ongoing
|
||||
|
||||
|
@@ -1,4 +1,4 @@
|
||||
# Design Pastebin.com (or Bit.ly)
|
||||
# Design Pastebin.com (or Bit.ly)
|
||||
|
||||
*Note: This document links directly to relevant areas found in the [system design topics](https://github.com/donnemartin/system-design-primer#index-of-system-design-topics) to avoid duplication. Refer to the linked content for general talking points, tradeoffs, and alternatives.*
|
||||
|
||||
@@ -79,7 +79,7 @@ Handy conversion guide:
|
||||
|
||||
> Outline a high level design with all important components.
|
||||
|
||||

|
||||

|
||||
|
||||
## Step 3: Design core components
|
||||
|
||||
@@ -89,17 +89,17 @@ Handy conversion guide:
|
||||
|
||||
We could use a [relational database](https://github.com/donnemartin/system-design-primer#relational-database-management-system-rdbms) as a large hash table, mapping the generated url to a file server and path containing the paste file.
|
||||
|
||||
Instead of managing a file server, we could use a managed **Object Store** such as Amazon S3 or a [NoSQL document store](https://github.com/donnemartin/system-design-primer#document-store).
|
||||
Instead of managing a file server, we could use a managed **Object Store** such as Amazon S3 or a [NoSQL document store](https://github.com/donnemartin/system-design-primer#document-store) .
|
||||
|
||||
An alternative to a relational database acting as a large hash table, we could use a [NoSQL key-value store](https://github.com/donnemartin/system-design-primer#key-value-store). We should discuss the [tradeoffs between choosing SQL or NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql). The following discussion uses the relational database approach.
|
||||
An alternative to a relational database acting as a large hash table, we could use a [NoSQL key-value store](https://github.com/donnemartin/system-design-primer#key-value-store) . We should discuss the [tradeoffs between choosing SQL or NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql) . The following discussion uses the relational database approach.
|
||||
|
||||
* The **Client** sends a create paste request to the **Web Server**, running as a [reverse proxy](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
|
||||
* The **Client** sends a create paste request to the **Web Server**, running as a [reverse proxy](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
|
||||
* The **Web Server** forwards the request to the **Write API** server
|
||||
* The **Write API** server does the following:
|
||||
* Generates a unique url
|
||||
* Checks if the url is unique by looking at the **SQL Database** for a duplicate
|
||||
* If the url is not unique, it generates another url
|
||||
* If we supported a custom url, we could use the user-supplied url instead (also checking for a duplicate)
|
||||
* If we supported a custom url, we could use the user-supplied url instead (also checking for a duplicate)
|
||||
* Saves to the **SQL Database** `pastes` table
|
||||
* Saves the paste data to the **Object Store**
|
||||
* Returns the url
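The **Write API** steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the primer's implementation: an in-memory SQLite database stands in for the **SQL Database**, a plain dict stands in for the **Object Store**, and `create_paste`, `generate_url`, `OBJECT_STORE`, and the random-token generator are hypothetical names chosen for the example.

```python
import secrets
import sqlite3
import string
from datetime import datetime, timezone

ALPHABET = string.ascii_letters + string.digits  # Base 62 character set
URL_LENGTH = 7

# Hypothetical stand-ins: SQLite for the SQL Database, a dict for the Object Store.
db = sqlite3.connect(":memory:")
db.execute(
    """CREATE TABLE pastes (
           shortlink CHAR(7) NOT NULL PRIMARY KEY,
           expiration_length_in_minutes INT NOT NULL,
           created_at DATETIME NOT NULL,
           paste_path VARCHAR(255) NOT NULL
       )"""
)
OBJECT_STORE = {}


def generate_url():
    """Return a random 7-character Base 62 token (a placeholder generator)."""
    return "".join(secrets.choice(ALPHABET) for _ in range(URL_LENGTH))


def create_paste(contents, expiration_length_in_minutes, max_attempts=5):
    """Generate a unique url, save the metadata and contents, return the url."""
    for _ in range(max_attempts):
        shortlink = generate_url()
        duplicate = db.execute(
            "SELECT 1 FROM pastes WHERE shortlink = ?", (shortlink,)
        ).fetchone()
        if duplicate is None:
            break  # the url is unique
    else:
        raise RuntimeError("could not generate a unique url")

    paste_path = "pastes/" + shortlink
    created_at = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    db.execute(
        "INSERT INTO pastes VALUES (?, ?, ?, ?)",
        (shortlink, expiration_length_in_minutes, created_at, paste_path),
    )
    db.commit()
    OBJECT_STORE[paste_path] = contents  # save the paste data
    return shortlink
```

The check-then-insert sequence can race under concurrent writers; the primary key on `shortlink` would still reject a duplicate insert, so a real service would likely also handle that error and retry.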
|
||||
@@ -113,7 +113,7 @@ shortlink char(7) NOT NULL
|
||||
expiration_length_in_minutes int NOT NULL
|
||||
created_at datetime NOT NULL
|
||||
paste_path varchar(255) NOT NULL
|
||||
PRIMARY KEY(shortlink)
|
||||
PRIMARY KEY(shortlink)
|
||||
```
|
||||
|
||||
Setting the primary key to be based on the `shortlink` column creates an [index](https://github.com/donnemartin/system-design-primer#use-good-indices) that the database uses to enforce uniqueness. We'll create an additional index on `created_at` to speed up lookups (log-time instead of scanning the entire table) and to keep the data in memory. Reading 1 MB sequentially from memory takes about 250 microseconds, while reading from SSD takes 4x and from disk takes 80x longer.<sup><a href=https://github.com/donnemartin/system-design-primer#latency-numbers-every-programmer-should-know>1</a></sup>
|
||||
@@ -126,17 +126,17 @@ To generate the unique url, we could:
|
||||
* Alternatively, we could also take the MD5 hash of randomly-generated data
|
||||
* [**Base 62**](https://www.kerstner.at/2012/07/shortening-strings-using-base-62-encoding/) encode the MD5 hash
|
||||
* Base 62 encodes to `[a-zA-Z0-9]` which works well for urls, eliminating the need for escaping special characters
|
||||
* There is only one hash result for the original input and Base 62 is deterministic (no randomness involved)
|
||||
* There is only one hash result for the original input and Base 62 is deterministic (no randomness involved)
|
||||
* Base 64 is another popular encoding but causes issues for urls because of the additional `+` and `/` characters
|
||||
* The following [Base 62 pseudocode](http://stackoverflow.com/questions/742013/how-to-code-a-url-shortener) runs in O(k) time where k is the number of digits = 7:
|
||||
|
||||
```python
|
||||
def base_encode(num, base=62):
|
||||
def base_encode(num, base=62) :
|
||||
digits = []
|
||||
while num > 0
|
||||
remainder = modulo(num, base)
|
||||
digits.push(remainder)
|
||||
num = divide(num, base)
|
||||
remainder = modulo(num, base)
|
||||
digits.push(remainder)
|
||||
num = divide(num, base)
|
||||
digits = digits.reverse
|
||||
```
|
||||
|
||||
@@ -146,7 +146,7 @@ def base_encode(num, base=62):
|
||||
url = base_encode(md5(ip_address+timestamp))[:URL_LENGTH]
|
||||
```
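The Base 62 pseudocode and the `url = base_encode(md5(ip_address+timestamp))[:URL_LENGTH]` line above can be turned into a runnable sketch along these lines. The symbol ordering in `ALPHABET`, the `generate_shortlink` name, and the choice to read the MD5 digest as a big-endian integer are assumptions made for this example.

```python
import hashlib
import string

# 0-9, a-z, A-Z: one possible ordering of the 62 symbols (an assumption).
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase
URL_LENGTH = 7


def base_encode(num, alphabet=ALPHABET):
    """Encode a non-negative integer using the given alphabet."""
    if num < 0:
        raise ValueError("num must be non-negative")
    if num == 0:
        return alphabet[0]
    base = len(alphabet)
    digits = []
    while num > 0:
        num, remainder = divmod(num, base)
        digits.append(alphabet[remainder])
    # Remainders come out least-significant first, so reverse them.
    return "".join(reversed(digits))


def generate_shortlink(ip_address, timestamp):
    """Hash ip_address + timestamp with MD5, Base 62 encode, keep 7 characters."""
    digest = hashlib.md5((ip_address + timestamp).encode("utf-8")).digest()
    return base_encode(int.from_bytes(digest, "big"))[:URL_LENGTH]
```

Truncating to 7 characters discards most of the hash, so two different inputs can still map to the same shortlink; the duplicate check against the **SQL Database** remains necessary.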
|
||||
|
||||
We'll use a public [**REST API**](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest):
|
||||
We'll use a public [**REST API**](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest) :
|
||||
|
||||
```
|
||||
$ curl -X POST --data '{ "expiration_length_in_minutes": "60", \
|
||||
@@ -161,7 +161,7 @@ Response:
|
||||
}
|
||||
```
|
||||
|
||||
For internal communications, we could use [Remote Procedure Calls](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc).
|
||||
For internal communications, we could use [Remote Procedure Calls](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc) .
|
||||
|
||||
### Use case: User enters a paste's url and views the contents
|
||||
|
||||
@@ -195,36 +195,36 @@ Since realtime analytics are not a requirement, we could simply **MapReduce** th
|
||||
**Clarify with your interviewer how much code you are expected to write**.
|
||||
|
||||
```python
|
||||
class HitCounts(MRJob):
|
||||
class HitCounts(MRJob) :
|
||||
|
||||
def extract_url(self, line):
|
||||
def extract_url(self, line) :
|
||||
"""Extract the generated url from the log line."""
|
||||
...
|
||||
|
||||
def extract_year_month(self, line):
|
||||
def extract_year_month(self, line) :
|
||||
"""Return the year and month portions of the timestamp."""
|
||||
...
|
||||
|
||||
def mapper(self, _, line):
|
||||
def mapper(self, _, line) :
|
||||
"""Parse each log line, extract and transform relevant lines.
|
||||
|
||||
Emit key value pairs of the form:
|
||||
|
||||
(2016-01, url0), 1
|
||||
(2016-01, url0), 1
|
||||
(2016-01, url1), 1
|
||||
(2016-01, url0) , 1
|
||||
(2016-01, url0) , 1
|
||||
(2016-01, url1) , 1
|
||||
"""
|
||||
url = self.extract_url(line)
|
||||
period = self.extract_year_month(line)
|
||||
yield (period, url), 1
|
||||
url = self.extract_url(line)
|
||||
period = self.extract_year_month(line)
|
||||
yield (period, url) , 1
|
||||
|
||||
def reducer(self, key, values):
|
||||
def reducer(self, key, values) :
|
||||
"""Sum values for each key.
|
||||
|
||||
(2016-01, url0), 2
|
||||
(2016-01, url1), 1
|
||||
(2016-01, url0) , 2
|
||||
(2016-01, url1) , 1
|
||||
"""
|
||||
yield key, sum(values)
|
||||
yield key, sum(values)
|
||||
```
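To see what the map and reduce steps above compute without installing `mrjob`, the same logic can be simulated in plain Python. This is only a sketch: the log line format (`<ISO timestamp> <url>`) and the `run` helper that stands in for the shuffle phase are assumptions for illustration.

```python
from collections import defaultdict


def mapper(line):
    """Emit ((year-month, url), 1) for one log line."""
    timestamp, url = line.split()
    period = timestamp[:7]  # 'YYYY-MM' prefix of an ISO timestamp
    yield (period, url), 1


def reducer(key, values):
    """Sum the values for each key."""
    yield key, sum(values)


def run(lines):
    """Group mapper output by key (the shuffle), then reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    results = {}
    for key in sorted(groups):
        for out_key, total in reducer(key, groups[key]):
            results[out_key] = total
    return results
```

For example, three log lines with two hits on `url0` and one on `url1` in January 2016 reduce to a count of 2 and 1 respectively, matching the docstring in the job above.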
|
||||
|
||||
### Use case: Service deletes expired pastes
|
||||
@@ -235,7 +235,7 @@ To delete expired pastes, we could just scan the **SQL Database** for all entrie
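A scan for expired pastes might look roughly like the sketch below. It assumes SQLite as a stand-in for the **SQL Database**, timestamps stored as `YYYY-MM-DD HH:MM:SS` strings, and a hypothetical `delete_expired_pastes` helper; the returned paths are what a caller would then remove from the **Object Store**.

```python
import sqlite3

# Same columns as the `pastes` table described earlier.
SCHEMA = """
CREATE TABLE pastes (
    shortlink CHAR(7) NOT NULL,
    expiration_length_in_minutes INT NOT NULL,
    created_at DATETIME NOT NULL,
    paste_path VARCHAR(255) NOT NULL,
    PRIMARY KEY (shortlink)
)
"""

# created_at plus the expiration length, compared against the current time.
EXPIRED = (
    "datetime(created_at, '+' || expiration_length_in_minutes || ' minutes')"
    " <= datetime(?)"
)


def delete_expired_pastes(db, now):
    """Delete expired rows and return the paste paths that were removed.

    `now` is a 'YYYY-MM-DD HH:MM:SS' string in the same timezone as created_at.
    """
    expired = db.execute(
        "SELECT paste_path FROM pastes WHERE " + EXPIRED, (now,)
    ).fetchall()
    db.execute("DELETE FROM pastes WHERE " + EXPIRED, (now,))
    db.commit()
    return [path for (path,) in expired]
```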
|
||||
|
||||
> Identify and address bottlenecks, given the constraints.
|
||||
|
||||

|
||||

|
||||
|
||||
**Important: Do not simply jump right into the final design from the initial design!**
|
||||
|
||||
@@ -247,31 +247,31 @@ We'll introduce some components to complete the design and to address scalabilit
|
||||
|
||||
*To avoid repeating discussions*, refer to the following [system design topics](https://github.com/donnemartin/system-design-primer#index-of-system-design-topics) for main talking points, tradeoffs, and alternatives:
|
||||
|
||||
* [DNS](https://github.com/donnemartin/system-design-primer#domain-name-system)
|
||||
* [CDN](https://github.com/donnemartin/system-design-primer#content-delivery-network)
|
||||
* [Load balancer](https://github.com/donnemartin/system-design-primer#load-balancer)
|
||||
* [Horizontal scaling](https://github.com/donnemartin/system-design-primer#horizontal-scaling)
|
||||
* [Web server (reverse proxy)](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
|
||||
* [API server (application layer)](https://github.com/donnemartin/system-design-primer#application-layer)
|
||||
* [Cache](https://github.com/donnemartin/system-design-primer#cache)
|
||||
* [Relational database management system (RDBMS)](https://github.com/donnemartin/system-design-primer#relational-database-management-system-rdbms)
|
||||
* [SQL write master-slave failover](https://github.com/donnemartin/system-design-primer#fail-over)
|
||||
* [Master-slave replication](https://github.com/donnemartin/system-design-primer#master-slave-replication)
|
||||
* [Consistency patterns](https://github.com/donnemartin/system-design-primer#consistency-patterns)
|
||||
* [Availability patterns](https://github.com/donnemartin/system-design-primer#availability-patterns)
|
||||
* [DNS](https://github.com/donnemartin/system-design-primer#domain-name-system)
|
||||
* [CDN](https://github.com/donnemartin/system-design-primer#content-delivery-network)
|
||||
* [Load balancer](https://github.com/donnemartin/system-design-primer#load-balancer)
|
||||
* [Horizontal scaling](https://github.com/donnemartin/system-design-primer#horizontal-scaling)
|
||||
* [Web server (reverse proxy) ](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
|
||||
* [API server (application layer) ](https://github.com/donnemartin/system-design-primer#application-layer)
|
||||
* [Cache](https://github.com/donnemartin/system-design-primer#cache)
|
||||
* [Relational database management system (RDBMS) ](https://github.com/donnemartin/system-design-primer#relational-database-management-system-rdbms)
|
||||
* [SQL write master-slave failover](https://github.com/donnemartin/system-design-primer#fail-over)
|
||||
* [Master-slave replication](https://github.com/donnemartin/system-design-primer#master-slave-replication)
|
||||
* [Consistency patterns](https://github.com/donnemartin/system-design-primer#consistency-patterns)
|
||||
* [Availability patterns](https://github.com/donnemartin/system-design-primer#availability-patterns)
|
||||
|
||||
The **Analytics Database** could use a data warehousing solution such as Amazon Redshift or Google BigQuery.
|
||||
|
||||
An **Object Store** such as Amazon S3 can comfortably handle the constraint of 12.7 GB of new content per month.
|
||||
|
||||
To address the 40 *average* read requests per second (higher at peak), traffic for popular content should be handled by the **Memory Cache** instead of the database. The **Memory Cache** is also useful for handling the unevenly distributed traffic and traffic spikes. The **SQL Read Replicas** should be able to handle the cache misses, as long as the replicas are not bogged down with replicating writes.
|
||||
To address the 40 *average* read requests per second (higher at peak) , traffic for popular content should be handled by the **Memory Cache** instead of the database. The **Memory Cache** is also useful for handling the unevenly distributed traffic and traffic spikes. The **SQL Read Replicas** should be able to handle the cache misses, as long as the replicas are not bogged down with replicating writes.
|
||||
|
||||
4 *average* paste writes per second (higher at peak) should be manageable for a single **SQL Write Master-Slave**. Otherwise, we'll need to employ additional SQL scaling patterns:
|
||||
|
||||
* [Federation](https://github.com/donnemartin/system-design-primer#federation)
|
||||
* [Sharding](https://github.com/donnemartin/system-design-primer#sharding)
|
||||
* [Denormalization](https://github.com/donnemartin/system-design-primer#denormalization)
|
||||
* [SQL Tuning](https://github.com/donnemartin/system-design-primer#sql-tuning)
|
||||
* [Federation](https://github.com/donnemartin/system-design-primer#federation)
|
||||
* [Sharding](https://github.com/donnemartin/system-design-primer#sharding)
|
||||
* [Denormalization](https://github.com/donnemartin/system-design-primer#denormalization)
|
||||
* [SQL Tuning](https://github.com/donnemartin/system-design-primer#sql-tuning)
|
||||
|
||||
We should also consider moving some data to a **NoSQL Database**.
|
||||
|
||||
@@ -281,50 +281,50 @@ We should also consider moving some data to a **NoSQL Database**.
|
||||
|
||||
#### NoSQL
|
||||
|
||||
* [Key-value store](https://github.com/donnemartin/system-design-primer#key-value-store)
|
||||
* [Document store](https://github.com/donnemartin/system-design-primer#document-store)
|
||||
* [Wide column store](https://github.com/donnemartin/system-design-primer#wide-column-store)
|
||||
* [Graph database](https://github.com/donnemartin/system-design-primer#graph-database)
|
||||
* [SQL vs NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql)
|
||||
* [Key-value store](https://github.com/donnemartin/system-design-primer#key-value-store)
|
||||
* [Document store](https://github.com/donnemartin/system-design-primer#document-store)
|
||||
* [Wide column store](https://github.com/donnemartin/system-design-primer#wide-column-store)
|
||||
* [Graph database](https://github.com/donnemartin/system-design-primer#graph-database)
|
||||
* [SQL vs NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql)
|
||||
|
||||
### Caching
|
||||
|
||||
* Where to cache
|
||||
* [Client caching](https://github.com/donnemartin/system-design-primer#client-caching)
|
||||
* [CDN caching](https://github.com/donnemartin/system-design-primer#cdn-caching)
|
||||
* [Web server caching](https://github.com/donnemartin/system-design-primer#web-server-caching)
|
||||
* [Database caching](https://github.com/donnemartin/system-design-primer#database-caching)
|
||||
* [Application caching](https://github.com/donnemartin/system-design-primer#application-caching)
|
||||
* [Client caching](https://github.com/donnemartin/system-design-primer#client-caching)
|
||||
* [CDN caching](https://github.com/donnemartin/system-design-primer#cdn-caching)
|
||||
* [Web server caching](https://github.com/donnemartin/system-design-primer#web-server-caching)
|
||||
* [Database caching](https://github.com/donnemartin/system-design-primer#database-caching)
|
||||
* [Application caching](https://github.com/donnemartin/system-design-primer#application-caching)
|
||||
* What to cache
|
||||
* [Caching at the database query level](https://github.com/donnemartin/system-design-primer#caching-at-the-database-query-level)
|
||||
* [Caching at the object level](https://github.com/donnemartin/system-design-primer#caching-at-the-object-level)
|
||||
* [Caching at the database query level](https://github.com/donnemartin/system-design-primer#caching-at-the-database-query-level)
|
||||
* [Caching at the object level](https://github.com/donnemartin/system-design-primer#caching-at-the-object-level)
|
||||
* When to update the cache
|
||||
* [Cache-aside](https://github.com/donnemartin/system-design-primer#cache-aside)
|
||||
* [Write-through](https://github.com/donnemartin/system-design-primer#write-through)
|
||||
* [Write-behind (write-back)](https://github.com/donnemartin/system-design-primer#write-behind-write-back)
|
||||
* [Refresh ahead](https://github.com/donnemartin/system-design-primer#refresh-ahead)
|
||||
* [Cache-aside](https://github.com/donnemartin/system-design-primer#cache-aside)
|
||||
* [Write-through](https://github.com/donnemartin/system-design-primer#write-through)
|
||||
* [Write-behind (write-back) ](https://github.com/donnemartin/system-design-primer#write-behind-write-back)
|
||||
* [Refresh ahead](https://github.com/donnemartin/system-design-primer#refresh-ahead)
|
||||
|
||||
### Asynchronism and microservices
|
||||
|
||||
* [Message queues](https://github.com/donnemartin/system-design-primer#message-queues)
|
||||
* [Task queues](https://github.com/donnemartin/system-design-primer#task-queues)
|
||||
* [Back pressure](https://github.com/donnemartin/system-design-primer#back-pressure)
|
||||
* [Microservices](https://github.com/donnemartin/system-design-primer#microservices)
|
||||
* [Message queues](https://github.com/donnemartin/system-design-primer#message-queues)
|
||||
* [Task queues](https://github.com/donnemartin/system-design-primer#task-queues)
|
||||
* [Back pressure](https://github.com/donnemartin/system-design-primer#back-pressure)
|
||||
* [Microservices](https://github.com/donnemartin/system-design-primer#microservices)
|
||||
|
||||
### Communications
|
||||
|
||||
* Discuss tradeoffs:
|
||||
* External communication with clients - [HTTP APIs following REST](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest)
|
||||
* Internal communications - [RPC](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc)
|
||||
* [Service discovery](https://github.com/donnemartin/system-design-primer#service-discovery)
|
||||
* External communication with clients - [HTTP APIs following REST](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest)
|
||||
* Internal communications - [RPC](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc)
|
||||
* [Service discovery](https://github.com/donnemartin/system-design-primer#service-discovery)
|
||||
|
||||
### Security
|
||||
|
||||
Refer to the [security section](https://github.com/donnemartin/system-design-primer#security).
|
||||
Refer to the [security section](https://github.com/donnemartin/system-design-primer#security) .
|
||||
|
||||
### Latency numbers
|
||||
|
||||
See [Latency numbers every programmer should know](https://github.com/donnemartin/system-design-primer#latency-numbers-every-programmer-should-know).
|
||||
See [Latency numbers every programmer should know](https://github.com/donnemartin/system-design-primer#latency-numbers-every-programmer-should-know) .
|
||||
|
||||
### Ongoing
|
||||
|
||||
|
@@ -3,44 +3,44 @@
|
||||
from mrjob.job import MRJob
|
||||
|
||||
|
||||
class HitCounts(MRJob):
|
||||
class HitCounts(MRJob) :
|
||||
|
||||
def extract_url(self, line):
|
||||
def extract_url(self, line) :
|
||||
"""Extract the generated url from the log line."""
|
||||
pass
|
||||
|
||||
def extract_year_month(self, line):
|
||||
def extract_year_month(self, line) :
|
||||
"""Return the year and month portions of the timestamp."""
|
||||
pass
|
||||
|
||||
def mapper(self, _, line):
|
||||
def mapper(self, _, line) :
|
||||
"""Parse each log line, extract and transform relevant lines.
|
||||
|
||||
Emit key value pairs of the form:
|
||||
|
||||
(2016-01, url0), 1
|
||||
(2016-01, url0), 1
|
||||
(2016-01, url1), 1
|
||||
(2016-01, url0) , 1
|
||||
(2016-01, url0) , 1
|
||||
(2016-01, url1) , 1
|
||||
"""
|
||||
url = self.extract_url(line)
|
||||
period = self.extract_year_month(line)
|
||||
yield (period, url), 1
|
||||
url = self.extract_url(line)
|
||||
period = self.extract_year_month(line)
|
||||
yield (period, url) , 1
|
||||
|
||||
def reducer(self, key, values):
|
||||
def reducer(self, key, values) :
|
||||
"""Sum values for each key.
|
||||
|
||||
(2016-01, url0), 2
|
||||
(2016-01, url1), 1
|
||||
(2016-01, url0) , 2
|
||||
(2016-01, url1) , 1
|
||||
"""
|
||||
yield key, sum(values)
|
||||
yield key, sum(values)
|
||||
|
||||
def steps(self):
|
||||
def steps(self) :
|
||||
"""Run the map and reduce steps."""
|
||||
return [
|
||||
self.mr(mapper=self.mapper,
|
||||
reducer=self.reducer)
|
||||
reducer=self.reducer)
|
||||
]
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
HitCounts.run()
|
||||
HitCounts.run()
|
||||
|
@@ -1,6 +1,6 @@
|
||||
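# Design a key-value cache to save the results of the most recent web server queries
|
||||
|
||||
**Note: This document links directly to relevant areas found in the [system design topics](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#系统设计主题的索引) to avoid duplication. Refer to the linked content for general talking points, tradeoffs, and alternatives.**
|
||||
**Note: This document links directly to relevant areas found in the [system design topics](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#系统设计主题的索引) to avoid duplication. Refer to the linked content for general talking points, tradeoffs, and alternatives.**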
|
||||
|
||||
## Step 1: Outline use cases and constraints
|
||||
|
||||
@@ -58,7 +58,7 @@
|
||||
|
||||
> Outline a high level design with all important components.
|
||||
|
||||

|
||||

|
||||
|
||||
## Step 3: Design core components
|
||||
|
||||
@@ -70,7 +70,7 @@
|
||||
|
||||
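Since the cache has limited capacity, we'll use a least recently used (LRU) approach to expire older entries.
|
||||
|
||||
* The **Client** sends a request to the **Web Server**, running as a [reverse proxy](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#反向代理web-服务器)
|
||||
* The **Client** sends a request to the **Web Server**, running as a [reverse proxy](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#反向代理web-服务器)
|
||||
* The **Web Server** forwards the request to the **Query API** server
|
||||
* The **Query API** server does the following:
|
||||
* Parses the query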
|
||||
@@ -98,33 +98,33 @@
|
||||
**Query API** server implementation:
|
||||
|
||||
```python
|
||||
class QueryApi(object):
|
||||
class QueryApi(object) :
|
||||
|
||||
def __init__(self, memory_cache, reverse_index_service):
|
||||
def __init__(self, memory_cache, reverse_index_service) :
|
||||
self.memory_cache = memory_cache
|
||||
self.reverse_index_service = reverse_index_service
|
||||
|
||||
def parse_query(self, query):
|
||||
def parse_query(self, query) :
|
||||
"""Remove markup, break text into terms, deal with typos,
|
||||
normalize capitalization, convert to use boolean operations.
|
||||
"""
|
||||
...
|
||||
|
||||
def process_query(self, query):
|
||||
query = self.parse_query(query)
|
||||
results = self.memory_cache.get(query)
|
||||
def process_query(self, query) :
|
||||
query = self.parse_query(query)
|
||||
results = self.memory_cache.get(query)
|
||||
if results is None:
|
||||
results = self.reverse_index_service.process_search(query)
|
||||
self.memory_cache.set(query, results)
|
||||
results = self.reverse_index_service.process_search(query)
|
||||
self.memory_cache.set(query, results)
|
||||
return results
|
||||
```
|
||||
|
||||
**Node** implementation:
|
||||
|
||||
```python
|
||||
class Node(object):
|
||||
class Node(object) :
|
||||
|
||||
def __init__(self, query, results):
|
||||
def __init__(self, query, results) :
|
||||
self.query = query
|
||||
self.results = results
|
||||
```
|
||||
@@ -132,34 +132,34 @@ class Node(object):
|
||||
**LinkedList** implementation:
|
||||
|
||||
```python
|
||||
class LinkedList(object):
|
||||
class LinkedList(object) :
|
||||
|
||||
def __init__(self):
|
||||
def __init__(self) :
|
||||
self.head = None
|
||||
self.tail = None
|
||||
|
||||
def move_to_front(self, node):
|
||||
def move_to_front(self, node) :
|
||||
...
|
||||
|
||||
def append_to_front(self, node):
|
||||
def append_to_front(self, node) :
|
||||
...
|
||||
|
||||
def remove_from_tail(self):
|
||||
def remove_from_tail(self) :
|
||||
...
|
||||
```
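The `LinkedList` methods above are left as `...`. One way they might be filled in is sketched below, assuming a doubly linked list in which every node carries `prev` and `next` pointers. Those two attributes and the `_unlink` helper are additions for this example and are not part of the `Node` class shown earlier.

```python
class Node(object):

    def __init__(self, query, results):
        self.query = query
        self.results = results
        self.prev = None  # assumed pointer toward the head
        self.next = None  # assumed pointer toward the tail


class LinkedList(object):

    def __init__(self):
        self.head = None
        self.tail = None

    def _unlink(self, node):
        """Detach node from the list, fixing head and tail if needed."""
        if node.prev is not None:
            node.prev.next = node.next
        else:
            self.head = node.next
        if node.next is not None:
            node.next.prev = node.prev
        else:
            self.tail = node.prev
        node.prev = node.next = None

    def append_to_front(self, node):
        """Insert node as the new head (most recently used)."""
        node.prev = None
        node.next = self.head
        if self.head is not None:
            self.head.prev = node
        else:
            self.tail = node
        self.head = node

    def move_to_front(self, node):
        """Move an existing node to the head of the list."""
        if node is self.head:
            return
        self._unlink(node)
        self.append_to_front(node)

    def remove_from_tail(self):
        """Remove and return the least recently used node, or None if empty."""
        node = self.tail
        if node is not None:
            self._unlink(node)
        return node
```

All three operations run in constant time, which is why a doubly linked list paired with a hash table is a common layout for an LRU cache.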
|
||||
|
||||
**Cache** implementation:
|
||||
|
||||
```python
|
||||
class Cache(object):
|
||||
class Cache(object) :
|
||||
|
||||
def __init__(self, MAX_SIZE):
|
||||
def __init__(self, MAX_SIZE) :
|
||||
self.MAX_SIZE = MAX_SIZE
|
||||
self.size = 0
|
||||
self.lookup = {} # key: query, value: node
|
||||
self.linked_list = LinkedList()
|
||||
self.linked_list = LinkedList()
|
||||
|
||||
def get(self, query):
|
||||
def get(self, query):
|
||||
"""Get the stored query result from the cache.
|
||||
|
||||
Accessing a node updates its position to the front of the LRU list.
|
||||
@@ -167,10 +167,10 @@ class Cache(object):
|
||||
node = self.lookup.get(query)
|
||||
if node is None:
|
||||
return None
|
||||
self.linked_list.move_to_front(node)
|
||||
self.linked_list.move_to_front(node)
|
||||
return node.results
|
||||
|
||||
def set(self, query, results):
|
||||
def set(self, query, results) :
|
||||
"""Set the result for the given query key in the cache.
|
||||
|
||||
When updating an entry, updates its position to the front of the LRU list.
|
||||
@@ -181,18 +181,18 @@ class Cache(object):
|
||||
if node is not None:
|
||||
# Key exists in cache, update the value
|
||||
node.results = results
|
||||
self.linked_list.move_to_front(node)
|
||||
self.linked_list.move_to_front(node)
|
||||
else:
|
||||
# Key does not exist in cache
|
||||
if self.size == self.MAX_SIZE:
|
||||
# Remove the oldest entry from the linked list and lookup
|
||||
self.lookup.pop(self.linked_list.tail.query, None)
|
||||
self.linked_list.remove_from_tail()
|
||||
self.lookup.pop(self.linked_list.tail.query, None)
|
||||
self.linked_list.remove_from_tail()
|
||||
else:
|
||||
self.size += 1
|
||||
# Add the new key and value
|
||||
new_node = Node(query, results)
|
||||
self.linked_list.append_to_front(new_node)
|
||||
new_node = Node(query, results)
|
||||
self.linked_list.append_to_front(new_node)
|
||||
self.lookup[query] = new_node
|
||||
```
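For comparison, the same LRU behavior can be sketched with `collections.OrderedDict` from the Python standard library, which tracks key order and provides `move_to_end` and `popitem(last=False)`. The `LRUCache` name is hypothetical, and the sketch assumes `max_size` is at least 1, that `set` takes `(query, results)` in that order, and that a miss returns `None`.

```python
from collections import OrderedDict


class LRUCache(object):

    def __init__(self, max_size):
        self.max_size = max_size  # assumed to be >= 1
        self.entries = OrderedDict()  # oldest entry first, newest last

    def get(self, query):
        """Return the cached results, or None on a cache miss."""
        if query not in self.entries:
            return None
        self.entries.move_to_end(query)  # mark as most recently used
        return self.entries[query]

    def set(self, query, results):
        """Store results for query, evicting the oldest entry when full."""
        if query in self.entries:
            self.entries.move_to_end(query)
        elif len(self.entries) >= self.max_size:
            self.entries.popitem(last=False)  # drop the least recently used
        self.entries[query] = results
```

This trades the explicit linked list for a library container, which is convenient in practice but hides the pointer manipulation an interviewer may want to see.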
|
||||
|
||||
@@ -206,13 +206,13 @@ class Cache(object):
|
||||
|
||||
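The most straightforward way to handle these cases is to set a maximum time that a cached entry can stay in the cache before it is updated, usually referred to as time to live (TTL).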
|
||||
|
||||
Refer to [When to update the cache](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#何时更新缓存) for tradeoffs and alternatives. The approach above describes [cache-aside](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#缓存模式).
|
||||
Refer to [When to update the cache](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#何时更新缓存) for tradeoffs and alternatives. The approach above describes [cache-aside](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#缓存模式) .
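The TTL idea can be illustrated with a small sketch. The `TTLCache` class, its `clock` parameter (injected so the expiry logic can be exercised without sleeping), and the lazy expire-on-read policy are assumptions for this example rather than part of the design above.

```python
import time


class TTLCache(object):

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # callable returning the current time in seconds
        self.entries = {}  # key: query, value: (expires_at, results)

    def get(self, query):
        """Return cached results, or None if missing or past their TTL."""
        entry = self.entries.get(query)
        if entry is None:
            return None
        expires_at, results = entry
        if self.clock() >= expires_at:
            del self.entries[query]  # expired: drop it so the caller refetches
            return None
        return results

    def set(self, query, results):
        """Store results and stamp them with an expiry time."""
        self.entries[query] = (self.clock() + self.ttl_seconds, results)
```

A short TTL keeps results fresher at the cost of more cache misses; a long TTL does the opposite.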
|
||||
|
||||
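## Step 4: Scale the design
|
||||
|
||||
> Identify and address bottlenecks, given the constraints.
|
||||
|
||||

|
||||

|
||||
|
||||
**Important: Do not simply jump right into the final design from the initial design!**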
|
||||
|
||||
@@ -222,16 +222,16 @@ class Cache(object):
|
||||
|
||||
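We'll introduce some components to complete the design and to address scalability issues. Internal load balancers are not shown to reduce clutter.
|
||||
|
||||
**To avoid repeating discussions**, refer to the following [system design topics](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#系统设计主题的索引) for main talking points, tradeoffs, and alternatives:
|
||||
**To avoid repeating discussions**, refer to the following [system design topics](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#系统设计主题的索引) for main talking points, tradeoffs, and alternatives: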
|
||||
|
||||
* [DNS](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#域名系统)
|
||||
* [Load balancer](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#负载均衡器)
|
||||
* [Horizontal scaling](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#水平扩展)
|
||||
* [Web server (reverse proxy)](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#反向代理web-服务器)
|
||||
* [API server (application layer)](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#应用层)
|
||||
* [Cache](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#缓存)
|
||||
* [Consistency patterns](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#一致性模式)
|
||||
* [Availability patterns](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#可用性模式)
|
||||
* [DNS](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#域名系统)
|
||||
* [Load balancer](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#负载均衡器)
|
||||
* [Horizontal scaling](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#水平扩展)
|
||||
* [Web server (reverse proxy)](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#反向代理web-服务器)
|
||||
* [API server (application layer)](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#应用层)
|
||||
* [Cache](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#缓存)
|
||||
* [Consistency patterns](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#一致性模式)
|
||||
* [Availability patterns](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#可用性模式)
|
||||
|
||||
### Expanding the Memory Cache to many machines
|
||||
|
||||
@@ -239,7 +239,7 @@ class Cache(object):
|
||||
|
||||
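* **Each machine in the cache cluster has its own cache** - Simple, although it will likely result in a low cache hit rate.
|
||||
* **Each machine in the cache cluster has a copy of the cache** - Simple, although it is an inefficient use of memory.
|
||||
* **The cache is [sharded](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#分片) across all machines in the cache cluster** - More complex, although it is likely the best option. We could use hashing to determine which machine could have the cached results of a query using `machine = hash(query)`. We'll likely want to use [consistent hashing](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#正在完善中).
|
||||
* **The cache is [sharded](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#分片) across all machines in the cache cluster** - More complex, although it is likely the best option. We could use hashing to determine which machine could have the cached results of a query using `machine = hash(query) `. We'll likely want to use [consistent hashing](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#正在完善中) .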
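The sharding option, `machine = hash(query)`, can be made concrete with a short sketch. It assumes machines are numbered `0..num_machines-1` and uses a hypothetical `machine_for` helper. It hashes with `hashlib.md5` rather than Python's built-in `hash()`, because string hashes from the built-in are randomized per process by default and would send the same query to different machines from different API servers.

```python
import hashlib


def machine_for(query, num_machines):
    """Map a query string to a machine index in [0, num_machines)."""
    digest = hashlib.md5(query.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_machines


def moved_fraction(queries, old_count, new_count):
    """Fraction of queries whose machine changes when the cluster is resized."""
    moved = sum(
        1 for q in queries
        if machine_for(q, old_count) != machine_for(q, new_count)
    )
    return moved / len(queries)
```

With plain modulo hashing, resizing the cluster remaps most keys, which `moved_fraction` makes visible; that remapping cost is the usual motivation for consistent hashing.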
|
||||
|
||||
## Additional talking points
|
||||
|
||||
@@ -247,58 +247,58 @@ class Cache(object):
|
||||
|
||||
### SQL scaling patterns
|
||||
|
||||
* [Read replicas](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#主从复制)
|
||||
* [Federation](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#联合)
|
||||
* [Sharding](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#分片)
|
||||
* [Denormalization](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#非规范化)
|
||||
* [SQL Tuning](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#sql-调优)
|
||||
* [Read replicas](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#主从复制)
|
||||
* [Federation](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#联合)
|
||||
* [Sharding](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#分片)
|
||||
* [Denormalization](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#非规范化)
|
||||
* [SQL Tuning](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#sql-调优)
|
||||
|
||||
#### NoSQL
|
||||
|
||||
* [键-值存储](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#键-值存储)
|
||||
* [文档类型存储](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#文档类型存储)
|
||||
* [列型存储](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#列型存储)
|
||||
* [图数据库](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#图数据库)
|
||||
* [SQL vs NoSQL](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#sql-还是-nosql)
|
||||
* [键-值存储](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#键-值存储)
|
||||
* [文档类型存储](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#文档类型存储)
|
||||
* [列型存储](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#列型存储)
|
||||
* [图数据库](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#图数据库)
|
||||
* [SQL vs NoSQL](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#sql-还是-nosql)
|
||||
|
||||
### 缓存
|
||||
|
||||
* 在哪缓存
|
||||
* [客户端缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#客户端缓存)
|
||||
* [CDN 缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#cdn-缓存)
|
||||
* [Web 服务器缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#web-服务器缓存)
|
||||
* [数据库缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#数据库缓存)
|
||||
* [应用缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#应用缓存)
|
||||
* [客户端缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#客户端缓存)
|
||||
* [CDN 缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#cdn-缓存)
|
||||
* [Web 服务器缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#web-服务器缓存)
|
||||
* [数据库缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#数据库缓存)
|
||||
* [应用缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#应用缓存)
|
||||
* 什么需要缓存
|
||||
* [数据库查询级别的缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#数据库查询级别的缓存)
|
||||
* [对象级别的缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#对象级别的缓存)
|
||||
* [数据库查询级别的缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#数据库查询级别的缓存)
|
||||
* [对象级别的缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#对象级别的缓存)
|
||||
* 何时更新缓存
|
||||
* [缓存模式](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#缓存模式)
|
||||
* [直写模式](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#直写模式)
|
||||
* [回写模式](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#回写模式)
|
||||
* [刷新](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#刷新)
|
||||
* [缓存模式](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#缓存模式)
|
||||
* [直写模式](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#直写模式)
|
||||
* [回写模式](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#回写模式)
|
||||
* [刷新](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#刷新)
|
||||
|
||||
### 异步与微服务
|
||||
|
||||
* [消息队列](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#消息队列)
|
||||
* [任务队列](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#任务队列)
|
||||
* [背压](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#背压)
|
||||
* [微服务](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#微服务)
|
||||
* [消息队列](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#消息队列)
|
||||
* [任务队列](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#任务队列)
|
||||
* [背压](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#背压)
|
||||
* [微服务](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#微服务)
|
||||
|
||||
### 通信
|
||||
|
||||
* 可权衡选择的方案:
|
||||
* 与客户端的外部通信 - [使用 REST 作为 HTTP API](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#表述性状态转移rest)
|
||||
* 服务器内部通信 - [RPC](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#远程过程调用协议rpc)
|
||||
* [服务发现](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#服务发现)
|
||||
* 与客户端的外部通信 - [使用 REST 作为 HTTP API](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#表述性状态转移rest)
|
||||
* 服务器内部通信 - [RPC](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#远程过程调用协议rpc)
|
||||
* [服务发现](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#服务发现)
|
||||
|
||||
### 安全性
|
||||
|
||||
请参阅[「安全」](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#安全)一章。
|
||||
请参阅[「安全」](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#安全) 一章。
|
||||
|
||||
### 延迟数值
|
||||
|
||||
请参阅[「每个程序员都应该知道的延迟数」](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#每个程序员都应该知道的延迟数)。
|
||||
请参阅[「每个程序员都应该知道的延迟数」](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#每个程序员都应该知道的延迟数) 。
|
||||
|
||||
### 持续探讨
|
||||
|
||||
|
@@ -58,7 +58,7 @@ Handy conversion guide:
|
||||
|
||||
> Outline a high level design with all important components.
|
||||
|
||||

|
||||

|
||||
|
||||
## Step 3: Design core components
|
||||
|
||||
@@ -70,7 +70,7 @@ Popular queries can be served from a **Memory Cache** such as Redis or Memcached
|
||||
|
||||
Since the cache has limited capacity, we'll use a least recently used (LRU) approach to expire older entries.
|
||||
|
||||
* The **Client** sends a request to the **Web Server**, running as a [reverse proxy](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
|
||||
* The **Client** sends a request to the **Web Server**, running as a [reverse proxy](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
|
||||
* The **Web Server** forwards the request to the **Query API** server
|
||||
* The **Query API** server does the following:
|
||||
* Parses the query
|
||||
@@ -98,33 +98,33 @@ The cache can use a doubly-linked list: new items will be added to the head whil
|
||||
**Query API Server** implementation:
|
||||
|
||||
```python
|
||||
class QueryApi(object):
|
||||
class QueryApi(object) :
|
||||
|
||||
def __init__(self, memory_cache, reverse_index_service):
|
||||
def __init__(self, memory_cache, reverse_index_service) :
|
||||
self.memory_cache = memory_cache
|
||||
self.reverse_index_service = reverse_index_service
|
||||
|
||||
def parse_query(self, query):
|
||||
def parse_query(self, query) :
|
||||
"""Remove markup, break text into terms, deal with typos,
|
||||
normalize capitalization, convert to use boolean operations.
|
||||
"""
|
||||
...
|
||||
|
||||
def process_query(self, query):
|
||||
query = self.parse_query(query)
|
||||
results = self.memory_cache.get(query)
|
||||
def process_query(self, query) :
|
||||
query = self.parse_query(query)
|
||||
results = self.memory_cache.get(query)
|
||||
if results is None:
|
||||
results = self.reverse_index_service.process_search(query)
|
||||
self.memory_cache.set(query, results)
|
||||
results = self.reverse_index_service.process_search(query)
|
||||
self.memory_cache.set(query, results)
|
||||
return results
|
||||
```
|
||||
|
||||
**Node** implementation:
|
||||
|
||||
```python
|
||||
class Node(object):
|
||||
class Node(object) :
|
||||
|
||||
def __init__(self, query, results):
|
||||
def __init__(self, query, results) :
|
||||
self.query = query
|
||||
self.results = results
|
||||
```
|
||||
@@ -132,34 +132,34 @@ class Node(object):
|
||||
**LinkedList** implementation:
|
||||
|
||||
```python
|
||||
class LinkedList(object):
|
||||
class LinkedList(object) :
|
||||
|
||||
def __init__(self):
|
||||
def __init__(self) :
|
||||
self.head = None
|
||||
self.tail = None
|
||||
|
||||
def move_to_front(self, node):
|
||||
def move_to_front(self, node) :
|
||||
...
|
||||
|
||||
def append_to_front(self, node):
|
||||
def append_to_front(self, node) :
|
||||
...
|
||||
|
||||
def remove_from_tail(self):
|
||||
def remove_from_tail(self) :
|
||||
...
|
||||
```
|
||||
|
||||
**Cache** implementation:
|
||||
|
||||
```python
|
||||
class Cache(object):
|
||||
class Cache(object) :
|
||||
|
||||
def __init__(self, MAX_SIZE):
|
||||
def __init__(self, MAX_SIZE) :
|
||||
self.MAX_SIZE = MAX_SIZE
|
||||
self.size = 0
|
||||
self.lookup = {} # key: query, value: node
|
||||
self.linked_list = LinkedList()
|
||||
self.linked_list = LinkedList()
|
||||
|
||||
def get(self, query):
|
||||
def get(self, query):
|
||||
"""Get the stored query result from the cache.
|
||||
|
||||
Accessing a node updates its position to the front of the LRU list.
|
||||
@@ -167,10 +167,10 @@ class Cache(object):
|
||||
node = self.lookup.get(query)
|
||||
if node is None:
|
||||
return None
|
||||
self.linked_list.move_to_front(node)
|
||||
self.linked_list.move_to_front(node)
|
||||
return node.results
|
||||
|
||||
def set(self, query, results):
|
||||
def set(self, query, results) :
|
||||
"""Set the result for the given query key in the cache.
|
||||
|
||||
When updating an entry, updates its position to the front of the LRU list.
|
||||
@@ -181,18 +181,18 @@ class Cache(object):
|
||||
if node is not None:
|
||||
# Key exists in cache, update the value
|
||||
node.results = results
|
||||
self.linked_list.move_to_front(node)
|
||||
self.linked_list.move_to_front(node)
|
||||
else:
|
||||
# Key does not exist in cache
|
||||
if self.size == self.MAX_SIZE:
|
||||
# Remove the oldest entry from the linked list and lookup
|
||||
self.lookup.pop(self.linked_list.tail.query, None)
|
||||
self.linked_list.remove_from_tail()
|
||||
self.lookup.pop(self.linked_list.tail.query, None)
|
||||
self.linked_list.remove_from_tail()
|
||||
else:
|
||||
self.size += 1
|
||||
# Add the new key and value
|
||||
new_node = Node(query, results)
|
||||
self.linked_list.append_to_front(new_node)
|
||||
new_node = Node(query, results)
|
||||
self.linked_list.append_to_front(new_node)
|
||||
self.lookup[query] = new_node
|
||||
```
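
The `LinkedList` methods above are left as `...`. As a minimal runnable sketch of the same LRU idea, the version below fills them in with a doubly-linked list that uses sentinel head and tail nodes. The sentinel nodes, the `_unlink` helper, and the `max_size` name are assumptions for illustration, not part of the original design. `get` uses `dict.get` so a miss returns `None`, and `set` takes `(query, results)` in the order `process_query` passes them.

```python
class Node:
    def __init__(self, query=None, results=None):
        self.query = query
        self.results = results
        self.prev = None
        self.next = None


class LinkedList:
    """Doubly-linked list; head/tail are sentinels that hold no data."""

    def __init__(self):
        self.head = Node()
        self.tail = Node()
        self.head.next = self.tail
        self.tail.prev = self.head

    def _unlink(self, node):
        node.prev.next = node.next
        node.next.prev = node.prev

    def append_to_front(self, node):
        node.prev = self.head
        node.next = self.head.next
        self.head.next.prev = node
        self.head.next = node

    def move_to_front(self, node):
        self._unlink(node)
        self.append_to_front(node)

    def remove_from_tail(self):
        """Unlink and return the least recently used node, or None if empty."""
        node = self.tail.prev
        if node is self.head:
            return None
        self._unlink(node)
        return node


class Cache:
    def __init__(self, max_size):
        self.max_size = max_size  # assumed to be >= 1
        self.lookup = {}  # key: query, value: node
        self.linked_list = LinkedList()

    def get(self, query):
        node = self.lookup.get(query)
        if node is None:
            return None
        self.linked_list.move_to_front(node)
        return node.results

    def set(self, query, results):
        node = self.lookup.get(query)
        if node is not None:
            # Key exists in cache: update the value and mark it most recent
            node.results = results
            self.linked_list.move_to_front(node)
            return
        if len(self.lookup) >= self.max_size:
            # Evict the least recently used entry from the list and the lookup
            evicted = self.linked_list.remove_from_tail()
            if evicted is not None:
                del self.lookup[evicted.query]
        new_node = Node(query, results)
        self.linked_list.append_to_front(new_node)
        self.lookup[query] = new_node
```

Both `get` and `set` run in constant time: the dictionary finds the node, and the list only relinks a fixed number of pointers.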
|
||||
|
||||
@@ -204,15 +204,15 @@ The cache should be updated when:
|
||||
* The page is removed or a new page is added
|
||||
* The page rank changes
|
||||
|
||||
The most straightforward way to handle these cases is to simply set a max time that a cached entry can stay in the cache before it is updated, usually referred to as time to live (TTL).
|
||||
The most straightforward way to handle these cases is to simply set a max time that a cached entry can stay in the cache before it is updated, usually referred to as time to live (TTL) .
|
||||
|
||||
Refer to [When to update the cache](https://github.com/donnemartin/system-design-primer#when-to-update-the-cache) for tradeoffs and alternatives. The approach above describes [cache-aside](https://github.com/donnemartin/system-design-primer#cache-aside).
|
||||
Refer to [When to update the cache](https://github.com/donnemartin/system-design-primer#when-to-update-the-cache) for tradeoffs and alternatives. The approach above describes [cache-aside](https://github.com/donnemartin/system-design-primer#cache-aside) .
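
A TTL combined with cache-aside could be sketched roughly as follows. `TTLCache`, the injectable `clock` parameter, and the `search` callable are hypothetical names introduced for illustration; a real deployment would more likely rely on the expiry support built into Redis or Memcached.

```python
import time


class TTLCache:
    """Stores each entry with an expiry time; expired entries read as misses."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self.entries = {}  # key: query, value: (expires_at, results)

    def get(self, query):
        entry = self.entries.get(query)
        if entry is None:
            return None
        expires_at, results = entry
        if self.clock() >= expires_at:
            # Entry outlived its TTL: drop it and report a miss
            del self.entries[query]
            return None
        return results

    def set(self, query, results):
        self.entries[query] = (self.clock() + self.ttl_seconds, results)


def process_query(query, cache, search):
    """Cache-aside: read the cache first, fall back to the backend on a miss."""
    results = cache.get(query)
    if results is None:
        results = search(query)
        cache.set(query, results)
    return results
```

With this shape, a stale entry is served for at most `ttl_seconds` after the underlying page changes, which is the tradeoff the TTL approach accepts.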
|
||||
|
||||
## Step 4: Scale the design
|
||||
|
||||
> Identify and address bottlenecks, given the constraints.
|
||||
|
||||

|
||||

|
||||
|
||||
**Important: Do not simply jump right into the final design from the initial design!**
|
||||
|
||||
@@ -224,14 +224,14 @@ We'll introduce some components to complete the design and to address scalabilit
|
||||
|
||||
*To avoid repeating discussions*, refer to the following [system design topics](https://github.com/donnemartin/system-design-primer#index-of-system-design-topics) for main talking points, tradeoffs, and alternatives:
|
||||
|
||||
* [DNS](https://github.com/donnemartin/system-design-primer#domain-name-system)
|
||||
* [Load balancer](https://github.com/donnemartin/system-design-primer#load-balancer)
|
||||
* [Horizontal scaling](https://github.com/donnemartin/system-design-primer#horizontal-scaling)
|
||||
* [Web server (reverse proxy)](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
|
||||
* [API server (application layer)](https://github.com/donnemartin/system-design-primer#application-layer)
|
||||
* [Cache](https://github.com/donnemartin/system-design-primer#cache)
|
||||
* [Consistency patterns](https://github.com/donnemartin/system-design-primer#consistency-patterns)
|
||||
* [Availability patterns](https://github.com/donnemartin/system-design-primer#availability-patterns)
|
||||
* [DNS](https://github.com/donnemartin/system-design-primer#domain-name-system)
|
||||
* [Load balancer](https://github.com/donnemartin/system-design-primer#load-balancer)
|
||||
* [Horizontal scaling](https://github.com/donnemartin/system-design-primer#horizontal-scaling)
|
||||
* [Web server (reverse proxy) ](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
|
||||
* [API server (application layer) ](https://github.com/donnemartin/system-design-primer#application-layer)
|
||||
* [Cache](https://github.com/donnemartin/system-design-primer#cache)
|
||||
* [Consistency patterns](https://github.com/donnemartin/system-design-primer#consistency-patterns)
|
||||
* [Availability patterns](https://github.com/donnemartin/system-design-primer#availability-patterns)
|
||||
|
||||
### Expanding the Memory Cache to many machines
|
||||
|
||||
@@ -239,7 +239,7 @@ To handle the heavy request load and the large amount of memory needed, we'll sc
|
||||
|
||||
* **Each machine in the cache cluster has its own cache** - Simple, although it will likely result in a low cache hit rate.
|
||||
* **Each machine in the cache cluster has a copy of the cache** - Simple, although it is an inefficient use of memory.
|
||||
* **The cache is [sharded](https://github.com/donnemartin/system-design-primer#sharding) across all machines in the cache cluster** - More complex, although it is likely the best option. We could use hashing to determine which machine could have the cached results of a query using `machine = hash(query)`. We'll likely want to use [consistent hashing](https://github.com/donnemartin/system-design-primer#under-development).
|
||||
* **The cache is [sharded](https://github.com/donnemartin/system-design-primer#sharding) across all machines in the cache cluster** - More complex, although it is likely the best option. We could use hashing to determine which machine could have the cached results of a query using `machine = hash(query) `. We'll likely want to use [consistent hashing](https://github.com/donnemartin/system-design-primer#under-development) .
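
To make the sharding option more concrete, here is a minimal sketch of both lookups: plain modulo hashing and a consistent hash ring. The machine names, the `replicas` count, and the helper names are assumptions for illustration. `hashlib` is used instead of the built-in `hash()` because Python randomizes string hashes per process, so different API servers would otherwise disagree on placement.

```python
import bisect
import hashlib


def stable_hash(key):
    """Hash a string to an integer that is identical across processes."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def modulo_shard(query, machines):
    """machine = hash(query) % N. Simple, but most keys move when N changes."""
    return machines[stable_hash(query) % len(machines)]


class ConsistentHashRing:
    def __init__(self, machines, replicas=100):
        self.replicas = replicas  # virtual nodes per machine, smooths the load
        self._ring = []  # sorted list of (point, machine)
        for machine in machines:
            self.add(machine)

    def add(self, machine):
        for i in range(self.replicas):
            point = stable_hash(f"{machine}#{i}")
            bisect.insort(self._ring, (point, machine))

    def remove(self, machine):
        self._ring = [(p, m) for p, m in self._ring if m != machine]

    def lookup(self, query):
        """Return the machine owning the first ring point at or after the key."""
        if not self._ring:
            raise LookupError("no machines in the ring")
        index = bisect.bisect_left(self._ring, (stable_hash(query),))
        return self._ring[index % len(self._ring)][1]
```

The practical difference shows up when the cluster changes size: with the ring, removing one machine only remaps the queries that machine owned, so the rest of the cache stays warm.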
|
||||
|
||||
## Additional talking points
|
||||
|
||||
@@ -247,58 +247,58 @@ To handle the heavy request load and the large amount of memory needed, we'll sc
|
||||
|
||||
### SQL scaling patterns
|
||||
|
||||
* [Read replicas](https://github.com/donnemartin/system-design-primer#master-slave-replication)
|
||||
* [Federation](https://github.com/donnemartin/system-design-primer#federation)
|
||||
* [Sharding](https://github.com/donnemartin/system-design-primer#sharding)
|
||||
* [Denormalization](https://github.com/donnemartin/system-design-primer#denormalization)
|
||||
* [SQL Tuning](https://github.com/donnemartin/system-design-primer#sql-tuning)
|
||||
* [Read replicas](https://github.com/donnemartin/system-design-primer#master-slave-replication)
|
||||
* [Federation](https://github.com/donnemartin/system-design-primer#federation)
|
||||
* [Sharding](https://github.com/donnemartin/system-design-primer#sharding)
|
||||
* [Denormalization](https://github.com/donnemartin/system-design-primer#denormalization)
|
||||
* [SQL Tuning](https://github.com/donnemartin/system-design-primer#sql-tuning)
|
||||
|
||||
#### NoSQL
|
||||
|
||||
* [Key-value store](https://github.com/donnemartin/system-design-primer#key-value-store)
|
||||
* [Document store](https://github.com/donnemartin/system-design-primer#document-store)
|
||||
* [Wide column store](https://github.com/donnemartin/system-design-primer#wide-column-store)
|
||||
* [Graph database](https://github.com/donnemartin/system-design-primer#graph-database)
|
||||
* [SQL vs NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql)
|
||||
* [Key-value store](https://github.com/donnemartin/system-design-primer#key-value-store)
|
||||
* [Document store](https://github.com/donnemartin/system-design-primer#document-store)
|
||||
* [Wide column store](https://github.com/donnemartin/system-design-primer#wide-column-store)
|
||||
* [Graph database](https://github.com/donnemartin/system-design-primer#graph-database)
|
||||
* [SQL vs NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql)
|
||||
|
||||
### Caching
|
||||
|
||||
* Where to cache
|
||||
* [Client caching](https://github.com/donnemartin/system-design-primer#client-caching)
|
||||
* [CDN caching](https://github.com/donnemartin/system-design-primer#cdn-caching)
|
||||
* [Web server caching](https://github.com/donnemartin/system-design-primer#web-server-caching)
|
||||
* [Database caching](https://github.com/donnemartin/system-design-primer#database-caching)
|
||||
* [Application caching](https://github.com/donnemartin/system-design-primer#application-caching)
|
||||
* [Client caching](https://github.com/donnemartin/system-design-primer#client-caching)
|
||||
* [CDN caching](https://github.com/donnemartin/system-design-primer#cdn-caching)
|
||||
* [Web server caching](https://github.com/donnemartin/system-design-primer#web-server-caching)
|
||||
* [Database caching](https://github.com/donnemartin/system-design-primer#database-caching)
|
||||
* [Application caching](https://github.com/donnemartin/system-design-primer#application-caching)
|
||||
* What to cache
|
||||
* [Caching at the database query level](https://github.com/donnemartin/system-design-primer#caching-at-the-database-query-level)
|
||||
* [Caching at the object level](https://github.com/donnemartin/system-design-primer#caching-at-the-object-level)
|
||||
* [Caching at the database query level](https://github.com/donnemartin/system-design-primer#caching-at-the-database-query-level)
|
||||
* [Caching at the object level](https://github.com/donnemartin/system-design-primer#caching-at-the-object-level)
|
||||
* When to update the cache
|
||||
* [Cache-aside](https://github.com/donnemartin/system-design-primer#cache-aside)
|
||||
* [Write-through](https://github.com/donnemartin/system-design-primer#write-through)
|
||||
* [Write-behind (write-back)](https://github.com/donnemartin/system-design-primer#write-behind-write-back)
|
||||
* [Refresh ahead](https://github.com/donnemartin/system-design-primer#refresh-ahead)
|
||||
* [Cache-aside](https://github.com/donnemartin/system-design-primer#cache-aside)
|
||||
* [Write-through](https://github.com/donnemartin/system-design-primer#write-through)
|
||||
* [Write-behind (write-back) ](https://github.com/donnemartin/system-design-primer#write-behind-write-back)
|
||||
* [Refresh ahead](https://github.com/donnemartin/system-design-primer#refresh-ahead)
|
||||
|
||||
### Asynchronism and microservices
|
||||
|
||||
* [Message queues](https://github.com/donnemartin/system-design-primer#message-queues)
|
||||
* [Task queues](https://github.com/donnemartin/system-design-primer#task-queues)
|
||||
* [Back pressure](https://github.com/donnemartin/system-design-primer#back-pressure)
|
||||
* [Microservices](https://github.com/donnemartin/system-design-primer#microservices)
|
||||
* [Message queues](https://github.com/donnemartin/system-design-primer#message-queues)
|
||||
* [Task queues](https://github.com/donnemartin/system-design-primer#task-queues)
|
||||
* [Back pressure](https://github.com/donnemartin/system-design-primer#back-pressure)
|
||||
* [Microservices](https://github.com/donnemartin/system-design-primer#microservices)
|
||||
|
||||
### Communications
|
||||
|
||||
* Discuss tradeoffs:
|
||||
* External communication with clients - [HTTP APIs following REST](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest)
|
||||
* Internal communications - [RPC](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc)
|
||||
* [Service discovery](https://github.com/donnemartin/system-design-primer#service-discovery)
|
||||
* External communication with clients - [HTTP APIs following REST](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest)
|
||||
* Internal communications - [RPC](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc)
|
||||
* [Service discovery](https://github.com/donnemartin/system-design-primer#service-discovery)
|
||||
|
||||
### Security
|
||||
|
||||
Refer to the [security section](https://github.com/donnemartin/system-design-primer#security).
|
||||
Refer to the [security section](https://github.com/donnemartin/system-design-primer#security) .
|
||||
|
||||
### Latency numbers
|
||||
|
||||
See [Latency numbers every programmer should know](https://github.com/donnemartin/system-design-primer#latency-numbers-every-programmer-should-know).
|
||||
See [Latency numbers every programmer should know](https://github.com/donnemartin/system-design-primer#latency-numbers-every-programmer-should-know) .
|
||||
|
||||
### Ongoing
|
||||
|
||||
|
@@ -1,59 +1,59 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
|
||||
|
||||
class QueryApi(object):
|
||||
class QueryApi(object) :
|
||||
|
||||
def __init__(self, memory_cache, reverse_index_cluster):
|
||||
def __init__(self, memory_cache, reverse_index_cluster) :
|
||||
self.memory_cache = memory_cache
|
||||
self.reverse_index_cluster = reverse_index_cluster
|
||||
|
||||
def parse_query(self, query):
|
||||
def parse_query(self, query) :
|
||||
"""Remove markup, break text into terms, deal with typos,
|
||||
normalize capitalization, convert to use boolean operations.
|
||||
"""
|
||||
...
|
||||
|
||||
def process_query(self, query):
|
||||
query = self.parse_query(query)
|
||||
results = self.memory_cache.get(query)
|
||||
def process_query(self, query) :
|
||||
query = self.parse_query(query)
|
||||
results = self.memory_cache.get(query)
|
||||
if results is None:
|
||||
results = self.reverse_index_cluster.process_search(query)
|
||||
self.memory_cache.set(query, results)
|
||||
results = self.reverse_index_cluster.process_search(query)
|
||||
self.memory_cache.set(query, results)
|
||||
return results
|
||||
|
||||
|
||||
class Node(object):
|
||||
class Node(object) :
|
||||
|
||||
def __init__(self, query, results):
|
||||
def __init__(self, query, results) :
|
||||
self.query = query
|
||||
self.results = results
|
||||
|
||||
|
||||
class LinkedList(object):
|
||||
class LinkedList(object) :
|
||||
|
||||
def __init__(self):
|
||||
def __init__(self) :
|
||||
self.head = None
|
||||
self.tail = None
|
||||
|
||||
def move_to_front(self, node):
|
||||
def move_to_front(self, node) :
|
||||
...
|
||||
|
||||
def append_to_front(self, node):
|
||||
def append_to_front(self, node) :
|
||||
...
|
||||
|
||||
def remove_from_tail(self):
|
||||
def remove_from_tail(self) :
|
||||
...
|
||||
|
||||
|
||||
class Cache(object):
|
||||
class Cache(object) :
|
||||
|
||||
def __init__(self, MAX_SIZE):
|
||||
def __init__(self, MAX_SIZE) :
|
||||
self.MAX_SIZE = MAX_SIZE
|
||||
self.size = 0
|
||||
self.lookup = {}
|
||||
self.linked_list = LinkedList()
|
||||
self.linked_list = LinkedList()
|
||||
|
||||
def get(self, query):
|
||||
def get(self, query) :
|
||||
"""Get the stored query result from the cache.
|
||||
|
||||
Accessing a node updates its position to the front of the LRU list.
|
||||
@@ -61,10 +61,10 @@ class Cache(object):
|
||||
node = self.lookup.get(query)
|
||||
if node is None:
|
||||
return None
|
||||
self.linked_list.move_to_front(node)
|
||||
self.linked_list.move_to_front(node)
|
||||
return node.results
|
||||
|
||||
def set(self, query, results):
|
||||
def set(self, query, results) :
|
||||
"""Set the result for the given query key in the cache.
|
||||
|
||||
When updating an entry, updates its position to the front of the LRU list.
|
||||
@@ -75,16 +75,16 @@ class Cache(object):
|
||||
if node is not None:
|
||||
# Key exists in cache, update the value
|
||||
node.results = results
|
||||
self.linked_list.move_to_front(node)
|
||||
self.linked_list.move_to_front(node)
|
||||
else:
|
||||
# Key does not exist in cache
|
||||
if self.size == self.MAX_SIZE:
|
||||
# Remove the oldest entry from the linked list and lookup
|
||||
self.lookup.pop(self.linked_list.tail.query, None)
|
||||
self.linked_list.remove_from_tail()
|
||||
self.lookup.pop(self.linked_list.tail.query, None)
|
||||
self.linked_list.remove_from_tail()
|
||||
else:
|
||||
self.size += 1
|
||||
# Add the new key and value
|
||||
new_node = Node(query, results)
|
||||
self.linked_list.append_to_front(new_node)
|
||||
new_node = Node(query, results)
|
||||
self.linked_list.append_to_front(new_node)
|
||||
self.lookup[query] = new_node
|
||||
|
@@ -1,6 +1,6 @@
|
||||
# 为 Amazon 设计分类售卖排行
|
||||
|
||||
**注意:这个文档中的链接会直接指向[系统设计主题索引](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#系统设计主题的索引)中的有关部分,以避免重复的内容。你可以参考链接的相关内容,来了解其总的要点、方案的权衡取舍以及可选的替代方案。**
|
||||
**注意:这个文档中的链接会直接指向[系统设计主题索引](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#系统设计主题的索引) 中的有关部分,以避免重复的内容。你可以参考链接的相关内容,来了解其总的要点、方案的权衡取舍以及可选的替代方案。**
|
||||
|
||||
## 第一步:简述用例与约束条件
|
||||
|
||||
@@ -70,7 +70,7 @@
|
||||
|
||||
> 列出所有重要组件以规划概要设计。
|
||||
|
||||

|
||||

|
||||
|
||||
## 第三步:设计核心组件
|
||||
|
||||
@@ -95,94 +95,94 @@ t5 product4 category1 1 5.00 5 6
|
||||
...
|
||||
```
|
||||
|
||||
**售卖排行服务** 需要用到 **MapReduce**,并使用 **售卖 API** 服务进行日志记录,同时将结果写入 **SQL 数据库**中的总表 `sales_rank` 中。我们也可以讨论一下[究竟是用 SQL 还是用 NoSQL](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#sql-还是-nosql)。
|
||||
**售卖排行服务** 需要用到 **MapReduce**,并使用 **售卖 API** 服务进行日志记录,同时将结果写入 **SQL 数据库**中的总表 `sales_rank` 中。我们也可以讨论一下[究竟是用 SQL 还是用 NoSQL](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#sql-还是-nosql) 。
|
||||
|
||||
我们需要通过以下步骤使用 **MapReduce**:
|
||||
|
||||
* **第 1 步** - 将数据转换为 `(category, product_id), sum(quantity)` 的形式
|
||||
* **第 1 步** - 将数据转换为 `(category, product_id) , sum(quantity) ` 的形式
|
||||
* **第 2 步** - 执行分布式排序
|
||||
|
||||
```python
|
||||
class SalesRanker(MRJob):
|
||||
class SalesRanker(MRJob) :
|
||||
|
||||
def within_past_week(self, timestamp):
|
||||
def within_past_week(self, timestamp) :
|
||||
"""如果时间戳属于过去的一周则返回 True,
|
||||
否则返回 False。"""
|
||||
...
|
||||
|
||||
def mapper(self, _, line):
|
||||
def mapper(self, _, line) :
|
||||
"""解析日志的每一行,提取并转换相关行,
|
||||
|
||||
将键值对设定为如下形式:
|
||||
|
||||
(category1, product1), 2
|
||||
(category2, product1), 2
|
||||
(category2, product1), 1
|
||||
(category1, product2), 3
|
||||
(category2, product3), 7
|
||||
(category1, product4), 1
|
||||
(category1, product1) , 2
|
||||
(category2, product1) , 2
|
||||
(category2, product1) , 1
|
||||
(category1, product2) , 3
|
||||
(category2, product3) , 7
|
||||
(category1, product4) , 1
|
||||
"""
|
||||
timestamp, product_id, category_id, quantity, total_price, seller_id, \
|
||||
buyer_id = line.split('\t')
|
||||
if self.within_past_week(timestamp):
|
||||
yield (category_id, product_id), quantity
|
||||
buyer_id = line.split('\t')
|
||||
if self.within_past_week(timestamp) :
|
||||
yield (category_id, product_id) , quantity
|
||||
|
||||
def reducer(self, key, values):
|
||||
def reducer(self, key, values) :
|
||||
"""将每个 key 的值加起来。
|
||||
|
||||
(category1, product1), 2
|
||||
(category2, product1), 3
|
||||
(category1, product2), 3
|
||||
(category2, product3), 7
|
||||
(category1, product4), 1
|
||||
(category1, product1) , 2
|
||||
(category2, product1) , 3
|
||||
(category1, product2) , 3
|
||||
(category2, product3) , 7
|
||||
(category1, product4) , 1
|
||||
"""
|
||||
yield key, sum(values)
|
||||
yield key, sum(values)
|
||||
|
||||
def mapper_sort(self, key, value):
|
||||
def mapper_sort(self, key, value) :
|
||||
"""构造 key 以确保正确的排序。
|
||||
|
||||
将键值对转换成如下形式:
|
||||
|
||||
(category1, 2), product1
|
||||
(category2, 3), product1
|
||||
(category1, 3), product2
|
||||
(category2, 7), product3
|
||||
(category1, 1), product4
|
||||
(category1, 2) , product1
|
||||
(category2, 3) , product1
|
||||
(category1, 3) , product2
|
||||
(category2, 7) , product3
|
||||
(category1, 1) , product4
|
||||
|
||||
MapReduce 的 shuffle/sort 步骤会对键进行
|
||||
分布式排序,结果如下:
|
||||
|
||||
(category1, 1), product4
|
||||
(category1, 2), product1
|
||||
(category1, 3), product2
|
||||
(category2, 3), product1
|
||||
(category2, 7), product3
|
||||
(category1, 1) , product4
|
||||
(category1, 2) , product1
|
||||
(category1, 3) , product2
|
||||
(category2, 3) , product1
|
||||
(category2, 7) , product3
|
||||
"""
|
||||
category_id, product_id = key
|
||||
quantity = value
|
||||
yield (category_id, quantity), product_id
|
||||
yield (category_id, quantity) , product_id
|
||||
|
||||
def reducer_identity(self, key, value):
|
||||
def reducer_identity(self, key, value) :
|
||||
yield key, value
|
||||
|
||||
def steps(self):
|
||||
def steps(self) :
|
||||
""" 此处为 map reduce 步骤"""
|
||||
return [
|
||||
self.mr(mapper=self.mapper,
|
||||
reducer=self.reducer),
|
||||
reducer=self.reducer) ,
|
||||
self.mr(mapper=self.mapper_sort,
|
||||
reducer=self.reducer_identity),
|
||||
reducer=self.reducer_identity) ,
|
||||
]
|
||||
```
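
The two steps can also be traced in plain Python without a MapReduce framework. The sketch below is a single-process simulation, not the mrjob job itself: `run_step` and the sample log lines are hypothetical, every sample line is assumed to fall within the past week, and `quantity` is converted to `int` so that it can be summed and sorted numerically.

```python
from collections import defaultdict

# Tab-separated: timestamp, product_id, category_id, quantity,
# total_price, seller_id, buyer_id (values invented for illustration)
SAMPLE_LOG = [
    "t1\tproduct1\tcategory1\t2\t20.00\t1\t1",
    "t2\tproduct1\tcategory2\t2\t20.00\t2\t2",
    "t2\tproduct1\tcategory2\t1\t10.00\t2\t3",
    "t3\tproduct2\tcategory1\t3\t7.00\t3\t4",
    "t4\tproduct3\tcategory2\t7\t2.00\t4\t5",
    "t5\tproduct4\tcategory1\t1\t5.00\t5\t6",
]


def mapper(_, line):
    fields = line.split("\t")
    product_id, category_id, quantity = fields[1], fields[2], fields[3]
    yield (category_id, product_id), int(quantity)


def reducer(key, values):
    yield key, sum(values)


def mapper_sort(key, value):
    category_id, product_id = key
    yield (category_id, value), product_id


def reducer_identity(key, values):
    for value in values:
        yield key, value


def run_step(pairs, map_fn, reduce_fn):
    """Map every pair, group by key, then reduce the groups in sorted key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    output = []
    for out_key in sorted(groups):  # stands in for the shuffle/sort phase
        output.extend(reduce_fn(out_key, groups[out_key]))
    return output


totals = run_step([(None, line) for line in SAMPLE_LOG], mapper, reducer)
ranked = run_step(totals, mapper_sort, reducer_identity)
```

Here `ranked` is ordered by category and then by quantity sold, ascending, so reading the most popular products for a category means taking entries from the end of that category's run.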
|
||||
|
||||
得到的结果将会是如下的排序列,我们将其插入 `sales_rank` 表中:
|
||||
|
||||
```
|
||||
(category1, 1), product4
|
||||
(category1, 2), product1
|
||||
(category1, 3), product2
|
||||
(category2, 3), product1
|
||||
(category2, 7), product3
|
||||
(category1, 1) , product4
|
||||
(category1, 2) , product1
|
||||
(category1, 3) , product2
|
||||
(category2, 3) , product1
|
||||
(category2, 7) , product3
|
||||
```
|
||||
|
||||
`sales_rank` 表的数据结构如下:
|
||||
@@ -192,20 +192,20 @@ id int NOT NULL AUTO_INCREMENT
|
||||
category_id int NOT NULL
|
||||
total_sold int NOT NULL
|
||||
product_id int NOT NULL
|
||||
PRIMARY KEY(id)
|
||||
FOREIGN KEY(category_id) REFERENCES Categories(id)
|
||||
FOREIGN KEY(product_id) REFERENCES Products(id)
|
||||
PRIMARY KEY(id)
|
||||
FOREIGN KEY(category_id) REFERENCES Categories(id)
|
||||
FOREIGN KEY(product_id) REFERENCES Products(id)
|
||||
```
|
||||
|
||||
我们会以 `id`、`category_id` 与 `product_id` 创建一个 [索引](https://github.com/donnemartin/system-design-primer#use-good-indices)以加快查询速度(查找只需对数时间,不必每次都扫描整个数据表)并让数据常驻内存。从内存读取 1 MB 连续数据大约要花 250 微秒,而从 SSD 读取同样大小的数据要花费 4 倍的时间,从机械硬盘读取需要花费 80 倍以上的时间。<sup><a href=https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#每个程序员都应该知道的延迟数>1</a></sup>
|
||||
我们会以 `id`、`category_id` 与 `product_id` 创建一个 [索引](https://github.com/donnemartin/system-design-primer#use-good-indices) 以加快查询速度(查找只需对数时间,不必每次都扫描整个数据表)并让数据常驻内存。从内存读取 1 MB 连续数据大约要花 250 微秒,而从 SSD 读取同样大小的数据要花费 4 倍的时间,从机械硬盘读取需要花费 80 倍以上的时间。<sup><a href=https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#每个程序员都应该知道的延迟数>1</a></sup>
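
As a small illustration of how an index supports the per-category lookup, the sketch below builds a `sales_rank` table in an in-memory SQLite database. SQLite stands in for the production SQL database, the composite index on `(category_id, total_sold)` and its name are assumptions chosen to fit this particular query, the foreign keys are omitted, and the rows are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales_rank (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        category_id INTEGER NOT NULL,
        total_sold INTEGER NOT NULL,
        product_id INTEGER NOT NULL
    );
    -- Hypothetical index: equality on category_id, then ordered by total_sold
    CREATE INDEX idx_sales_rank_category_total
        ON sales_rank (category_id, total_sold);
    """
)

# (category_id, total_sold, product_id)
rows = [(1, 1, 4), (1, 2, 1), (1, 3, 2), (2, 3, 1), (2, 7, 3)]
conn.executemany(
    "INSERT INTO sales_rank (category_id, total_sold, product_id) VALUES (?, ?, ?)",
    rows,
)

QUERY = (
    "SELECT product_id, total_sold FROM sales_rank "
    "WHERE category_id = ? ORDER BY total_sold DESC"
)


def popular_products(category_id):
    """Return (product_id, total_sold) rows for a category, best sellers first."""
    return conn.execute(QUERY, (category_id,)).fetchall()


def query_plan(category_id):
    """Return the human-readable detail column of SQLite's query plan."""
    plan = conn.execute("EXPLAIN QUERY PLAN " + QUERY, (category_id,)).fetchall()
    return [row[3] for row in plan]
```

Calling `query_plan(1)` shows whether the planner searches the index rather than scanning the whole table, which is the effect the paragraph above is after.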
|
||||
|
||||
### Use case: User views the past week's most popular products by category
|
||||
|
||||
* The **Client** sends a request to the **Web Server**, running as a [reverse proxy](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#反向代理web-服务器)
* The **Client** sends a request to the **Web Server**, running as a [reverse proxy](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#反向代理web-服务器)
* The **Web Server** forwards the request to the **Read API** server
* The **Read API** server reads from the **SQL Database** `sales_rank` table
|
||||
|
||||
We'll use a public [REST API](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#表述性状态转移rest):
We'll use a public [REST API](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#表述性状态转移rest) :
|
||||
|
||||
```
|
||||
$ curl https://amazon.com/api/v1/popular?category_id=1234
|
||||
@@ -234,13 +234,13 @@ $ curl https://amazon.com/api/v1/popular?category_id=1234
|
||||
},
|
||||
```
|
||||
|
||||
For internal communications, we could use [RPC](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#远程过程调用协议rpc).
For internal communications, we could use [RPC](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#远程过程调用协议rpc) .
|
||||
|
||||
## Step 4: Scale the design
|
||||
|
||||
> Identify and address bottlenecks, given the constraints.
|
||||
|
||||

|
||||

|
||||
|
||||
**Important: Do not simply jump right into the final design from the initial design!**
|
||||
|
||||
@@ -250,19 +250,19 @@ $ curl https://amazon.com/api/v1/popular?category_id=1234
|
||||
|
||||
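We'll introduce some components to complete the design and to address scalability issues. Internal load balancers are not discussed, to save space.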
|
||||
|
||||
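**To avoid repeating discussions**, refer to the relevant sections of the [system design topics index](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#系统设计主题的索引) for main talking points, tradeoffs, and alternatives.
**To avoid repeating discussions**, refer to the relevant sections of the [system design topics index](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#系统设计主题的索引) for main talking points, tradeoffs, and alternatives.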
|
||||
|
||||
* [DNS](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#域名系统)
* [Load balancer](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#负载均衡器)
* [Horizontal scaling](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#水平扩展)
* [Reverse proxy (web server)](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#反向代理web-服务器)
* [API server (application layer)](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#应用层)
* [Cache](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#缓存)
* [Relational database management system (RDBMS)](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#关系型数据库管理系统rdbms)
* [SQL write master-slave failover](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#故障切换)
* [Master-slave replication](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#主从复制)
* [Consistency patterns](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#一致性模式)
* [Availability patterns](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#可用性模式)
* [DNS](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#域名系统)
* [Load balancer](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#负载均衡器)
* [Horizontal scaling](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#水平扩展)
* [Reverse proxy (web server)](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#反向代理web-服务器)
* [API server (application layer)](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#应用层)
* [Cache](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#缓存)
* [Relational database management system (RDBMS) ](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#关系型数据库管理系统rdbms)
* [SQL write master-slave failover](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#故障切换)
* [Master-slave replication](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#主从复制)
* [Consistency patterns](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#一致性模式)
* [Availability patterns](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#可用性模式)
|
||||
|
||||
The **Analytics Database** could use an off-the-shelf data warehousing solution such as Amazon Redshift or Google BigQuery.
|
||||
|
||||
@@ -274,10 +274,10 @@ $ curl https://amazon.com/api/v1/popular?category_id=1234
|
||||
|
||||
SQL scaling patterns include:
|
||||
|
||||
* [Federation](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#联合)
* [Sharding](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#分片)
* [Denormalization](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#非规范化)
* [SQL Tuning](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#sql-调优)
* [Federation](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#联合)
* [Sharding](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#分片)
* [Denormalization](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#非规范化)
* [SQL Tuning](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#sql-调优)
|
||||
|
||||
We should also consider moving some data to a **NoSQL Database**.
|
||||
|
||||
@@ -287,50 +287,50 @@ SQL 缩放模式包括:
|
||||
|
||||
#### NoSQL
|
||||
|
||||
* [Key-value store](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#键-值存储)
* [Document store](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#文档类型存储)
* [Wide column store](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#列型存储)
* [Graph database](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#图数据库)
* [SQL vs NoSQL](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#sql-还是-nosql)
* [Key-value store](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#键-值存储)
* [Document store](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#文档类型存储)
* [Wide column store](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#列型存储)
* [Graph database](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#图数据库)
* [SQL vs NoSQL](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#sql-还是-nosql)
|
||||
|
||||
### Caching
|
||||
|
||||
* Where to cache
* [Client caching](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#客户端缓存)
* [CDN caching](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#cdn-缓存)
* [Web server caching](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#web-服务器缓存)
* [Database caching](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#数据库缓存)
* [Application caching](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#应用缓存)
* [Client caching](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#客户端缓存)
* [CDN caching](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#cdn-缓存)
* [Web server caching](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#web-服务器缓存)
* [Database caching](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#数据库缓存)
* [Application caching](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#应用缓存)
* What to cache
* [Caching at the database query level](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#数据库查询级别的缓存)
* [Caching at the object level](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#对象级别的缓存)
* [Caching at the database query level](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#数据库查询级别的缓存)
* [Caching at the object level](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#对象级别的缓存)
* When to update the cache
* [Cache-aside](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#缓存模式)
* [Write-through](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#直写模式)
* [Write-behind (write-back)](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#回写模式)
* [Refresh ahead](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#刷新)
* [Cache-aside](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#缓存模式)
* [Write-through](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#直写模式)
* [Write-behind (write-back)](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#回写模式)
* [Refresh ahead](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#刷新)
|
||||
|
||||
### Asynchronism and microservices
|
||||
|
||||
* [Message queues](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#消息队列)
* [Task queues](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#任务队列)
* [Back pressure](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#背压)
* [Microservices](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#微服务)
* [Message queues](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#消息队列)
* [Task queues](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#任务队列)
* [Back pressure](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#背压)
* [Microservices](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#微服务)
|
||||
|
||||
### Communications
|
||||
|
||||
* Discuss tradeoffs:
* External communication with clients - [HTTP APIs following REST](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#表述性状态转移rest)
* Internal communications - [RPC](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#远程过程调用协议rpc)
* [Service discovery](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#服务发现)
* External communication with clients - [HTTP APIs following REST](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#表述性状态转移rest)
* Internal communications - [RPC](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#远程过程调用协议rpc)
* [Service discovery](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#服务发现)
|
||||
|
||||
### Security
|
||||
|
||||
Refer to the [security section](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#安全).
Refer to the [security section](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#安全) .
|
||||
|
||||
### Latency numbers
|
||||
|
||||
See [Latency numbers every programmer should know](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#每个程序员都应该知道的延迟数).
See [Latency numbers every programmer should know](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#每个程序员都应该知道的延迟数) .
|
||||
|
||||
### Ongoing
|
||||
|
||||
|
@@ -70,7 +70,7 @@ Handy conversion guide:
|
||||
|
||||
> Outline a high level design with all important components.
|
||||
|
||||

|
||||

|
||||
|
||||
## Step 3: Design core components
|
||||
|
||||
@@ -95,93 +95,93 @@ t5 product4 category1 1 5.00 5 6
|
||||
...
|
||||
```
|
||||
|
||||
The **Sales Rank Service** could use **MapReduce**, using the **Sales API** server log files as input and writing the results to an aggregate table `sales_rank` in a **SQL Database**. We should discuss the [use cases and tradeoffs between choosing SQL or NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql).
|
||||
The **Sales Rank Service** could use **MapReduce**, using the **Sales API** server log files as input and writing the results to an aggregate table `sales_rank` in a **SQL Database**. We should discuss the [use cases and tradeoffs between choosing SQL or NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql) .
|
||||
|
||||
We'll use a multi-step **MapReduce**:
|
||||
|
||||
* **Step 1** - Transform the data to `(category, product_id), sum(quantity)`
|
||||
* **Step 1** - Transform the data to `(category, product_id) , sum(quantity) `
|
||||
* **Step 2** - Perform a distributed sort
|
||||
|
||||
```python
|
||||
class SalesRanker(MRJob):
|
||||
class SalesRanker(MRJob) :
|
||||
|
||||
def within_past_week(self, timestamp):
|
||||
def within_past_week(self, timestamp) :
|
||||
"""Return True if timestamp is within past week, False otherwise."""
|
||||
...
|
||||
|
||||
def mapper(self, _, line):
|
||||
def mapper(self, _, line) :
|
||||
"""Parse each log line, extract and transform relevant lines.
|
||||
|
||||
Emit key value pairs of the form:
|
||||
|
||||
(category1, product1), 2
|
||||
(category2, product1), 2
|
||||
(category2, product1), 1
|
||||
(category1, product2), 3
|
||||
(category2, product3), 7
|
||||
(category1, product4), 1
|
||||
(category1, product1) , 2
|
||||
(category2, product1) , 2
|
||||
(category2, product1) , 1
|
||||
(category1, product2) , 3
|
||||
(category2, product3) , 7
|
||||
(category1, product4) , 1
|
||||
"""
|
||||
timestamp, product_id, category_id, quantity, total_price, seller_id, \
|
||||
buyer_id = line.split('\t')
|
||||
if self.within_past_week(timestamp):
|
||||
yield (category_id, product_id), int(quantity)
|
||||
buyer_id = line.split('\t')
|
||||
if self.within_past_week(timestamp) :
|
||||
yield (category_id, product_id) , int(quantity)
|
||||
|
||||
def reducer(self, key, values):
|
||||
def reducer(self, key, values) :
|
||||
"""Sum values for each key.
|
||||
|
||||
(category1, product1), 2
|
||||
(category2, product1), 3
|
||||
(category1, product2), 3
|
||||
(category2, product3), 7
|
||||
(category1, product4), 1
|
||||
(category1, product1) , 2
|
||||
(category2, product1) , 3
|
||||
(category1, product2) , 3
|
||||
(category2, product3) , 7
|
||||
(category1, product4) , 1
|
||||
"""
|
||||
yield key, sum(values)
|
||||
yield key, sum(values)
|
||||
|
||||
def mapper_sort(self, key, value):
|
||||
def mapper_sort(self, key, value) :
|
||||
"""Construct key to ensure proper sorting.
|
||||
|
||||
Transform key and value to the form:
|
||||
|
||||
(category1, 2), product1
|
||||
(category2, 3), product1
|
||||
(category1, 3), product2
|
||||
(category2, 7), product3
|
||||
(category1, 1), product4
|
||||
(category1, 2) , product1
|
||||
(category2, 3) , product1
|
||||
(category1, 3) , product2
|
||||
(category2, 7) , product3
|
||||
(category1, 1) , product4
|
||||
|
||||
The shuffle/sort step of MapReduce will then do a
|
||||
distributed sort on the keys, resulting in:
|
||||
|
||||
(category1, 1), product4
|
||||
(category1, 2), product1
|
||||
(category1, 3), product2
|
||||
(category2, 3), product1
|
||||
(category2, 7), product3
|
||||
(category1, 1) , product4
|
||||
(category1, 2) , product1
|
||||
(category1, 3) , product2
|
||||
(category2, 3) , product1
|
||||
(category2, 7) , product3
|
||||
"""
|
||||
category_id, product_id = key
|
||||
quantity = value
|
||||
yield (category_id, quantity), product_id
|
||||
yield (category_id, quantity) , product_id
|
||||
|
||||
def reducer_identity(self, key, value):
|
||||
def reducer_identity(self, key, value) :
|
||||
yield key, value
|
||||
|
||||
def steps(self):
|
||||
def steps(self) :
|
||||
"""Run the map and reduce steps."""
|
||||
return [
|
||||
self.mr(mapper=self.mapper,
|
||||
reducer=self.reducer),
|
||||
reducer=self.reducer) ,
|
||||
self.mr(mapper=self.mapper_sort,
|
||||
reducer=self.reducer_identity),
|
||||
reducer=self.reducer_identity) ,
|
||||
]
|
||||
```
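The two steps above can be traced locally without a MapReduce framework. The following is a minimal sketch in plain Python, not the mrjob job itself: the log lines are hypothetical samples chosen to match the docstring examples, the helper names (`map_sales`, `reduce_sum`, `map_sort_key`, `rank_sales`) are made up for illustration, `sorted()` stands in for the shuffle/sort phase, and the `within_past_week` filter is left out.

```python
from collections import defaultdict

# Hypothetical tab-separated log lines with the fields:
# timestamp, product_id, category_id, quantity, total_price, seller_id, buyer_id
LOG_LINES = [
    "t1\tproduct1\tcategory1\t2\t20.00\t1\t1",
    "t2\tproduct1\tcategory2\t2\t20.00\t2\t2",
    "t2\tproduct1\tcategory2\t1\t10.00\t2\t3",
    "t3\tproduct2\tcategory1\t3\t7.00\t3\t4",
    "t4\tproduct3\tcategory2\t7\t2.00\t4\t5",
    "t5\tproduct4\tcategory1\t1\t5.00\t5\t6",
]


def map_sales(line):
    """Step 1 mapper: emit ((category_id, product_id), quantity)."""
    _ts, product_id, category_id, quantity, _price, _seller, _buyer = line.split("\t")
    yield (category_id, product_id), int(quantity)


def reduce_sum(pairs):
    """Step 1 reducer: sum the quantities for each key."""
    totals = defaultdict(int)
    for key, quantity in pairs:
        totals[key] += quantity
    return totals.items()


def map_sort_key(key, total):
    """Step 2 mapper: move the total into the key so the sort orders by it."""
    category_id, product_id = key
    return (category_id, total), product_id


def rank_sales(lines):
    """Run both steps in-process and return the sorted (key, product_id) pairs."""
    mapped = (pair for line in lines for pair in map_sales(line))
    rekeyed = [map_sort_key(key, total) for key, total in reduce_sum(mapped)]
    # sorted() stands in for the framework's distributed shuffle/sort.
    return sorted(rekeyed)


if __name__ == "__main__":
    for key, product_id in rank_sales(LOG_LINES):
        print(key, product_id)
```

Within each category the pairs come out in ascending order of units sold, which is the ordering shown in the `mapper_sort` docstring.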
|
||||
|
||||
The result would be the following sorted list, which we could insert into the `sales_rank` table:
|
||||
|
||||
```
|
||||
(category1, 1), product4
|
||||
(category1, 2), product1
|
||||
(category1, 3), product2
|
||||
(category2, 3), product1
|
||||
(category2, 7), product3
|
||||
(category1, 1) , product4
|
||||
(category1, 2) , product1
|
||||
(category1, 3) , product2
|
||||
(category2, 3) , product1
|
||||
(category2, 7) , product3
|
||||
```
|
||||
|
||||
The `sales_rank` table could have the following structure:
|
||||
@@ -191,20 +191,20 @@ id int NOT NULL AUTO_INCREMENT
|
||||
category_id int NOT NULL
|
||||
total_sold int NOT NULL
|
||||
product_id int NOT NULL
|
||||
PRIMARY KEY(id)
|
||||
FOREIGN KEY(category_id) REFERENCES Categories(id)
|
||||
FOREIGN KEY(product_id) REFERENCES Products(id)
|
||||
PRIMARY KEY(id)
|
||||
FOREIGN KEY(category_id) REFERENCES Categories(id)
|
||||
FOREIGN KEY(product_id) REFERENCES Products(id)
|
||||
```
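As a rough illustration of loading the sorted output into this table, the sketch below uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for the **SQL Database**. The integer ids in `RANKED` are invented, and the foreign keys are omitted because the `Categories` and `Products` tables are not defined here.

```python
import sqlite3

# Rows in the ((category_id, total_sold), product_id) shape produced by the
# sort step; the integer ids are made up for illustration.
RANKED = [((1, 1), 4), ((1, 2), 1), ((1, 3), 2), ((2, 3), 1), ((2, 7), 3)]

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sales_rank (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        category_id INTEGER NOT NULL,
        total_sold INTEGER NOT NULL,
        product_id INTEGER NOT NULL
    )
    """
)
# Flatten each ((category_id, total_sold), product_id) pair into one row.
conn.executemany(
    "INSERT INTO sales_rank (category_id, total_sold, product_id) VALUES (?, ?, ?)",
    [(category_id, total_sold, product_id)
     for (category_id, total_sold), product_id in RANKED],
)
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM sales_rank").fetchone()[0]
print(row_count)
```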
|
||||
|
||||
We'll create an [index](https://github.com/donnemartin/system-design-primer#use-good-indices) on `id`, `category_id`, and `product_id` to speed up lookups (log-time instead of scanning the entire table) and to keep the data in memory. Reading 1 MB sequentially from memory takes about 250 microseconds, while reading from SSD takes 4x and from disk takes 80x longer.<sup><a href=https://github.com/donnemartin/system-design-primer#latency-numbers-every-programmer-should-know>1</a></sup>
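To make these figures concrete, here is a small back-of-the-envelope calculation. The 250 microsecond figure and the 4x and 80x factors come from the text; the row count of one million is an assumption for illustration, and the binary-search step count is only a rough proxy for a real index lookup.

```python
import math

MEMORY_US_PER_MB = 250  # from the text: ~250 microseconds per 1 MB read from memory
SSD_FACTOR = 4          # SSD takes about 4x as long
DISK_FACTOR = 80        # spinning disk takes about 80x as long


def read_time_us(megabytes, factor=1):
    """Estimated sequential read time in microseconds."""
    return megabytes * MEMORY_US_PER_MB * factor


memory_us = read_time_us(1)
ssd_us = read_time_us(1, SSD_FACTOR)
disk_us = read_time_us(1, DISK_FACTOR)
print(memory_us, ssd_us, disk_us)  # 250 1000 20000

# Log-time lookup versus a full scan, for an assumed table of one million rows.
rows = 1_000_000
index_steps = math.ceil(math.log2(rows))
print(index_steps, rows)  # 20 1000000
```

In other words, 1 MB costs roughly 0.25 ms from memory, 1 ms from SSD, and 20 ms from disk, and an indexed lookup touches on the order of 20 entries where a scan touches up to a million.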
|
||||
|
||||
### Use case: User views the past week's most popular products by category
|
||||
|
||||
* The **Client** sends a request to the **Web Server**, running as a [reverse proxy](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
|
||||
* The **Client** sends a request to the **Web Server**, running as a [reverse proxy](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
|
||||
* The **Web Server** forwards the request to the **Read API** server
|
||||
* The **Read API** server reads from the **SQL Database** `sales_rank` table
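The read path in the last bullet might look roughly like the sketch below. It assumes the `sales_rank` columns described earlier and uses an in-memory `sqlite3` database with made-up rows in place of the real **SQL Database**; `get_popular` and its `limit` parameter are hypothetical names, not part of the original design.

```python
import sqlite3


def get_popular(conn, category_id, limit=10):
    """Return the best-selling products for one category, highest first."""
    rows = conn.execute(
        "SELECT product_id, total_sold FROM sales_rank "
        "WHERE category_id = ? ORDER BY total_sold DESC LIMIT ?",
        (category_id, limit),
    ).fetchall()
    return [{"product_id": pid, "total_sold": sold} for pid, sold in rows]


# In-memory stand-in for the SQL Database, populated with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_rank (category_id INTEGER, total_sold INTEGER, product_id INTEGER)"
)
conn.executemany(
    "INSERT INTO sales_rank VALUES (?, ?, ?)",
    [(1, 1, 4), (1, 2, 1), (1, 3, 2), (2, 7, 3)],
)

popular = get_popular(conn, category_id=1, limit=2)
print(popular)
```

Sorting by `total_sold` in descending order at query time keeps the table independent of the order in which the batch job inserted its rows.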
|
||||
|
||||
We'll use a public [**REST API**](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest):
|
||||
We'll use a public [**REST API**](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest) :
|
||||
|
||||
```
|
||||
$ curl https://amazon.com/api/v1/popular?category_id=1234
|
||||
@@ -233,13 +233,13 @@ Response:
|
||||
},
|
||||
```
|
||||
|
||||
For internal communications, we could use [Remote Procedure Calls](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc).
|
||||
For internal communications, we could use [Remote Procedure Calls](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc) .
|
||||
|
||||
## Step 4: Scale the design
|
||||
|
||||
> Identify and address bottlenecks, given the constraints.
|
||||
|
||||

|
||||

|
||||
|
||||
**Important: Do not simply jump right into the final design from the initial design!**
|
||||
|
||||
@@ -251,33 +251,33 @@ We'll introduce some components to complete the design and to address scalabilit
|
||||
|
||||
*To avoid repeating discussions*, refer to the following [system design topics](https://github.com/donnemartin/system-design-primer#index-of-system-design-topics) for main talking points, tradeoffs, and alternatives:
|
||||
|
||||
* [DNS](https://github.com/donnemartin/system-design-primer#domain-name-system)
|
||||
* [CDN](https://github.com/donnemartin/system-design-primer#content-delivery-network)
|
||||
* [Load balancer](https://github.com/donnemartin/system-design-primer#load-balancer)
|
||||
* [Horizontal scaling](https://github.com/donnemartin/system-design-primer#horizontal-scaling)
|
||||
* [Web server (reverse proxy)](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
|
||||
* [API server (application layer)](https://github.com/donnemartin/system-design-primer#application-layer)
|
||||
* [Cache](https://github.com/donnemartin/system-design-primer#cache)
|
||||
* [Relational database management system (RDBMS)](https://github.com/donnemartin/system-design-primer#relational-database-management-system-rdbms)
|
||||
* [SQL write master-slave failover](https://github.com/donnemartin/system-design-primer#fail-over)
|
||||
* [Master-slave replication](https://github.com/donnemartin/system-design-primer#master-slave-replication)
|
||||
* [Consistency patterns](https://github.com/donnemartin/system-design-primer#consistency-patterns)
|
||||
* [Availability patterns](https://github.com/donnemartin/system-design-primer#availability-patterns)
|
||||
* [DNS](https://github.com/donnemartin/system-design-primer#domain-name-system)
|
||||
* [CDN](https://github.com/donnemartin/system-design-primer#content-delivery-network)
|
||||
* [Load balancer](https://github.com/donnemartin/system-design-primer#load-balancer)
|
||||
* [Horizontal scaling](https://github.com/donnemartin/system-design-primer#horizontal-scaling)
|
||||
* [Web server (reverse proxy) ](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
|
||||
* [API server (application layer) ](https://github.com/donnemartin/system-design-primer#application-layer)
|
||||
* [Cache](https://github.com/donnemartin/system-design-primer#cache)
|
||||
* [Relational database management system (RDBMS) ](https://github.com/donnemartin/system-design-primer#relational-database-management-system-rdbms)
|
||||
* [SQL write master-slave failover](https://github.com/donnemartin/system-design-primer#fail-over)
|
||||
* [Master-slave replication](https://github.com/donnemartin/system-design-primer#master-slave-replication)
|
||||
* [Consistency patterns](https://github.com/donnemartin/system-design-primer#consistency-patterns)
|
||||
* [Availability patterns](https://github.com/donnemartin/system-design-primer#availability-patterns)
|
||||
|
||||
The **Analytics Database** could use a data warehousing solution such as Amazon Redshift or Google BigQuery.
|
||||
|
||||
We might only want to store a limited time period of data in the database, while storing the rest in a data warehouse or in an **Object Store**. An **Object Store** such as Amazon S3 can comfortably handle the constraint of 40 GB of new content per month.
|
||||
|
||||
To address the 40,000 *average* read requests per second (higher at peak), traffic for popular content (and their sales rank) should be handled by the **Memory Cache** instead of the database. The **Memory Cache** is also useful for handling the unevenly distributed traffic and traffic spikes. With the large volume of reads, the **SQL Read Replicas** might not be able to handle the cache misses. We'll probably need to employ additional SQL scaling patterns.
|
||||
To address the 40,000 *average* read requests per second (higher at peak) , traffic for popular content (and their sales rank) should be handled by the **Memory Cache** instead of the database. The **Memory Cache** is also useful for handling the unevenly distributed traffic and traffic spikes. With the large volume of reads, the **SQL Read Replicas** might not be able to handle the cache misses. We'll probably need to employ additional SQL scaling patterns.
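One way to picture the **Memory Cache** absorbing reads for popular content is the cache-aside pattern sketched below. This is an illustrative assumption rather than the design's actual implementation: a plain dict stands in for a cache such as Memcached or Redis, `fake_db_lookup` stands in for a query against a read replica, and expiry and eviction are left out.

```python
class SalesRankCache:
    """Cache-aside wrapper; a plain dict stands in for the Memory Cache."""

    def __init__(self, load_from_db):
        self._load_from_db = load_from_db
        self._cache = {}
        self.misses = 0

    def get_popular(self, category_id):
        # Serve from the cache when possible.
        if category_id in self._cache:
            return self._cache[category_id]
        # On a miss, read from the database and populate the cache.
        self.misses += 1
        value = self._load_from_db(category_id)
        self._cache[category_id] = value
        return value


def fake_db_lookup(category_id):
    # Hypothetical stand-in for a query against a SQL read replica.
    return [f"product-{category_id}-{rank}" for rank in (1, 2, 3)]


cache = SalesRankCache(fake_db_lookup)
first = cache.get_popular(1234)
second = cache.get_popular(1234)
print(cache.misses)  # 1
```

Only the first request for a category reaches the database; repeated requests for the same popular category are served from memory, which is what keeps cache misses, rather than total reads, as the load on the replicas.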
|
||||
|
||||
400 *average* writes per second (higher at peak) might be tough for a single **SQL Write Master-Slave**, also pointing to a need for additional scaling techniques.
|
||||
|
||||
SQL scaling patterns include:
|
||||
|
||||
* [Federation](https://github.com/donnemartin/system-design-primer#federation)
|
||||
* [Sharding](https://github.com/donnemartin/system-design-primer#sharding)
|
||||
* [Denormalization](https://github.com/donnemartin/system-design-primer#denormalization)
|
||||
* [SQL Tuning](https://github.com/donnemartin/system-design-primer#sql-tuning)
|
||||
* [Federation](https://github.com/donnemartin/system-design-primer#federation)
|
||||
* [Sharding](https://github.com/donnemartin/system-design-primer#sharding)
|
||||
* [Denormalization](https://github.com/donnemartin/system-design-primer#denormalization)
|
||||
* [SQL Tuning](https://github.com/donnemartin/system-design-primer#sql-tuning)
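As a small illustration of the sharding pattern in this list, the sketch below routes rows to a shard by hashing `category_id`. The shard count and the choice of `category_id` as the shard key are assumptions made for the example; a real deployment would also need a plan for resharding, since a simple modulo moves most keys when the shard count changes.

```python
import hashlib


def shard_for(category_id, num_shards):
    """Map a category to a shard index using a stable hash of its id."""
    digest = hashlib.md5(str(category_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


# Hypothetical setup: four shards, a handful of category ids.
NUM_SHARDS = 4
assignments = {cid: shard_for(cid, NUM_SHARDS) for cid in (1, 2, 3, 1234)}
print(assignments)
```

A stable hash is used instead of Python's built-in `hash()` so that every API server computes the same shard for the same category across processes.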
|
||||
|
||||
We should also consider moving some data to a **NoSQL Database**.
|
||||
|
||||
@@ -287,50 +287,50 @@ We should also consider moving some data to a **NoSQL Database**.
|
||||
|
||||
#### NoSQL
|
||||
|
||||
* [Key-value store](https://github.com/donnemartin/system-design-primer#key-value-store)
|
||||
* [Document store](https://github.com/donnemartin/system-design-primer#document-store)
|
||||
* [Wide column store](https://github.com/donnemartin/system-design-primer#wide-column-store)
|
||||
* [Graph database](https://github.com/donnemartin/system-design-primer#graph-database)
|
||||
* [SQL vs NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql)
|
||||
* [Key-value store](https://github.com/donnemartin/system-design-primer#key-value-store)
|
||||
* [Document store](https://github.com/donnemartin/system-design-primer#document-store)
|
||||
* [Wide column store](https://github.com/donnemartin/system-design-primer#wide-column-store)
|
||||
* [Graph database](https://github.com/donnemartin/system-design-primer#graph-database)
|
||||
* [SQL vs NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql)
|
||||
|
||||
### Caching
|
||||
|
||||
* Where to cache
|
||||
* [Client caching](https://github.com/donnemartin/system-design-primer#client-caching)
|
||||
* [CDN caching](https://github.com/donnemartin/system-design-primer#cdn-caching)
|
||||
* [Web server caching](https://github.com/donnemartin/system-design-primer#web-server-caching)
|
||||
* [Database caching](https://github.com/donnemartin/system-design-primer#database-caching)
|
||||
* [Application caching](https://github.com/donnemartin/system-design-primer#application-caching)
|
||||
* [Client caching](https://github.com/donnemartin/system-design-primer#client-caching)
|
||||
* [CDN caching](https://github.com/donnemartin/system-design-primer#cdn-caching)
|
||||
* [Web server caching](https://github.com/donnemartin/system-design-primer#web-server-caching)
|
||||
* [Database caching](https://github.com/donnemartin/system-design-primer#database-caching)
|
||||
* [Application caching](https://github.com/donnemartin/system-design-primer#application-caching)
|
||||
* What to cache
|
||||
* [Caching at the database query level](https://github.com/donnemartin/system-design-primer#caching-at-the-database-query-level)
|
||||
* [Caching at the object level](https://github.com/donnemartin/system-design-primer#caching-at-the-object-level)
|
||||
* [Caching at the database query level](https://github.com/donnemartin/system-design-primer#caching-at-the-database-query-level)
|
||||
* [Caching at the object level](https://github.com/donnemartin/system-design-primer#caching-at-the-object-level)
|
||||
* When to update the cache
|
||||
* [Cache-aside](https://github.com/donnemartin/system-design-primer#cache-aside)
|
||||
* [Write-through](https://github.com/donnemartin/system-design-primer#write-through)
|
||||
* [Write-behind (write-back)](https://github.com/donnemartin/system-design-primer#write-behind-write-back)
|
||||
* [Refresh ahead](https://github.com/donnemartin/system-design-primer#refresh-ahead)
|
||||
* [Cache-aside](https://github.com/donnemartin/system-design-primer#cache-aside)
|
||||
* [Write-through](https://github.com/donnemartin/system-design-primer#write-through)
|
||||
* [Write-behind (write-back) ](https://github.com/donnemartin/system-design-primer#write-behind-write-back)
|
||||
* [Refresh ahead](https://github.com/donnemartin/system-design-primer#refresh-ahead)
|
||||
|
||||
### Asynchronism and microservices
|
||||
|
||||
* [Message queues](https://github.com/donnemartin/system-design-primer#message-queues)
|
||||
* [Task queues](https://github.com/donnemartin/system-design-primer#task-queues)
|
||||
* [Back pressure](https://github.com/donnemartin/system-design-primer#back-pressure)
|
||||
* [Microservices](https://github.com/donnemartin/system-design-primer#microservices)
|
||||
* [Message queues](https://github.com/donnemartin/system-design-primer#message-queues)
|
||||
* [Task queues](https://github.com/donnemartin/system-design-primer#task-queues)
|
||||
* [Back pressure](https://github.com/donnemartin/system-design-primer#back-pressure)
|
||||
* [Microservices](https://github.com/donnemartin/system-design-primer#microservices)
|
||||
|
||||
### Communications
|
||||
|
||||
* Discuss tradeoffs:
|
||||
* External communication with clients - [HTTP APIs following REST](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest)
|
||||
* Internal communications - [RPC](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc)
|
||||
* [Service discovery](https://github.com/donnemartin/system-design-primer#service-discovery)
|
||||
* External communication with clients - [HTTP APIs following REST](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest)
|
||||
* Internal communications - [RPC](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc)
|
||||
* [Service discovery](https://github.com/donnemartin/system-design-primer#service-discovery)
|
||||
|
||||
### Security
|
||||
|
||||
Refer to the [security section](https://github.com/donnemartin/system-design-primer#security).
|
||||
Refer to the [security section](https://github.com/donnemartin/system-design-primer#security) .
|
||||
|
||||
### Latency numbers
|
||||
|
||||
See [Latency numbers every programmer should know](https://github.com/donnemartin/system-design-primer#latency-numbers-every-programmer-should-know).
|
||||
See [Latency numbers every programmer should know](https://github.com/donnemartin/system-design-primer#latency-numbers-every-programmer-should-know) .
|
||||
|
||||
### Ongoing
|
||||
|
||||
|
@@ -3,75 +3,75 @@
|
||||
from mrjob.job import MRJob
|
||||
|
||||
|
||||
class SalesRanker(MRJob):
|
||||
class SalesRanker(MRJob) :
|
||||
|
||||
def within_past_week(self, timestamp):
|
||||
def within_past_week(self, timestamp) :
|
||||
"""Return True if timestamp is within past week, False otherwise."""
|
||||
...
|
||||
|
||||
def mapper(self, _, line):
|
||||
def mapper(self, _, line) :
|
||||
"""Parse each log line, extract and transform relevant lines.
|
||||
|
||||
Emit key value pairs of the form:
|
||||
|
||||
(foo, p1), 2
|
||||
(bar, p1), 2
|
||||
(bar, p1), 1
|
||||
(foo, p2), 3
|
||||
(bar, p3), 10
|
||||
(foo, p4), 1
|
||||
(foo, p1) , 2
|
||||
(bar, p1) , 2
|
||||
(bar, p1) , 1
|
||||
(foo, p2) , 3
|
||||
(bar, p3) , 10
|
||||
(foo, p4) , 1
|
||||
"""
|
||||
timestamp, product_id, category, quantity = line.split('\t')
|
||||
if self.within_past_week(timestamp):
|
||||
yield (category, product_id), int(quantity)
|
||||
timestamp, product_id, category, quantity = line.split('\t')
|
||||
if self.within_past_week(timestamp) :
|
||||
yield (category, product_id) , int(quantity)
|
||||
|
||||
def reducer(self, key, values):
|
||||
def reducer(self, key, values) :
|
||||
"""Sum values for each key.
|
||||
|
||||
(foo, p1), 2
|
||||
(bar, p1), 3
|
||||
(foo, p2), 3
|
||||
(bar, p3), 10
|
||||
(foo, p4), 1
|
||||
(foo, p1) , 2
|
||||
(bar, p1) , 3
|
||||
(foo, p2) , 3
|
||||
(bar, p3) , 10
|
||||
(foo, p4) , 1
|
||||
"""
|
||||
yield key, sum(values)
|
||||
yield key, sum(values)
|
||||
|
||||
def mapper_sort(self, key, value):
|
||||
def mapper_sort(self, key, value) :
|
||||
"""Construct key to ensure proper sorting.
|
||||
|
||||
Transform key and value to the form:
|
||||
|
||||
(foo, 2), p1
|
||||
(bar, 3), p1
|
||||
(foo, 3), p2
|
||||
(bar, 10), p3
|
||||
(foo, 1), p4
|
||||
(foo, 2) , p1
|
||||
(bar, 3) , p1
|
||||
(foo, 3) , p2
|
||||
(bar, 10) , p3
|
||||
(foo, 1) , p4
|
||||
|
||||
The shuffle/sort step of MapReduce will then do a
|
||||
distributed sort on the keys, resulting in:
|
||||
|
||||
(bar, 3), p1
(bar, 10), p3
(foo, 1), p4
(foo, 2), p1
(foo, 3), p2
(bar, 3) , p1
(bar, 10) , p3
(foo, 1) , p4
(foo, 2) , p1
(foo, 3) , p2
|
||||
"""
|
||||
category, product_id = key
|
||||
quantity = value
|
||||
yield (category, quantity), product_id
|
||||
yield (category, quantity) , product_id
|
||||
|
||||
def reducer_identity(self, key, value):
|
||||
def reducer_identity(self, key, value) :
|
||||
yield key, value
|
||||
|
||||
def steps(self):
|
||||
def steps(self) :
|
||||
"""Run the map and reduce steps."""
|
||||
return [
|
||||
self.mr(mapper=self.mapper,
|
||||
reducer=self.reducer),
|
||||
reducer=self.reducer) ,
|
||||
self.mr(mapper=self.mapper_sort,
|
||||
reducer=self.reducer_identity),
|
||||
reducer=self.reducer_identity) ,
|
||||
]
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
SalesRanker.run()
|
||||
SalesRanker.run()
|
||||
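The two MapReduce steps described in the docstrings above can be traced locally without a cluster. The following is a minimal sketch in plain Python, not the `SalesRanker` job itself: `run_sales_rank` is a hypothetical helper, the `within_past_week` filter is left out, log lines are assumed to be tab-separated in the order `timestamp, product_id, category, quantity`, and the shuffle/sort phase is approximated with a dictionary and `sorted`.

```python
from collections import defaultdict


def mapper(line):
    """Emit ((category, product_id), quantity) for one tab-separated log line."""
    timestamp, product_id, category, quantity = line.split('\t')
    # Quantities arrive as text, so convert them before they are summed.
    yield (category, product_id), int(quantity)


def reducer(key, values):
    """Sum the quantities collected for one (category, product_id) key."""
    yield key, sum(values)


def mapper_sort(key, value):
    """Move the quantity into the key so that sorting orders by category, then quantity."""
    category, product_id = key
    yield (category, value), product_id


def run_sales_rank(lines):
    """Hypothetical local driver that mimics the two steps in sequence."""
    # Step 1: map, group values by key (a stand-in for the shuffle), reduce.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    totals = [pair for key, values in groups.items()
              for pair in reducer(key, values)]
    # Step 2: re-key, then sort (a stand-in for the distributed sort).
    rekeyed = [pair for key, value in totals
               for pair in mapper_sort(key, value)]
    return sorted(rekeyed)
```

Within each category the results come out in ascending order of quantity, so the best-selling product of a category is the last entry for that category.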
|
@@ -64,7 +64,7 @@
|
||||
|
||||
> 用所有重要组件概述高水平设计
|
||||
|
||||

|
||||

|
||||
|
||||
## 第 3 步:设计核心组件
|
||||
|
||||
@@ -83,7 +83,7 @@
|
||||
|
||||
* **Web 服务器** 在 EC2 上
|
||||
* 存储用户数据
|
||||
* [**MySQL 数据库**](https://github.com/donnemartin/system-design-primer#relational-database-management-system-rdbms)
|
||||
* [**MySQL 数据库**](https://github.com/donnemartin/system-design-primer#relational-database-management-system-rdbms)
|
||||
|
||||
运用 **纵向扩展**:
|
||||
|
||||
@@ -96,7 +96,7 @@
|
||||
|
||||
**折中方案, 可选方案, 和其他细节:**
|
||||
|
||||
* **纵向扩展** 的可选方案是 [**横向扩展**](https://github.com/donnemartin/system-design-primer#horizontal-scaling)
|
||||
* **纵向扩展** 的可选方案是 [**横向扩展**](https://github.com/donnemartin/system-design-primer#horizontal-scaling)
|
||||
|
||||
#### 自 SQL 开始,但认真考虑 NoSQL
|
||||
|
||||
@@ -104,7 +104,7 @@
|
||||
|
||||
**折中方案, 可选方案, 和其他细节:**
|
||||
|
||||
* 查阅 [关系型数据库管理系统 (RDBMS)](https://github.com/donnemartin/system-design-primer#relational-database-management-system-rdbms) 章节
|
||||
* 查阅 [关系型数据库管理系统 (RDBMS) ](https://github.com/donnemartin/system-design-primer#relational-database-management-system-rdbms) 章节
|
||||
* 讨论使用 [SQL 或 NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql) 的原因
|
||||
|
||||
#### 分配公共静态 IP
|
||||
@@ -139,7 +139,7 @@
|
||||
|
||||
### 用户+
|
||||
|
||||

|
||||

|
||||
|
||||
#### 假设
|
||||
|
||||
@@ -191,7 +191,7 @@
|
||||
|
||||
### Users++
|
||||
|
||||

|
||||

|
||||
|
||||
#### 假设
|
||||
|
||||
@@ -208,11 +208,11 @@
|
||||
* 终止在 **负载平衡器** 上的SSL,以减少后端服务器上的计算负载,并简化证书管理
|
||||
* 在多个可用区域中使用多台 **Web服务器**
|
||||
* 在多个可用区域的 [**主-从 故障转移**](https://github.com/donnemartin/system-design-primer#master-slave-replication) 模式中使用多个 **MySQL** 实例来改进冗余
|
||||
* 分离 **Web 服务器** 和 [**应用服务器**](https://github.com/donnemartin/system-design-primer#application-layer)
|
||||
* 分离 **Web 服务器** 和 [**应用服务器**](https://github.com/donnemartin/system-design-primer#application-layer)
|
||||
* 独立扩展和配置每一层
|
||||
* **Web 服务器** 可以作为 [**反向代理**](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
|
||||
* **Web 服务器** 可以作为 [**反向代理**](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
|
||||
* 例如, 你可以添加 **应用服务器** 处理 **读 API** 而另外一些处理 **写 API**
|
||||
* 将静态(和一些动态)内容转移到 [**内容分发网络 (CDN)**](https://github.com/donnemartin/system-design-primer#content-delivery-network) 例如 CloudFront 以减少负载和延迟
|
||||
* 将静态(和一些动态)内容转移到 [**内容分发网络 (CDN) **](https://github.com/donnemartin/system-design-primer#content-delivery-network) 例如 CloudFront 以减少负载和延迟
|
||||
|
||||
**折中方案, 可选方案, 和其他细节:**
|
||||
|
||||
@@ -220,7 +220,7 @@
|
||||
|
||||
### 用户+++
|
||||
|
||||

|
||||

|
||||
|
||||
**注意:** **内部负载均衡** 不显示以减少混乱
|
||||
|
||||
@@ -232,7 +232,7 @@
|
||||
|
||||
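* The following goals attempt to address the scaling issues with the **MySQL Database**
|
||||
* Based on the **Benchmarks/Load Tests** and **Profiling**, you might only need to implement one or two of these techniques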
|
||||
* 将下列数据移动到一个 [**内存缓存**](https://github.com/donnemartin/system-design-primer#cache),例如弹性缓存,以减少负载和延迟:
|
||||
* 将下列数据移动到一个 [**内存缓存**](https://github.com/donnemartin/system-design-primer#cache) ,例如弹性缓存,以减少负载和延迟:
|
||||
* **MySQL** 中频繁访问的内容
|
||||
* 首先, 尝试配置 **MySQL 数据库** 缓存以查看是否足以在实现 **内存缓存** 之前缓解瓶颈
|
||||
* 来自 **Web 服务器** 的会话数据
|
||||
@@ -254,11 +254,11 @@
|
||||
|
||||
**折中方案, 可选方案, 和其他细节:**
|
||||
|
||||
* 查阅 [关系型数据库管理系统 (RDBMS)](https://github.com/donnemartin/system-design-primer#relational-database-management-system-rdbms) 章节
|
||||
* 查阅 [关系型数据库管理系统 (RDBMS) ](https://github.com/donnemartin/system-design-primer#relational-database-management-system-rdbms) 章节
|
||||
|
||||
### 用户++++
|
||||
|
||||

|
||||

|
||||
|
||||
#### 假设
|
||||
|
||||
@@ -297,7 +297,7 @@
|
||||
|
||||
### 用户+++++
|
||||
|
||||

|
||||

|
||||
|
||||
**注释:** **自动伸缩** 组不显示以减少混乱
|
||||
|
||||
@@ -317,10 +317,10 @@
|
||||
|
||||
SQL 扩展模型包括:
|
||||
|
||||
* [集合](https://github.com/donnemartin/system-design-primer#federation)
|
||||
* [分片](https://github.com/donnemartin/system-design-primer#sharding)
|
||||
* [反范式](https://github.com/donnemartin/system-design-primer#denormalization)
|
||||
* [SQL 调优](https://github.com/donnemartin/system-design-primer#sql-tuning)
|
||||
* [集合](https://github.com/donnemartin/system-design-primer#federation)
|
||||
* [分片](https://github.com/donnemartin/system-design-primer#sharding)
|
||||
* [反范式](https://github.com/donnemartin/system-design-primer#denormalization)
|
||||
* [SQL 调优](https://github.com/donnemartin/system-design-primer#sql-tuning)
|
||||
|
||||
为了进一步处理高读和写请求,我们还应该考虑将适当的数据移动到一个 [**NoSQL数据库**](https://github.com/donnemartin/system-design-primer#nosql) ,例如 DynamoDB。
|
||||
|
||||
@@ -344,58 +344,58 @@ SQL 扩展模型包括:
|
||||
|
||||
### SQL 扩展模式
|
||||
|
||||
* [读取副本](https://github.com/donnemartin/system-design-primer#master-slave-replication)
|
||||
* [集合](https://github.com/donnemartin/system-design-primer#federation)
|
||||
* [分区](https://github.com/donnemartin/system-design-primer#sharding)
|
||||
* [反规范化](https://github.com/donnemartin/system-design-primer#denormalization)
|
||||
* [SQL 调优](https://github.com/donnemartin/system-design-primer#sql-tuning)
|
||||
* [读取副本](https://github.com/donnemartin/system-design-primer#master-slave-replication)
|
||||
* [集合](https://github.com/donnemartin/system-design-primer#federation)
|
||||
* [分区](https://github.com/donnemartin/system-design-primer#sharding)
|
||||
* [反规范化](https://github.com/donnemartin/system-design-primer#denormalization)
|
||||
* [SQL 调优](https://github.com/donnemartin/system-design-primer#sql-tuning)
|
||||
|
||||
#### NoSQL
|
||||
|
||||
* [键值存储](https://github.com/donnemartin/system-design-primer#key-value-store)
|
||||
* [文档存储](https://github.com/donnemartin/system-design-primer#document-store)
|
||||
* [宽表存储](https://github.com/donnemartin/system-design-primer#wide-column-store)
|
||||
* [图数据库](https://github.com/donnemartin/system-design-primer#graph-database)
|
||||
* [SQL vs NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql)
|
||||
* [键值存储](https://github.com/donnemartin/system-design-primer#key-value-store)
|
||||
* [文档存储](https://github.com/donnemartin/system-design-primer#document-store)
|
||||
* [宽表存储](https://github.com/donnemartin/system-design-primer#wide-column-store)
|
||||
* [图数据库](https://github.com/donnemartin/system-design-primer#graph-database)
|
||||
* [SQL vs NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql)
|
||||
|
||||
### 缓存
|
||||
|
||||
* 缓存到哪里
|
||||
* [客户端缓存](https://github.com/donnemartin/system-design-primer#client-caching)
|
||||
* [CDN 缓存](https://github.com/donnemartin/system-design-primer#cdn-caching)
|
||||
* [Web 服务缓存](https://github.com/donnemartin/system-design-primer#web-server-caching)
|
||||
* [数据库缓存](https://github.com/donnemartin/system-design-primer#database-caching)
|
||||
* [应用缓存](https://github.com/donnemartin/system-design-primer#application-caching)
|
||||
* [客户端缓存](https://github.com/donnemartin/system-design-primer#client-caching)
|
||||
* [CDN 缓存](https://github.com/donnemartin/system-design-primer#cdn-caching)
|
||||
* [Web 服务缓存](https://github.com/donnemartin/system-design-primer#web-server-caching)
|
||||
* [数据库缓存](https://github.com/donnemartin/system-design-primer#database-caching)
|
||||
* [应用缓存](https://github.com/donnemartin/system-design-primer#application-caching)
|
||||
* 缓存什么
|
||||
* [数据库请求层缓存](https://github.com/donnemartin/system-design-primer#caching-at-the-database-query-level)
|
||||
* [对象层缓存](https://github.com/donnemartin/system-design-primer#caching-at-the-object-level)
|
||||
* [数据库请求层缓存](https://github.com/donnemartin/system-design-primer#caching-at-the-database-query-level)
|
||||
* [对象层缓存](https://github.com/donnemartin/system-design-primer#caching-at-the-object-level)
|
||||
* 何时更新缓存
|
||||
* [预留缓存](https://github.com/donnemartin/system-design-primer#cache-aside)
|
||||
* [完全写入](https://github.com/donnemartin/system-design-primer#write-through)
|
||||
* [延迟写 (写回)](https://github.com/donnemartin/system-design-primer#write-behind-write-back)
|
||||
* [事先更新](https://github.com/donnemartin/system-design-primer#refresh-ahead)
|
||||
* [预留缓存](https://github.com/donnemartin/system-design-primer#cache-aside)
|
||||
* [完全写入](https://github.com/donnemartin/system-design-primer#write-through)
|
||||
* [延迟写 (写回) ](https://github.com/donnemartin/system-design-primer#write-behind-write-back)
|
||||
* [事先更新](https://github.com/donnemartin/system-design-primer#refresh-ahead)
|
||||
|
||||
### 异步性和微服务
|
||||
|
||||
* [消息队列](https://github.com/donnemartin/system-design-primer#message-queues)
|
||||
* [任务队列](https://github.com/donnemartin/system-design-primer#task-queues)
|
||||
* [回退压力](https://github.com/donnemartin/system-design-primer#back-pressure)
|
||||
* [微服务](https://github.com/donnemartin/system-design-primer#microservices)
|
||||
* [消息队列](https://github.com/donnemartin/system-design-primer#message-queues)
|
||||
* [任务队列](https://github.com/donnemartin/system-design-primer#task-queues)
|
||||
* [回退压力](https://github.com/donnemartin/system-design-primer#back-pressure)
|
||||
* [微服务](https://github.com/donnemartin/system-design-primer#microservices)
|
||||
|
||||
### 沟通
|
||||
|
||||
* 关于折中方案的讨论:
|
||||
* 客户端的外部通讯 - [遵循 REST 的 HTTP APIs](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest)
|
||||
* 内部通讯 - [RPC](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc)
|
||||
* [服务探索](https://github.com/donnemartin/system-design-primer#service-discovery)
|
||||
* 客户端的外部通讯 - [遵循 REST 的 HTTP APIs](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest)
|
||||
* 内部通讯 - [RPC](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc)
|
||||
* [服务探索](https://github.com/donnemartin/system-design-primer#service-discovery)
|
||||
|
||||
### 安全性
|
||||
|
||||
参考 [安全章节](https://github.com/donnemartin/system-design-primer#security)
|
||||
参考 [安全章节](https://github.com/donnemartin/system-design-primer#security)
|
||||
|
||||
### 延迟数字指标
|
||||
|
||||
查阅 [每个程序员必懂的延迟数字](https://github.com/donnemartin/system-design-primer#latency-numbers-every-programmer-should-know)
|
||||
查阅 [每个程序员必懂的延迟数字](https://github.com/donnemartin/system-design-primer#latency-numbers-every-programmer-should-know)
|
||||
|
||||
### 正在进行
|
||||
|
||||
|
@@ -64,7 +64,7 @@ Handy conversion guide:
|
||||
|
||||
> Outline a high level design with all important components.
|
||||
|
||||

|
||||

|
||||
|
||||
## Step 3: Design core components
|
||||
|
||||
@@ -83,7 +83,7 @@ Handy conversion guide:
|
||||
|
||||
* **Web server** on EC2
|
||||
* Storage for user data
|
||||
* [**MySQL Database**](https://github.com/donnemartin/system-design-primer#relational-database-management-system-rdbms)
|
||||
* [**MySQL Database**](https://github.com/donnemartin/system-design-primer#relational-database-management-system-rdbms)
|
||||
|
||||
Use **Vertical Scaling**:
|
||||
|
||||
@@ -96,7 +96,7 @@ Use **Vertical Scaling**:
|
||||
|
||||
*Trade-offs, alternatives, and additional details:*
|
||||
|
||||
* The alternative to **Vertical Scaling** is [**Horizontal scaling**](https://github.com/donnemartin/system-design-primer#horizontal-scaling)
|
||||
* The alternative to **Vertical Scaling** is [**Horizontal scaling**](https://github.com/donnemartin/system-design-primer#horizontal-scaling)
|
||||
|
||||
#### Start with SQL, consider NoSQL
|
||||
|
||||
@@ -104,8 +104,8 @@ The constraints assume there is a need for relational data. We can start off us
|
||||
|
||||
*Trade-offs, alternatives, and additional details:*
|
||||
|
||||
* See the [Relational database management system (RDBMS)](https://github.com/donnemartin/system-design-primer#relational-database-management-system-rdbms) section
|
||||
* Discuss reasons to use [SQL or NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql)
|
||||
* See the [Relational database management system (RDBMS) ](https://github.com/donnemartin/system-design-primer#relational-database-management-system-rdbms) section
|
||||
* Discuss reasons to use [SQL or NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql)
|
||||
|
||||
#### Assign a public static IP
|
||||
|
||||
@@ -139,7 +139,7 @@ Add a **DNS** such as Route 53 to map the domain to the instance's public IP.
|
||||
|
||||
### Users+
|
||||
|
||||

|
||||

|
||||
|
||||
#### Assumptions
|
||||
|
||||
@@ -191,7 +191,7 @@ We've been able to address these issues with **Vertical Scaling** so far. Unfor
|
||||
|
||||
### Users++
|
||||
|
||||

|
||||

|
||||
|
||||
#### Assumptions
|
||||
|
||||
@@ -208,11 +208,11 @@ Our **Benchmarks/Load Tests** and **Profiling** show that our single **Web Serve
|
||||
* Terminate SSL on the **Load Balancer** to reduce computational load on backend servers and to simplify certificate administration
|
||||
* Use multiple **Web Servers** spread out over multiple availability zones
|
||||
* Use multiple **MySQL** instances in [**Master-Slave Failover**](https://github.com/donnemartin/system-design-primer#master-slave-replication) mode across multiple availability zones to improve redundancy
|
||||
* Separate out the **Web Servers** from the [**Application Servers**](https://github.com/donnemartin/system-design-primer#application-layer)
|
||||
* Separate out the **Web Servers** from the [**Application Servers**](https://github.com/donnemartin/system-design-primer#application-layer)
|
||||
* Scale and configure both layers independently
|
||||
* **Web Servers** can run as a [**Reverse Proxy**](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
|
||||
* **Web Servers** can run as a [**Reverse Proxy**](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
|
||||
* For example, you can add **Application Servers** handling **Read APIs** while others handle **Write APIs**
|
||||
* Move static (and some dynamic) content to a [**Content Delivery Network (CDN)**](https://github.com/donnemartin/system-design-primer#content-delivery-network) such as CloudFront to reduce load and latency
|
||||
* Move static (and some dynamic) content to a [**Content Delivery Network (CDN) **](https://github.com/donnemartin/system-design-primer#content-delivery-network) such as CloudFront to reduce load and latency
|
||||
|
||||
*Trade-offs, alternatives, and additional details:*
|
||||
|
||||
@@ -220,7 +220,7 @@ Our **Benchmarks/Load Tests** and **Profiling** show that our single **Web Serve
|
||||
|
||||
### Users+++
|
||||
|
||||

|
||||

|
||||
|
||||
**Note:** **Internal Load Balancers** not shown to reduce clutter
|
||||
|
||||
@@ -249,16 +249,16 @@ Our **Benchmarks/Load Tests** and **Profiling** show that we are read-heavy (100
|
||||
|
||||
* In addition to adding and scaling a **Memory Cache**, **MySQL Read Replicas** can also help relieve load on the **MySQL Write Master**
|
||||
* Add logic to **Web Server** to separate out writes and reads
|
||||
* Add **Load Balancers** in front of **MySQL Read Replicas** (not pictured to reduce clutter)
|
||||
* Add **Load Balancers** in front of **MySQL Read Replicas** (not pictured to reduce clutter)
|
||||
* Most services are read-heavy vs write-heavy
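
The read/write separation mentioned in the list above could be sketched as a small router that sends `SELECT` statements to the read replicas and everything else to the write master. This is only an illustration under stated assumptions: `ReadWriteRouter` and `connection_for` are hypothetical names, connections are represented by plain placeholder values, and the statement type is detected naively from the first keyword. Because replication is asynchronous, a read issued right after a write may still see stale data on a replica.

```python
import itertools


class ReadWriteRouter(object):
    """Hypothetical helper that picks a database connection per statement."""

    def __init__(self, master, replicas):
        self.master = master
        # Rotate through the replicas round-robin; fall back to the master if none exist.
        self._replicas = itertools.cycle(replicas) if replicas else None

    def connection_for(self, sql):
        words = sql.split(None, 1)
        verb = words[0].upper() if words else ''
        if verb == 'SELECT' and self._replicas is not None:
            return next(self._replicas)
        # Writes, and anything that is not clearly a read, go to the master.
        return self.master
```

In practice a load balancer in front of the replicas, as the list suggests, would take over the round-robin part, and the application would only need to choose between a read endpoint and a write endpoint.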
|
||||
|
||||
*Trade-offs, alternatives, and additional details:*
|
||||
|
||||
* See the [Relational database management system (RDBMS)](https://github.com/donnemartin/system-design-primer#relational-database-management-system-rdbms) section
|
||||
* See the [Relational database management system (RDBMS) ](https://github.com/donnemartin/system-design-primer#relational-database-management-system-rdbms) section
|
||||
|
||||
### Users++++
|
||||
|
||||

|
||||

|
||||
|
||||
#### Assumptions
|
||||
|
||||
@@ -297,7 +297,7 @@ Our **Benchmarks/Load Tests** and **Profiling** show that our traffic spikes dur
|
||||
|
||||
### Users+++++
|
||||
|
||||

|
||||

|
||||
|
||||
**Note:** **Autoscaling** groups not shown to reduce clutter
|
||||
|
||||
@@ -317,10 +317,10 @@ We'll continue to address scaling issues due to the problem's constraints:
|
||||
|
||||
SQL scaling patterns include:
|
||||
|
||||
* [Federation](https://github.com/donnemartin/system-design-primer#federation)
|
||||
* [Sharding](https://github.com/donnemartin/system-design-primer#sharding)
|
||||
* [Denormalization](https://github.com/donnemartin/system-design-primer#denormalization)
|
||||
* [SQL Tuning](https://github.com/donnemartin/system-design-primer#sql-tuning)
|
||||
* [Federation](https://github.com/donnemartin/system-design-primer#federation)
|
||||
* [Sharding](https://github.com/donnemartin/system-design-primer#sharding)
|
||||
* [Denormalization](https://github.com/donnemartin/system-design-primer#denormalization)
|
||||
* [SQL Tuning](https://github.com/donnemartin/system-design-primer#sql-tuning)
|
||||
|
||||
To further address the high read and write requests, we should also consider moving appropriate data to a [**NoSQL Database**](https://github.com/donnemartin/system-design-primer#nosql) such as DynamoDB.
|
||||
|
||||
@@ -344,58 +344,58 @@ We can further separate out our [**Application Servers**](https://github.com/don
|
||||
|
||||
### SQL scaling patterns
|
||||
|
||||
* [Read replicas](https://github.com/donnemartin/system-design-primer#master-slave-replication)
|
||||
* [Federation](https://github.com/donnemartin/system-design-primer#federation)
|
||||
* [Sharding](https://github.com/donnemartin/system-design-primer#sharding)
|
||||
* [Denormalization](https://github.com/donnemartin/system-design-primer#denormalization)
|
||||
* [SQL Tuning](https://github.com/donnemartin/system-design-primer#sql-tuning)
|
||||
* [Read replicas](https://github.com/donnemartin/system-design-primer#master-slave-replication)
|
||||
* [Federation](https://github.com/donnemartin/system-design-primer#federation)
|
||||
* [Sharding](https://github.com/donnemartin/system-design-primer#sharding)
|
||||
* [Denormalization](https://github.com/donnemartin/system-design-primer#denormalization)
|
||||
* [SQL Tuning](https://github.com/donnemartin/system-design-primer#sql-tuning)
|
||||
|
||||
#### NoSQL
|
||||
|
||||
* [Key-value store](https://github.com/donnemartin/system-design-primer#key-value-store)
|
||||
* [Document store](https://github.com/donnemartin/system-design-primer#document-store)
|
||||
* [Wide column store](https://github.com/donnemartin/system-design-primer#wide-column-store)
|
||||
* [Graph database](https://github.com/donnemartin/system-design-primer#graph-database)
|
||||
* [SQL vs NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql)
|
||||
* [Key-value store](https://github.com/donnemartin/system-design-primer#key-value-store)
|
||||
* [Document store](https://github.com/donnemartin/system-design-primer#document-store)
|
||||
* [Wide column store](https://github.com/donnemartin/system-design-primer#wide-column-store)
|
||||
* [Graph database](https://github.com/donnemartin/system-design-primer#graph-database)
|
||||
* [SQL vs NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql)
|
||||
|
||||
### Caching
|
||||
|
||||
* Where to cache
|
||||
* [Client caching](https://github.com/donnemartin/system-design-primer#client-caching)
|
||||
* [CDN caching](https://github.com/donnemartin/system-design-primer#cdn-caching)
|
||||
* [Web server caching](https://github.com/donnemartin/system-design-primer#web-server-caching)
|
||||
* [Database caching](https://github.com/donnemartin/system-design-primer#database-caching)
|
||||
* [Application caching](https://github.com/donnemartin/system-design-primer#application-caching)
|
||||
* [Client caching](https://github.com/donnemartin/system-design-primer#client-caching)
|
||||
* [CDN caching](https://github.com/donnemartin/system-design-primer#cdn-caching)
|
||||
* [Web server caching](https://github.com/donnemartin/system-design-primer#web-server-caching)
|
||||
* [Database caching](https://github.com/donnemartin/system-design-primer#database-caching)
|
||||
* [Application caching](https://github.com/donnemartin/system-design-primer#application-caching)
|
||||
* What to cache
|
||||
* [Caching at the database query level](https://github.com/donnemartin/system-design-primer#caching-at-the-database-query-level)
|
||||
* [Caching at the object level](https://github.com/donnemartin/system-design-primer#caching-at-the-object-level)
|
||||
* [Caching at the database query level](https://github.com/donnemartin/system-design-primer#caching-at-the-database-query-level)
|
||||
* [Caching at the object level](https://github.com/donnemartin/system-design-primer#caching-at-the-object-level)
|
||||
* When to update the cache
|
||||
* [Cache-aside](https://github.com/donnemartin/system-design-primer#cache-aside)
|
||||
* [Write-through](https://github.com/donnemartin/system-design-primer#write-through)
|
||||
* [Write-behind (write-back)](https://github.com/donnemartin/system-design-primer#write-behind-write-back)
|
||||
* [Refresh ahead](https://github.com/donnemartin/system-design-primer#refresh-ahead)
|
||||
* [Cache-aside](https://github.com/donnemartin/system-design-primer#cache-aside)
|
||||
* [Write-through](https://github.com/donnemartin/system-design-primer#write-through)
|
||||
* [Write-behind (write-back) ](https://github.com/donnemartin/system-design-primer#write-behind-write-back)
|
||||
* [Refresh ahead](https://github.com/donnemartin/system-design-primer#refresh-ahead)
|
||||
|
||||
### Asynchronism and microservices
|
||||
|
||||
* [Message queues](https://github.com/donnemartin/system-design-primer#message-queues)
|
||||
* [Task queues](https://github.com/donnemartin/system-design-primer#task-queues)
|
||||
* [Back pressure](https://github.com/donnemartin/system-design-primer#back-pressure)
|
||||
* [Microservices](https://github.com/donnemartin/system-design-primer#microservices)
|
||||
* [Message queues](https://github.com/donnemartin/system-design-primer#message-queues)
|
||||
* [Task queues](https://github.com/donnemartin/system-design-primer#task-queues)
|
||||
* [Back pressure](https://github.com/donnemartin/system-design-primer#back-pressure)
|
||||
* [Microservices](https://github.com/donnemartin/system-design-primer#microservices)
|
||||
|
||||
### Communications
|
||||
|
||||
* Discuss tradeoffs:
|
||||
* External communication with clients - [HTTP APIs following REST](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest)
|
||||
* Internal communications - [RPC](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc)
|
||||
* [Service discovery](https://github.com/donnemartin/system-design-primer#service-discovery)
|
||||
* External communication with clients - [HTTP APIs following REST](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest)
|
||||
* Internal communications - [RPC](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc)
|
||||
* [Service discovery](https://github.com/donnemartin/system-design-primer#service-discovery)
|
||||
|
||||
### Security
|
||||
|
||||
Refer to the [security section](https://github.com/donnemartin/system-design-primer#security).
|
||||
Refer to the [security section](https://github.com/donnemartin/system-design-primer#security) .
|
||||
|
||||
### Latency numbers
|
||||
|
||||
See [Latency numbers every programmer should know](https://github.com/donnemartin/system-design-primer#latency-numbers-every-programmer-should-know).
|
||||
See [Latency numbers every programmer should know](https://github.com/donnemartin/system-design-primer#latency-numbers-every-programmer-should-know) .
|
||||
|
||||
### Ongoing
|
||||
|
||||
|
@@ -29,7 +29,7 @@
|
||||
* Average of 50 friends per user
|
||||
* 1 billion friend searches per month
|
||||
|
||||
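Exercise the use of more traditional systems - don't use graph-specific solutions such as [GraphQL](http://graphql.org/) or a graph database like [Neo4j](https://neo4j.com/).
|
||||
Exercise the use of more traditional systems - don't use graph-specific solutions such as [GraphQL](http://graphql.org/) or a graph database like [Neo4j](https://neo4j.com/) .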
|
||||
|
||||
#### 计算使用
|
||||
|
||||
@@ -50,7 +50,7 @@
|
||||
|
||||
> Outline a high level design with all important components.
|
||||
|
||||

|
||||

|
||||
|
||||
## Step 3: Design core components
|
||||
|
||||
@@ -63,37 +63,37 @@
|
||||
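Without the constraint of millions of users (vertices) and billions of friend relationships (edges), we could solve this unweighted shortest path task with a general BFS approach: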
|
||||
|
||||
```python
|
||||
class Graph(Graph):
|
||||
class Graph(Graph) :
|
||||
|
||||
def shortest_path(self, source, dest):
|
||||
def shortest_path(self, source, dest) :
|
||||
if source is None or dest is None:
|
||||
return None
|
||||
if source is dest:
|
||||
return [source.key]
|
||||
prev_node_keys = self._shortest_path(source, dest)
|
||||
prev_node_keys = self._shortest_path(source, dest)
|
||||
if prev_node_keys is None:
|
||||
return None
|
||||
else:
|
||||
path_ids = [dest.key]
|
||||
prev_node_key = prev_node_keys[dest.key]
|
||||
while prev_node_key is not None:
|
||||
path_ids.append(prev_node_key)
|
||||
path_ids.append(prev_node_key)
|
||||
prev_node_key = prev_node_keys[prev_node_key]
|
||||
return path_ids[::-1]
|
||||
|
||||
def _shortest_path(self, source, dest):
|
||||
queue = deque()
|
||||
queue.append(source)
|
||||
def _shortest_path(self, source, dest) :
|
||||
queue = deque()
|
||||
queue.append(source)
|
||||
prev_node_keys = {source.key: None}
|
||||
source.visit_state = State.visited
|
||||
while queue:
|
||||
node = queue.popleft()
|
||||
node = queue.popleft()
|
||||
if node is dest:
|
||||
return prev_node_keys
|
||||
prev_node = node
|
||||
for adj_node in node.adj_nodes.values():
|
||||
for adj_node in node.adj_nodes.values() :
|
||||
if adj_node.visit_state == State.unvisited:
|
||||
queue.append(adj_node)
|
||||
queue.append(adj_node)
|
||||
prev_node_keys[adj_node.key] = prev_node.key
|
||||
adj_node.visit_state = State.visited
|
||||
return None
|
||||
@@ -101,7 +101,7 @@ class Graph(Graph):
|
||||
|
||||
We won't be able to fit all users on the same machine, so we'll need to [shard](https://github.com/donnemartin/system-design-primer#sharding) users across **Person Servers** and access them through a **Lookup Service**.
|
||||
|
||||
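* The **Client** sends a request to the **Web Server**, running as a [reverse proxy](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
|
||||
* The **Client** sends a request to the **Web Server**, running as a [reverse proxy](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
|
||||
* The **Search API** server forwards the request to the **User Graph Service**
|
||||
* The **User Graph Service** does the following:
|
||||
* Uses the **Lookup Service** to find the **Person Server** where the current user's info is stored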
|
||||
@@ -117,43 +117,43 @@ class Graph(Graph):
|
||||
**Lookup Service** implementation:
|
||||
|
||||
```python
|
||||
class LookupService(object):
|
||||
class LookupService(object) :
|
||||
|
||||
def __init__(self):
|
||||
self.lookup = self._init_lookup() # key: person_id, value: person_server
|
||||
def __init__(self) :
|
||||
self.lookup = self._init_lookup() # key: person_id, value: person_server
|
||||
|
||||
def _init_lookup(self):
|
||||
def _init_lookup(self) :
|
||||
...
|
||||
|
||||
def lookup_person_server(self, person_id):
|
||||
def lookup_person_server(self, person_id) :
|
||||
return self.lookup[person_id]
|
||||
```
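
The listing above leaves `_init_lookup` open. One way the mapping from `person_id` to a person server could be derived, instead of being stored per person, is modulo sharding. The sketch below is an assumption for illustration only: `HashLookupService` is a hypothetical class, ids are assumed to be integers, and servers are represented by arbitrary placeholder objects.

```python
class HashLookupService(object):
    """Hypothetical variant of the lookup service that computes the shard from the id."""

    def __init__(self, person_servers):
        if not person_servers:
            raise ValueError('at least one person server is required')
        self.person_servers = list(person_servers)

    def lookup_person_server(self, person_id):
        # Modulo sharding: simple and stateless, but adding or removing
        # a server changes the result for most ids.
        index = person_id % len(self.person_servers)
        return self.person_servers[index]
```

The trade-off is that resizing the pool remaps most ids, which is why an explicit lookup table, as in the original listing, or consistent hashing is often preferred once servers are added regularly.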
|
||||
|
||||
**Person Server** implementation:
|
||||
|
||||
```python
|
||||
class PersonServer(object):
|
||||
class PersonServer(object) :
|
||||
|
||||
def __init__(self):
|
||||
def __init__(self) :
|
||||
self.people = {} # key: person_id, value: person
|
||||
|
||||
def add_person(self, person):
|
||||
def add_person(self, person) :
|
||||
...
|
||||
|
||||
def people(self, ids):
|
||||
def people(self, ids) :
|
||||
results = []
|
||||
for id in ids:
|
||||
if id in self.people:
|
||||
results.append(self.people[id])
|
||||
results.append(self.people[id])
|
||||
return results
|
||||
```
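
In the listing above, the `people` dictionary assigned in `__init__` has the same name as the `people` method, so on an instance the dictionary would shadow the method and `server.people(ids)` would fail. A runnable sketch that avoids the clash is shown below. The attribute name `_people` and the body of `add_person` are assumptions, since the original leaves `add_person` unspecified, and `Person` is repeated here only to keep the example self-contained.

```python
class Person(object):

    def __init__(self, id, name, friend_ids):
        self.id = id
        self.name = name
        self.friend_ids = friend_ids


class PersonServer(object):

    def __init__(self):
        # Stored under a different name so the people() method stays callable.
        self._people = {}  # key: person_id, value: person

    def add_person(self, person):
        # Assumed behavior: index the person by id, replacing any existing entry.
        self._people[person.id] = person

    def people(self, ids):
        # Ids that are unknown to this server are skipped silently.
        return [self._people[id] for id in ids if id in self._people]
```

A short usage example:

```python
server = PersonServer()
server.add_person(Person(1, 'Ann', [2]))
server.add_person(Person(2, 'Bob', [1]))
found = server.people([1, 99, 2])  # id 99 is not stored on this server
```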
|
||||
|
||||
**Person** implementation:
|
||||
|
||||
```python
|
||||
class Person(object):
|
||||
class Person(object) :
|
||||
|
||||
def __init__(self, id, name, friend_ids):
|
||||
def __init__(self, id, name, friend_ids) :
|
||||
self.id = id
|
||||
self.name = name
|
||||
self.friend_ids = friend_ids
|
||||
@@ -162,21 +162,21 @@ class Person(object):
|
||||
**User Graph Service** implementation:
|
||||
|
||||
```python
|
||||
class UserGraphService(object):
|
||||
class UserGraphService(object) :
|
||||
|
||||
def __init__(self, lookup_service):
|
||||
def __init__(self, lookup_service) :
|
||||
self.lookup_service = lookup_service
|
||||
|
||||
def person(self, person_id):
|
||||
person_server = self.lookup_service.lookup_person_server(person_id)
|
||||
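return next(iter(person_server.people([person_id])), None)  # people() returns a list; unwrap it
|
||||
def person(self, person_id) :
|
||||
person_server = self.lookup_service.lookup_person_server(person_id)
|
||||
return next(iter(person_server.people([person_id])), None)  # people() returns a list; unwrap it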
|
||||
|
||||
def shortest_path(self, source_key, dest_key):
|
||||
def shortest_path(self, source_key, dest_key) :
|
||||
if source_key is None or dest_key is None:
|
||||
return None
|
||||
if source_key == dest_key:
|
||||
return [source_key]
|
||||
prev_node_keys = self._shortest_path(source_key, dest_key)
|
||||
prev_node_keys = self._shortest_path(source_key, dest_key)
|
||||
if prev_node_keys is None:
|
||||
return None
|
||||
else:
|
||||
@@ -184,40 +184,40 @@ class UserGraphService(object):
|
||||
path_ids = [dest_key]
|
||||
prev_node_key = prev_node_keys[dest_key]
|
||||
while prev_node_key is not None:
|
||||
path_ids.append(prev_node_key)
|
||||
path_ids.append(prev_node_key)
|
||||
prev_node_key = prev_node_keys[prev_node_key]
|
||||
# Reverse the list since we iterated backwards
|
||||
return path_ids[::-1]
|
||||
|
||||
def _shortest_path(self, source_key, dest_key):
|
||||
def _shortest_path(self, source_key, dest_key) :
|
||||
# Use the id to get the Person
|
||||
source = self.person(source_key)
|
||||
source = self.person(source_key)
|
||||
# Update our bfs queue
|
||||
queue = deque()
|
||||
queue.append(source)
|
||||
queue = deque()
|
||||
queue.append(source)
|
||||
# prev_node_keys keeps track of each hop from
|
||||
# the source_key to the dest_key
|
||||
prev_node_keys = {source_key: None}
|
||||
# We'll use visited_ids to keep track of which nodes we've
|
||||
# visited, which can be different from a typical bfs where
|
||||
# this can be stored in the node itself
|
||||
visited_ids = set()
|
||||
visited_ids.add(source.id)
|
||||
visited_ids = set()
|
||||
visited_ids.add(source.id)
|
||||
while queue:
|
||||
node = queue.popleft()
|
||||
node = queue.popleft()
|
||||
if node.id == dest_key:
|
||||
return prev_node_keys
|
||||
prev_node = node
|
||||
for friend_id in node.friend_ids:
|
||||
if friend_id not in visited_ids:
|
||||
friend_node = self.person(friend_id)
|
||||
queue.append(friend_node)
|
||||
friend_node = self.person(friend_id)
|
||||
queue.append(friend_node)
|
||||
prev_node_keys[friend_id] = prev_node.id
|
||||
visited_ids.add(friend_id)
|
||||
visited_ids.add(friend_id)
|
||||
return None
|
||||
```
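
The sharded BFS above can be condensed into a self-contained sketch that runs locally. It is not the service implementation: `fetch_person` is a hypothetical callable standing in for the lookup service plus person server round trip, `Person` is modelled as a namedtuple, and the two dictionaries in `shards` play the role of two person servers with ids assigned by `id % 2`.

```python
from collections import deque, namedtuple

# Hypothetical stand-in for the Person class used by the services.
Person = namedtuple('Person', ['id', 'name', 'friend_ids'])


def shortest_path(fetch_person, source_key, dest_key):
    """BFS over friend ids; fetch_person(id) returns a Person or None."""
    if source_key is None or dest_key is None:
        return None
    if source_key == dest_key:
        return [source_key]
    # prev_node_keys doubles as the visited set: a key is present once discovered.
    prev_node_keys = {source_key: None}
    queue = deque([source_key])
    while queue:
        node_key = queue.popleft()
        if node_key == dest_key:
            # Walk the predecessor chain back to the source, then reverse it.
            path = []
            while node_key is not None:
                path.append(node_key)
                node_key = prev_node_keys[node_key]
            return path[::-1]
        person = fetch_person(node_key)
        if person is None:
            continue  # id not stored on any shard
        for friend_id in person.friend_ids:
            if friend_id not in prev_node_keys:
                prev_node_keys[friend_id] = node_key
                queue.append(friend_id)
    return None


# Two in-memory "shards"; a person with id n lives in shards[n % 2].
shards = [
    {2: Person(2, 'b', [1, 3]), 4: Person(4, 'd', [3])},
    {1: Person(1, 'a', [2, 5]), 3: Person(3, 'c', [2, 4]), 5: Person(5, 'e', [1])},
]


def fetch_person(person_id):
    return shards[person_id % 2].get(person_id)
```

Queuing ids rather than person objects means each person is fetched only when the search actually expands that node, which matters when every fetch is a network call to another server.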
|
||||
|
||||
We'll use a public [**REST API**](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest):
|
||||
We'll use a public [**REST API**](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest) :
|
||||
|
||||
```
|
||||
$ curl https://social.com/api/v1/friend_search?person_id=1234
|
||||
@@ -243,13 +243,13 @@ $ curl https://social.com/api/v1/friend_search?person_id=1234
|
||||
},
|
||||
```
|
||||
|
||||
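For internal communications, we could use [Remote Procedure Calls](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc).
|
||||
For internal communications, we could use [Remote Procedure Calls](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc) .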
|
||||
|
||||
## Step 4: Scale the design
|
||||
|
||||
> Identify and address bottlenecks, given the constraints.
|
||||
|
||||

|
||||

|
||||
|
||||
**重要:不要从最初设计直接跳到最终设计!**
|
||||
|
||||
@@ -261,14 +261,14 @@ $ curl https://social.com/api/v1/friend_search?person_id=1234
|
||||
|
||||
**避免重复讨论**,以下网址链接到 [系统设计主题](https://github.com/donnemartin/system-design-primer#index-of-system-design-topics) 相关的主流方案、折中方案和替代方案。
|
||||
|
||||
* [DNS](https://github.com/donnemartin/system-design-primer#domain-name-system)
|
||||
* [负载均衡](https://github.com/donnemartin/system-design-primer#load-balancer)
|
||||
* [横向扩展](https://github.com/donnemartin/system-design-primer#horizontal-scaling)
|
||||
* [Web 服务器(反向代理)](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
|
||||
* [API 服务器(应用层)](https://github.com/donnemartin/system-design-primer#application-layer)
|
||||
* [缓存](https://github.com/donnemartin/system-design-primer#cache)
|
||||
* [一致性模式](https://github.com/donnemartin/system-design-primer#consistency-patterns)
|
||||
* [可用性模式](https://github.com/donnemartin/system-design-primer#availability-patterns)
|
||||
* [DNS](https://github.com/donnemartin/system-design-primer#domain-name-system)
|
||||
* [负载均衡](https://github.com/donnemartin/system-design-primer#load-balancer)
|
||||
* [横向扩展](https://github.com/donnemartin/system-design-primer#horizontal-scaling)
|
||||
* [Web 服务器(反向代理)](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
|
||||
* [API 服务器(应用层)](https://github.com/donnemartin/system-design-primer#application-layer)
|
||||
* [缓存](https://github.com/donnemartin/system-design-primer#cache)
|
||||
* [一致性模式](https://github.com/donnemartin/system-design-primer#consistency-patterns)
|
||||
* [可用性模式](https://github.com/donnemartin/system-design-primer#availability-patterns)
|
||||
|
||||
为了应对 **平均** 每秒 400 次读请求的限制(峰值时更高),人员数据可以存放在 Redis 或 Memcached 这样的 **内存缓存** 中,以减少响应时间并降低对下游服务的流量。这对于连续执行多次搜索的用户,以及查询人脉广泛的人时尤其有用。从内存中顺序读取 1 MB 数据大约需要 250 微秒,从 SSD 读取的耗时约为其 4 倍,从硬盘读取约为其 80 倍。<sup><a href=https://github.com/donnemartin/system-design-primer#latency-numbers-every-programmer-should-know>1</a></sup>
|
||||
|
||||
@@ -279,9 +279,9 @@ $ curl https://social.com/api/v1/friend_search?person_id=1234
|
||||
* 将托管在同一台 **人员服务器** 上的朋友查找合并成批处理,以减少机器间跳转
|
||||
* 通过地理位置 [拆分](https://github.com/donnemartin/system-design-primer#sharding) **人员服务器** 来进一步优化,因为朋友通常住得都比较近
|
||||
* 同时进行两个 BFS 查找,一个从 source 开始,一个从 destination 开始,然后合并两个路径
|
||||
* 从有庞大朋友圈的人开始找起,这样更有可能减小当前用户和搜索目标之间的 [离散度数](https://en.wikipedia.org/wiki/Six_degrees_of_separation)
|
||||
* 从有庞大朋友圈的人开始找起,这样更有可能减小当前用户和搜索目标之间的 [离散度数](https://en.wikipedia.org/wiki/Six_degrees_of_separation)
|
||||
* 设置基于时间或跳数的上限,达到上限后再询问用户是否继续搜索,因为某些情况下搜索可能耗时很长
|
||||
* 使用类似 [Neo4j](https://neo4j.com/) 的 **图数据库** 或图特定查询语法,例如 [GraphQL](http://graphql.org/)(如果没有禁止使用 **图数据库** 的限制的话)
|
||||
* 使用类似 [Neo4j](https://neo4j.com/) 的 **图数据库** 或图特定查询语法,例如 [GraphQL](http://graphql.org/) (如果没有禁止使用 **图数据库** 的限制的话)
|
||||
|
||||
## 额外的话题
|
||||
|
||||
@@ -289,58 +289,58 @@ $ curl https://social.com/api/v1/friend_search?person_id=1234
|
||||
|
||||
### SQL 扩展模式
|
||||
|
||||
* [读取副本](https://github.com/donnemartin/system-design-primer#master-slave-replication)
|
||||
* [集合](https://github.com/donnemartin/system-design-primer#federation)
|
||||
* [分区](https://github.com/donnemartin/system-design-primer#sharding)
|
||||
* [反规范化](https://github.com/donnemartin/system-design-primer#denormalization)
|
||||
* [SQL 调优](https://github.com/donnemartin/system-design-primer#sql-tuning)
|
||||
* [读取副本](https://github.com/donnemartin/system-design-primer#master-slave-replication)
|
||||
* [集合](https://github.com/donnemartin/system-design-primer#federation)
|
||||
* [分区](https://github.com/donnemartin/system-design-primer#sharding)
|
||||
* [反规范化](https://github.com/donnemartin/system-design-primer#denormalization)
|
||||
* [SQL 调优](https://github.com/donnemartin/system-design-primer#sql-tuning)
|
||||
|
||||
#### NoSQL
|
||||
|
||||
* [键值存储](https://github.com/donnemartin/system-design-primer#key-value-store)
|
||||
* [文档存储](https://github.com/donnemartin/system-design-primer#document-store)
|
||||
* [宽表存储](https://github.com/donnemartin/system-design-primer#wide-column-store)
|
||||
* [图数据库](https://github.com/donnemartin/system-design-primer#graph-database)
|
||||
* [SQL vs NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql)
|
||||
* [键值存储](https://github.com/donnemartin/system-design-primer#key-value-store)
|
||||
* [文档存储](https://github.com/donnemartin/system-design-primer#document-store)
|
||||
* [宽表存储](https://github.com/donnemartin/system-design-primer#wide-column-store)
|
||||
* [图数据库](https://github.com/donnemartin/system-design-primer#graph-database)
|
||||
* [SQL vs NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql)
|
||||
|
||||
### 缓存
|
||||
|
||||
* 缓存到哪里
|
||||
* [客户端缓存](https://github.com/donnemartin/system-design-primer#client-caching)
|
||||
* [CDN 缓存](https://github.com/donnemartin/system-design-primer#cdn-caching)
|
||||
* [Web 服务缓存](https://github.com/donnemartin/system-design-primer#web-server-caching)
|
||||
* [数据库缓存](https://github.com/donnemartin/system-design-primer#database-caching)
|
||||
* [应用缓存](https://github.com/donnemartin/system-design-primer#application-caching)
|
||||
* [客户端缓存](https://github.com/donnemartin/system-design-primer#client-caching)
|
||||
* [CDN 缓存](https://github.com/donnemartin/system-design-primer#cdn-caching)
|
||||
* [Web 服务缓存](https://github.com/donnemartin/system-design-primer#web-server-caching)
|
||||
* [数据库缓存](https://github.com/donnemartin/system-design-primer#database-caching)
|
||||
* [应用缓存](https://github.com/donnemartin/system-design-primer#application-caching)
|
||||
* 缓存什么
|
||||
* [数据库请求层缓存](https://github.com/donnemartin/system-design-primer#caching-at-the-database-query-level)
|
||||
* [对象层缓存](https://github.com/donnemartin/system-design-primer#caching-at-the-object-level)
|
||||
* [数据库请求层缓存](https://github.com/donnemartin/system-design-primer#caching-at-the-database-query-level)
|
||||
* [对象层缓存](https://github.com/donnemartin/system-design-primer#caching-at-the-object-level)
|
||||
* 何时更新缓存
|
||||
* [预留缓存](https://github.com/donnemartin/system-design-primer#cache-aside)
|
||||
* [完全写入](https://github.com/donnemartin/system-design-primer#write-through)
|
||||
* [延迟写 (写回)](https://github.com/donnemartin/system-design-primer#write-behind-write-back)
|
||||
* [事先更新](https://github.com/donnemartin/system-design-primer#refresh-ahead)
|
||||
* [预留缓存](https://github.com/donnemartin/system-design-primer#cache-aside)
|
||||
* [完全写入](https://github.com/donnemartin/system-design-primer#write-through)
|
||||
* [延迟写 (写回) ](https://github.com/donnemartin/system-design-primer#write-behind-write-back)
|
||||
* [事先更新](https://github.com/donnemartin/system-design-primer#refresh-ahead)
|
||||
|
||||
### 异步性和微服务
|
||||
|
||||
* [消息队列](https://github.com/donnemartin/system-design-primer#message-queues)
|
||||
* [任务队列](https://github.com/donnemartin/system-design-primer#task-queues)
|
||||
* [回退压力](https://github.com/donnemartin/system-design-primer#back-pressure)
|
||||
* [微服务](https://github.com/donnemartin/system-design-primer#microservices)
|
||||
* [消息队列](https://github.com/donnemartin/system-design-primer#message-queues)
|
||||
* [任务队列](https://github.com/donnemartin/system-design-primer#task-queues)
|
||||
* [回退压力](https://github.com/donnemartin/system-design-primer#back-pressure)
|
||||
* [微服务](https://github.com/donnemartin/system-design-primer#microservices)
|
||||
|
||||
### 沟通
|
||||
|
||||
* 关于折中方案的讨论:
|
||||
* 客户端的外部通讯 - [遵循 REST 的 HTTP APIs](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest)
|
||||
* 内部通讯 - [RPC](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc)
|
||||
* [服务探索](https://github.com/donnemartin/system-design-primer#service-discovery)
|
||||
* 客户端的外部通讯 - [遵循 REST 的 HTTP APIs](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest)
|
||||
* 内部通讯 - [RPC](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc)
|
||||
* [服务探索](https://github.com/donnemartin/system-design-primer#service-discovery)
|
||||
|
||||
### 安全性
|
||||
|
||||
参考 [安全章节](https://github.com/donnemartin/system-design-primer#security)
|
||||
参考 [安全章节](https://github.com/donnemartin/system-design-primer#security)
|
||||
|
||||
### 延迟数字指标
|
||||
|
||||
查阅 [每个程序员必懂的延迟数字](https://github.com/donnemartin/system-design-primer#latency-numbers-every-programmer-should-know)
|
||||
查阅 [每个程序员必懂的延迟数字](https://github.com/donnemartin/system-design-primer#latency-numbers-every-programmer-should-know)
|
||||
|
||||
### 正在进行
|
||||
|
||||
|
@@ -29,7 +29,7 @@ Without an interviewer to address clarifying questions, we'll define some use ca
|
||||
* 50 friends per user average
|
||||
* 1 billion friend searches per month
|
||||
|
||||
Exercise the use of more traditional systems - don't use graph-specific solutions such as [GraphQL](http://graphql.org/) or a graph database like [Neo4j](https://neo4j.com/)
|
||||
Exercise the use of more traditional systems - don't use graph-specific solutions such as [GraphQL](http://graphql.org/) or a graph database like [Neo4j](https://neo4j.com/)
|
||||
|
||||
#### Calculate usage
|
||||
|
||||
@@ -50,7 +50,7 @@ Handy conversion guide:
|
||||
|
||||
> Outline a high level design with all important components.
|
||||
|
||||

|
||||

|
||||
|
||||
## Step 3: Design core components
|
||||
|
||||
@@ -60,40 +60,40 @@ Handy conversion guide:
|
||||
|
||||
**Clarify with your interviewer how much code you are expected to write**.
|
||||
|
||||
Without the constraint of millions of users (vertices) and billions of friend relationships (edges), we could solve this unweighted shortest path task with a general BFS approach:
|
||||
Without the constraint of millions of users (vertices) and billions of friend relationships (edges) , we could solve this unweighted shortest path task with a general BFS approach:
|
||||
|
||||
```python
|
||||
class Graph(Graph):
|
||||
class Graph(Graph) :
|
||||
|
||||
def shortest_path(self, source, dest):
|
||||
def shortest_path(self, source, dest) :
|
||||
if source is None or dest is None:
|
||||
return None
|
||||
if source is dest:
|
||||
return [source.key]
|
||||
prev_node_keys = self._shortest_path(source, dest)
|
||||
prev_node_keys = self._shortest_path(source, dest)
|
||||
if prev_node_keys is None:
|
||||
return None
|
||||
else:
|
||||
path_ids = [dest.key]
|
||||
prev_node_key = prev_node_keys[dest.key]
|
||||
while prev_node_key is not None:
|
||||
path_ids.append(prev_node_key)
|
||||
path_ids.append(prev_node_key)
|
||||
prev_node_key = prev_node_keys[prev_node_key]
|
||||
return path_ids[::-1]
|
||||
|
||||
def _shortest_path(self, source, dest):
|
||||
queue = deque()
|
||||
queue.append(source)
|
||||
def _shortest_path(self, source, dest) :
|
||||
queue = deque()
|
||||
queue.append(source)
|
||||
prev_node_keys = {source.key: None}
|
||||
source.visit_state = State.visited
|
||||
while queue:
|
||||
node = queue.popleft()
|
||||
node = queue.popleft()
|
||||
if node is dest:
|
||||
return prev_node_keys
|
||||
prev_node = node
|
||||
for adj_node in node.adj_nodes.values():
|
||||
for adj_node in node.adj_nodes.values() :
|
||||
if adj_node.visit_state == State.unvisited:
|
||||
queue.append(adj_node)
|
||||
queue.append(adj_node)
|
||||
prev_node_keys[adj_node.key] = prev_node.key
|
||||
adj_node.visit_state = State.visited
|
||||
return None
|
||||
@@ -101,7 +101,7 @@ class Graph(Graph):
|
||||
|
||||
We won't be able to fit all users on the same machine, so we'll need to [shard](https://github.com/donnemartin/system-design-primer#sharding) users across **Person Servers** and access them with a **Lookup Service**.
|
||||
|
||||
* The **Client** sends a request to the **Web Server**, running as a [reverse proxy](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
|
||||
* The **Client** sends a request to the **Web Server**, running as a [reverse proxy](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
|
||||
* The **Web Server** forwards the request to the **Search API** server
|
||||
* The **Search API** server forwards the request to the **User Graph Service**
|
||||
* The **User Graph Service** does the following:
|
||||
@@ -109,7 +109,7 @@ We won't be able to fit all users on the same machine, we'll need to [shard](htt
|
||||
* Finds the appropriate **Person Server** to retrieve the current user's list of `friend_ids`
|
||||
* Runs a BFS search using the current user as the `source` and the current user's `friend_ids` as the ids for each `adjacent_node`
|
||||
* To get the `adjacent_node` from a given id:
|
||||
* The **User Graph Service** will *again* need to communicate with the **Lookup Service** to determine which **Person Server** stores the `adjacent_node` matching the given id (potential for optimization)
|
||||
* The **User Graph Service** will *again* need to communicate with the **Lookup Service** to determine which **Person Server** stores the `adjacent_node` matching the given id (potential for optimization)
|
||||
|
||||
**Clarify with your interviewer how much code you should be writing**.
|
||||
|
||||
@@ -118,43 +118,43 @@ We won't be able to fit all users on the same machine, we'll need to [shard](htt
|
||||
**Lookup Service** implementation:
|
||||
|
||||
```python
|
||||
class LookupService(object):
|
||||
class LookupService(object) :
|
||||
|
||||
def __init__(self):
|
||||
self.lookup = self._init_lookup() # key: person_id, value: person_server
|
||||
def __init__(self) :
|
||||
self.lookup = self._init_lookup() # key: person_id, value: person_server
|
||||
|
||||
def _init_lookup(self):
|
||||
def _init_lookup(self) :
|
||||
...
|
||||
|
||||
def lookup_person_server(self, person_id):
|
||||
def lookup_person_server(self, person_id) :
|
||||
return self.lookup[person_id]
|
||||
```
|
||||
|
||||
**Person Server** implementation:
|
||||
|
||||
```python
|
||||
class PersonServer(object):
|
||||
class PersonServer(object) :
|
||||
|
||||
def __init__(self):
|
||||
def __init__(self) :
|
||||
self.people = {} # key: person_id, value: person
|
||||
|
||||
def add_person(self, person):
|
||||
def add_person(self, person) :
|
||||
...
|
||||
|
||||
def people(self, ids):
|
||||
def people(self, ids) :
|
||||
results = []
|
||||
for id in ids:
|
||||
if id in self.people:
|
||||
results.append(self.people[id])
|
||||
results.append(self.people[id])
|
||||
return results
|
||||
```
|
||||
|
||||
**Person** implementation:
|
||||
|
||||
```python
|
||||
class Person(object):
|
||||
class Person(object) :
|
||||
|
||||
def __init__(self, id, name, friend_ids):
|
||||
def __init__(self, id, name, friend_ids) :
|
||||
self.id = id
|
||||
self.name = name
|
||||
self.friend_ids = friend_ids
|
||||
@@ -163,21 +163,21 @@ class Person(object):
|
||||
**User Graph Service** implementation:
|
||||
|
||||
```python
|
||||
class UserGraphService(object):
|
||||
class UserGraphService(object) :
|
||||
|
||||
def __init__(self, lookup_service):
|
||||
def __init__(self, lookup_service) :
|
||||
self.lookup_service = lookup_service
|
||||
|
||||
def person(self, person_id):
|
||||
person_server = self.lookup_service.lookup_person_server(person_id)
|
||||
return person_server.people([person_id])
|
||||
def person(self, person_id) :
|
||||
person_server = self.lookup_service.lookup_person_server(person_id)
|
||||
return person_server.people([person_id])
|
||||
|
||||
def shortest_path(self, source_key, dest_key):
|
||||
def shortest_path(self, source_key, dest_key) :
|
||||
if source_key is None or dest_key is None:
|
||||
return None
|
||||
if source_key == dest_key:
|
||||
return [source_key]
|
||||
prev_node_keys = self._shortest_path(source_key, dest_key)
|
||||
prev_node_keys = self._shortest_path(source_key, dest_key)
|
||||
if prev_node_keys is None:
|
||||
return None
|
||||
else:
|
||||
@@ -185,40 +185,40 @@ class UserGraphService(object):
|
||||
path_ids = [dest_key]
|
||||
prev_node_key = prev_node_keys[dest_key]
|
||||
while prev_node_key is not None:
|
||||
path_ids.append(prev_node_key)
|
||||
path_ids.append(prev_node_key)
|
||||
prev_node_key = prev_node_keys[prev_node_key]
|
||||
# Reverse the list since we iterated backwards
|
||||
return path_ids[::-1]
|
||||
|
||||
def _shortest_path(self, source_key, dest_key):
|
||||
def _shortest_path(self, source_key, dest_key) :
|
||||
# Use the id to get the Person
|
||||
source = self.person(source_key)
|
||||
source = self.person(source_key)
|
||||
# Update our bfs queue
|
||||
queue = deque()
|
||||
queue.append(source)
|
||||
queue = deque()
|
||||
queue.append(source)
|
||||
# prev_node_keys keeps track of each hop from
|
||||
# the source_key to the dest_key
|
||||
prev_node_keys = {source_key: None}
|
||||
# We'll use visited_ids to keep track of which nodes we've
|
||||
# visited, which can be different from a typical bfs where
|
||||
# this can be stored in the node itself
|
||||
visited_ids = set()
|
||||
visited_ids.add(source.id)
|
||||
visited_ids = set()
|
||||
visited_ids.add(source.id)
|
||||
while queue:
|
||||
node = queue.popleft()
|
||||
node = queue.popleft()
|
||||
if node.id == dest_key:
|
||||
return prev_node_keys
|
||||
prev_node = node
|
||||
for friend_id in node.friend_ids:
|
||||
if friend_id not in visited_ids:
|
||||
friend_node = self.person(friend_id)
|
||||
queue.append(friend_node)
|
||||
friend_node = self.person(friend_id)
|
||||
queue.append(friend_node)
|
||||
prev_node_keys[friend_id] = prev_node.id
|
||||
visited_ids.add(friend_id)
|
||||
visited_ids.add(friend_id)
|
||||
return None
|
||||
```
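
A minimal, self-contained sketch of how the **Lookup Service**, **Person Server**, and **User Graph Service** could fit together end to end. It is an illustration under assumptions rather than the primer's implementation: the modulo sharding in `LookupService`, the names `people_by_id`, `get_people`, and `_bfs`, and the `build_demo_service` helper are all hypothetical, and each in-process call stands in for what would be a network hop.

```python
from collections import deque


class Person(object):

    def __init__(self, id, name, friend_ids):
        self.id = id
        self.name = name
        self.friend_ids = friend_ids


class PersonServer(object):

    def __init__(self):
        self.people_by_id = {}  # key: person_id, value: person

    def add_person(self, person):
        self.people_by_id[person.id] = person

    def get_people(self, ids):
        return [self.people_by_id[i] for i in ids if i in self.people_by_id]


class LookupService(object):

    def __init__(self, person_servers):
        # Assumption: shard by person_id modulo the number of servers.
        self.person_servers = person_servers

    def lookup_person_server(self, person_id):
        return self.person_servers[person_id % len(self.person_servers)]


class UserGraphService(object):

    def __init__(self, lookup_service):
        self.lookup_service = lookup_service

    def person(self, person_id):
        server = self.lookup_service.lookup_person_server(person_id)
        people = server.get_people([person_id])
        return people[0] if people else None

    def shortest_path(self, source_id, dest_id):
        if source_id is None or dest_id is None:
            return None
        if source_id == dest_id:
            return [source_id]
        prev_ids = self._bfs(source_id, dest_id)
        if prev_ids is None:
            return None
        # Walk back from the destination, then reverse.
        path = [dest_id]
        prev = prev_ids[dest_id]
        while prev is not None:
            path.append(prev)
            prev = prev_ids[prev]
        return path[::-1]

    def _bfs(self, source_id, dest_id):
        source = self.person(source_id)
        if source is None:
            return None
        queue = deque([source])
        prev_ids = {source_id: None}  # doubles as the visited set
        while queue:
            node = queue.popleft()
            if node.id == dest_id:
                return prev_ids
            for friend_id in node.friend_ids:
                if friend_id in prev_ids:
                    continue
                friend = self.person(friend_id)  # one lookup per friend
                if friend is None:
                    continue
                prev_ids[friend_id] = node.id
                queue.append(friend)
        return None


def build_demo_service():
    """Spread a six-person toy graph across two Person Servers."""
    lookup = LookupService([PersonServer(), PersonServer()])
    friends = {1: [2, 3], 2: [1, 4], 3: [1], 4: [2, 5], 5: [4], 6: []}
    for pid, fids in friends.items():
        server = lookup.lookup_person_server(pid)
        server.add_person(Person(pid, 'user%d' % pid, fids))
    return UserGraphService(lookup)
```

With the toy graph from `build_demo_service()`, `shortest_path(1, 5)` walks 1, 2, 4, 5, and a search for the isolated user 6 returns `None`. The per-friend `self.person(...)` call is the spot where batching lookups per **Person Server** would pay off.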
|
||||
|
||||
We'll use a public [**REST API**](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest):
|
||||
We'll use a public [**REST API**](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest) :
|
||||
|
||||
```
|
||||
$ curl https://social.com/api/v1/friend_search?person_id=1234
|
||||
@@ -244,13 +244,13 @@ Response:
|
||||
},
|
||||
```
|
||||
|
||||
For internal communications, we could use [Remote Procedure Calls](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc).
|
||||
For internal communications, we could use [Remote Procedure Calls](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc) .
|
||||
|
||||
## Step 4: Scale the design
|
||||
|
||||
> Identify and address bottlenecks, given the constraints.
|
||||
|
||||

|
||||

|
||||
|
||||
**Important: Do not simply jump right into the final design from the initial design!**
|
||||
|
||||
@@ -262,16 +262,16 @@ We'll introduce some components to complete the design and to address scalabilit
|
||||
|
||||
*To avoid repeating discussions*, refer to the following [system design topics](https://github.com/donnemartin/system-design-primer#index-of-system-design-topics) for main talking points, tradeoffs, and alternatives:
|
||||
|
||||
* [DNS](https://github.com/donnemartin/system-design-primer#domain-name-system)
|
||||
* [Load balancer](https://github.com/donnemartin/system-design-primer#load-balancer)
|
||||
* [Horizontal scaling](https://github.com/donnemartin/system-design-primer#horizontal-scaling)
|
||||
* [Web server (reverse proxy)](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
|
||||
* [API server (application layer)](https://github.com/donnemartin/system-design-primer#application-layer)
|
||||
* [Cache](https://github.com/donnemartin/system-design-primer#cache)
|
||||
* [Consistency patterns](https://github.com/donnemartin/system-design-primer#consistency-patterns)
|
||||
* [Availability patterns](https://github.com/donnemartin/system-design-primer#availability-patterns)
|
||||
* [DNS](https://github.com/donnemartin/system-design-primer#domain-name-system)
|
||||
* [Load balancer](https://github.com/donnemartin/system-design-primer#load-balancer)
|
||||
* [Horizontal scaling](https://github.com/donnemartin/system-design-primer#horizontal-scaling)
|
||||
* [Web server (reverse proxy) ](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
|
||||
* [API server (application layer) ](https://github.com/donnemartin/system-design-primer#application-layer)
|
||||
* [Cache](https://github.com/donnemartin/system-design-primer#cache)
|
||||
* [Consistency patterns](https://github.com/donnemartin/system-design-primer#consistency-patterns)
|
||||
* [Availability patterns](https://github.com/donnemartin/system-design-primer#availability-patterns)
|
||||
|
||||
To address the constraint of 400 *average* read requests per second (higher at peak), person data can be served from a **Memory Cache** such as Redis or Memcached to reduce response times and to reduce traffic to downstream services. This could be especially useful for people who do multiple searches in succession and for people who are well-connected. Reading 1 MB sequentially from memory takes about 250 microseconds, while reading from SSD takes 4x and from disk takes 80x longer.<sup><a href=https://github.com/donnemartin/system-design-primer#latency-numbers-every-programmer-should-know>1</a></sup>
|
||||
To address the constraint of 400 *average* read requests per second (higher at peak) , person data can be served from a **Memory Cache** such as Redis or Memcached to reduce response times and to reduce traffic to downstream services. This could be especially useful for people who do multiple searches in succession and for people who are well-connected. Reading 1 MB sequentially from memory takes about 250 microseconds, while reading from SSD takes 4x and from disk takes 80x longer.<sup><a href=https://github.com/donnemartin/system-design-primer#latency-numbers-every-programmer-should-know>1</a></sup>
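
As a rough illustration of serving person data from memory first, the sketch below places a small in-process LRU cache in front of a slower lookup. It only stands in for Redis or Memcached: the `PersonCache` class, its `load_person` callback, and the `max_size` parameter are hypothetical names chosen for this example.

```python
from collections import OrderedDict


class PersonCache(object):
    """Tiny in-process LRU cache standing in for Redis or Memcached."""

    def __init__(self, load_person, max_size):
        self.load_person = load_person  # fallback, e.g. a Person Server call
        self.max_size = max_size
        self.entries = OrderedDict()    # key: person_id, value: person
        self.hits = 0
        self.misses = 0

    def get(self, person_id):
        if person_id in self.entries:
            self.hits += 1
            self.entries.move_to_end(person_id)  # mark as most recently used
            return self.entries[person_id]
        self.misses += 1
        person = self.load_person(person_id)
        if person is not None:
            self.entries[person_id] = person
            if len(self.entries) > self.max_size:
                self.entries.popitem(last=False)  # evict least recently used
        return person
```

Repeated searches by the same user, or searches that pass through well-connected people, would mostly hit `entries` and skip the downstream call; the `hits` and `misses` counters give a quick way to check whether the cache size is adequate.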
|
||||
|
||||
Below are further optimizations:
|
||||
|
||||
@@ -282,7 +282,7 @@ Below are further optimizations:
|
||||
* Do two BFS searches at the same time, one starting from the source, and one from the destination, then merge the two paths
|
||||
* Start the BFS search from people with large numbers of friends, as they are more likely to reduce the number of [degrees of separation](https://en.wikipedia.org/wiki/Six_degrees_of_separation) between the current user and the search target
|
||||
* Set a limit based on time or number of hops before asking the user if they want to continue searching, as searching could take a considerable amount of time in some cases
|
||||
* Use a **Graph Database** such as [Neo4j](https://neo4j.com/) or a graph-specific query language such as [GraphQL](http://graphql.org/) (if there were no constraint preventing the use of **Graph Databases**)
|
||||
* Use a **Graph Database** such as [Neo4j](https://neo4j.com/) or a graph-specific query language such as [GraphQL](http://graphql.org/) (if there were no constraint preventing the use of **Graph Databases**)
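
The bidirectional search mentioned in the list above can be sketched on an in-memory adjacency dict. This is an illustration only, assuming an undirected friend graph stored as `{person_id: [friend_ids]}`; the names `bidirectional_shortest_path` and `_walk` are hypothetical, and in the sharded design each neighbour lookup would be a **Person Server** call instead of a dict access.

```python
def bidirectional_shortest_path(graph, source, dest):
    """Return the shortest path from source to dest as a list of ids, or None."""
    if source not in graph or dest not in graph:
        return None
    if source == dest:
        return [source]
    # Index 0 is the search from source, index 1 the search from dest.
    # The parent maps double as visited sets; dists hold hop counts.
    parents = ({source: None}, {dest: None})
    dists = ({source: 0}, {dest: 0})
    frontiers = [[source], [dest]]
    while frontiers[0] and frontiers[1]:
        # Expand the smaller frontier by one full level.
        side = 0 if len(frontiers[0]) <= len(frontiers[1]) else 1
        other = 1 - side
        best = None  # (total hops, node on this side, node on other side)
        next_frontier = []
        for node in frontiers[side]:
            for friend in graph.get(node, ()):
                if friend in parents[other]:
                    total = dists[side][node] + 1 + dists[other][friend]
                    if best is None or total < best[0]:
                        best = (total, node, friend)
                if friend not in parents[side]:
                    parents[side][friend] = node
                    dists[side][friend] = dists[side][node] + 1
                    next_frontier.append(friend)
        if best is not None:
            # The searches met: stitch the two half-paths together.
            _, a, b = best
            if side == 1:
                a, b = b, a  # make a the source-side node
            return _walk(parents[0], a)[::-1] + _walk(parents[1], b)
        frontiers[side] = next_frontier
    return None


def _walk(parents, node):
    """Follow parent links from node back to the root of its search."""
    path = []
    while node is not None:
        path.append(node)
        node = parents[node]
    return path
```

Finishing the whole level before returning, and keeping the smallest total, is what keeps the merged path shortest rather than merely valid. Each side only explores about half the depth, which is where the savings come from when users average 50 friends.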
|
||||
|
||||
## Additional talking points
|
||||
|
||||
@@ -290,58 +290,58 @@ Below are further optimizations:
|
||||
|
||||
### SQL scaling patterns
|
||||
|
||||
* [Read replicas](https://github.com/donnemartin/system-design-primer#master-slave-replication)
|
||||
* [Federation](https://github.com/donnemartin/system-design-primer#federation)
|
||||
* [Sharding](https://github.com/donnemartin/system-design-primer#sharding)
|
||||
* [Denormalization](https://github.com/donnemartin/system-design-primer#denormalization)
|
||||
* [SQL Tuning](https://github.com/donnemartin/system-design-primer#sql-tuning)
|
||||
* [Read replicas](https://github.com/donnemartin/system-design-primer#master-slave-replication)
|
||||
* [Federation](https://github.com/donnemartin/system-design-primer#federation)
|
||||
* [Sharding](https://github.com/donnemartin/system-design-primer#sharding)
|
||||
* [Denormalization](https://github.com/donnemartin/system-design-primer#denormalization)
|
||||
* [SQL Tuning](https://github.com/donnemartin/system-design-primer#sql-tuning)
|
||||
|
||||
#### NoSQL
|
||||
|
||||
* [Key-value store](https://github.com/donnemartin/system-design-primer#key-value-store)
|
||||
* [Document store](https://github.com/donnemartin/system-design-primer#document-store)
|
||||
* [Wide column store](https://github.com/donnemartin/system-design-primer#wide-column-store)
|
||||
* [Graph database](https://github.com/donnemartin/system-design-primer#graph-database)
|
||||
* [SQL vs NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql)
|
||||
* [Key-value store](https://github.com/donnemartin/system-design-primer#key-value-store)
|
||||
* [Document store](https://github.com/donnemartin/system-design-primer#document-store)
|
||||
* [Wide column store](https://github.com/donnemartin/system-design-primer#wide-column-store)
|
||||
* [Graph database](https://github.com/donnemartin/system-design-primer#graph-database)
|
||||
* [SQL vs NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql)
|
||||
|
||||
### Caching
|
||||
|
||||
* Where to cache
|
||||
* [Client caching](https://github.com/donnemartin/system-design-primer#client-caching)
|
||||
* [CDN caching](https://github.com/donnemartin/system-design-primer#cdn-caching)
|
||||
* [Web server caching](https://github.com/donnemartin/system-design-primer#web-server-caching)
|
||||
* [Database caching](https://github.com/donnemartin/system-design-primer#database-caching)
|
||||
* [Application caching](https://github.com/donnemartin/system-design-primer#application-caching)
|
||||
* [Client caching](https://github.com/donnemartin/system-design-primer#client-caching)
|
||||
* [CDN caching](https://github.com/donnemartin/system-design-primer#cdn-caching)
|
||||
* [Web server caching](https://github.com/donnemartin/system-design-primer#web-server-caching)
|
||||
* [Database caching](https://github.com/donnemartin/system-design-primer#database-caching)
|
||||
* [Application caching](https://github.com/donnemartin/system-design-primer#application-caching)
|
||||
* What to cache
|
||||
* [Caching at the database query level](https://github.com/donnemartin/system-design-primer#caching-at-the-database-query-level)
|
||||
* [Caching at the object level](https://github.com/donnemartin/system-design-primer#caching-at-the-object-level)
|
||||
* [Caching at the database query level](https://github.com/donnemartin/system-design-primer#caching-at-the-database-query-level)
|
||||
* [Caching at the object level](https://github.com/donnemartin/system-design-primer#caching-at-the-object-level)
|
||||
* When to update the cache
|
||||
* [Cache-aside](https://github.com/donnemartin/system-design-primer#cache-aside)
|
||||
* [Write-through](https://github.com/donnemartin/system-design-primer#write-through)
|
||||
* [Write-behind (write-back)](https://github.com/donnemartin/system-design-primer#write-behind-write-back)
|
||||
* [Refresh ahead](https://github.com/donnemartin/system-design-primer#refresh-ahead)
|
||||
* [Cache-aside](https://github.com/donnemartin/system-design-primer#cache-aside)
|
||||
* [Write-through](https://github.com/donnemartin/system-design-primer#write-through)
|
||||
* [Write-behind (write-back) ](https://github.com/donnemartin/system-design-primer#write-behind-write-back)
|
||||
* [Refresh ahead](https://github.com/donnemartin/system-design-primer#refresh-ahead)
|
||||
|
||||
### Asynchronism and microservices
|
||||
|
||||
* [Message queues](https://github.com/donnemartin/system-design-primer#message-queues)
|
||||
* [Task queues](https://github.com/donnemartin/system-design-primer#task-queues)
|
||||
* [Back pressure](https://github.com/donnemartin/system-design-primer#back-pressure)
|
||||
* [Microservices](https://github.com/donnemartin/system-design-primer#microservices)
|
||||
* [Message queues](https://github.com/donnemartin/system-design-primer#message-queues)
|
||||
* [Task queues](https://github.com/donnemartin/system-design-primer#task-queues)
|
||||
* [Back pressure](https://github.com/donnemartin/system-design-primer#back-pressure)
|
||||
* [Microservices](https://github.com/donnemartin/system-design-primer#microservices)
|
||||
|
||||
### Communications
|
||||
|
||||
* Discuss tradeoffs:
|
||||
* External communication with clients - [HTTP APIs following REST](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest)
|
||||
* Internal communications - [RPC](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc)
|
||||
* [Service discovery](https://github.com/donnemartin/system-design-primer#service-discovery)
|
||||
* External communication with clients - [HTTP APIs following REST](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest)
|
||||
* Internal communications - [RPC](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc)
|
||||
* [Service discovery](https://github.com/donnemartin/system-design-primer#service-discovery)
|
||||
|
||||
### Security
|
||||
|
||||
Refer to the [security section](https://github.com/donnemartin/system-design-primer#security).
|
||||
Refer to the [security section](https://github.com/donnemartin/system-design-primer#security) .
|
||||
|
||||
### Latency numbers
|
||||
|
||||
See [Latency numbers every programmer should know](https://github.com/donnemartin/system-design-primer#latency-numbers-every-programmer-should-know).
|
||||
See [Latency numbers every programmer should know](https://github.com/donnemartin/system-design-primer#latency-numbers-every-programmer-should-know) .
|
||||
|
||||
### Ongoing
|
||||
|
||||
|
@@ -3,70 +3,70 @@ from collections import deque
|
||||
from enum import Enum
|
||||
|
||||
|
||||
class State(Enum):
|
||||
class State(Enum) :
|
||||
unvisited = 0
|
||||
visited = 1
|
||||
|
||||
|
||||
class Graph(object):
|
||||
class Graph(object) :
|
||||
|
||||
-    def bfs(self, source, dest):
+    def bfs(self, source, dest) :
         if source is None:
             return False
-        queue = deque()
-        queue.append(source)
+        queue = deque()
+        queue.append(source)
         source.visit_state = State.visited
         while queue:
-            node = queue.popleft()
-            print(node)
+            node = queue.popleft()
+            print(node)
             if dest is node:
                 return True
-            for adjacent_node in node.adj_nodes.values():
+            for adjacent_node in node.adj_nodes.values() :
                 if adjacent_node.visit_state == State.unvisited:
-                    queue.append(adjacent_node)
+                    queue.append(adjacent_node)
                     adjacent_node.visit_state = State.visited
         return False


-class Person(object):
+class Person(object) :

-    def __init__(self, id, name):
+    def __init__(self, id, name) :
         self.id = id
         self.name = name
         self.friend_ids = []


-class LookupService(object):
+class LookupService(object) :

-    def __init__(self):
+    def __init__(self) :
         self.lookup = {}  # key: person_id, value: person_server

-    def get_person(self, person_id):
+    def get_person(self, person_id) :
         person_server = self.lookup[person_id]
         return person_server.people[person_id]


-class PersonServer(object):
+class PersonServer(object) :

-    def __init__(self):
+    def __init__(self) :
         self.people = {}  # key: person_id, value: person

-    def get_people(self, ids):
+    def get_people(self, ids) :
         results = []
         for id in ids:
             if id in self.people:
-                results.append(self.people[id])
+                results.append(self.people[id])
         return results


-class UserGraphService(object):
+class UserGraphService(object) :

-    def __init__(self, person_ids, lookup):
+    def __init__(self, person_ids, lookup) :
         self.lookup = lookup
         self.person_ids = person_ids
-        self.visited_ids = set()
+        self.visited_ids = set()

-    def bfs(self, source, dest):
+    def bfs(self, source, dest) :
         # Use self.visited_ids to track visited nodes
         # Use self.lookup to translate a person_id to a Person
         pass
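The `UserGraphService.bfs` stub above leaves the traversal itself unimplemented. The following is a minimal sketch of one way it could be filled in, not the author's implementation. It assumes people are keyed by integer ids and that every id listed in `friend_ids` is registered with the `LookupService`. The function name `shortest_path` and the `prev` map are hypothetical names introduced for illustration.

```python
from collections import deque


class Person:
    def __init__(self, id, name):
        self.id = id
        self.name = name
        self.friend_ids = []


class PersonServer:
    def __init__(self):
        self.people = {}  # key: person_id, value: Person


class LookupService:
    def __init__(self):
        self.lookup = {}  # key: person_id, value: PersonServer

    def get_person(self, person_id):
        return self.lookup[person_id].people[person_id]


def shortest_path(lookup, source_id, dest_id):
    """Breadth-first search over person ids; returns a list of ids or None."""
    if source_id == dest_id:
        return [source_id]
    visited_ids = {source_id}   # plays the role of self.visited_ids
    prev = {}                   # hypothetical helper: child id -> parent id
    queue = deque([source_id])
    while queue:
        current_id = queue.popleft()
        person = lookup.get_person(current_id)  # id -> Person via the lookup
        for friend_id in person.friend_ids:
            if friend_id in visited_ids:
                continue
            visited_ids.add(friend_id)
            prev[friend_id] = current_id
            if friend_id == dest_id:
                # Walk the parent links back to the source.
                path = [dest_id]
                while path[-1] != source_id:
                    path.append(prev[path[-1]])
                return path[::-1]
            queue.append(friend_id)
    return None


# Usage: two servers holding a small graph 1-2, 2-3, 1-4; person 5 is isolated.
lookup = LookupService()
servers = [PersonServer(), PersonServer()]
for pid in range(1, 6):
    server = servers[pid % 2]
    server.people[pid] = Person(pid, "person-%d" % pid)
    lookup.lookup[pid] = server
for a, b in [(1, 2), (2, 3), (1, 4)]:
    lookup.get_person(a).friend_ids.append(b)
    lookup.get_person(b).friend_ids.append(a)

print(shortest_path(lookup, 1, 3))
print(shortest_path(lookup, 1, 5))
```

Tracking visited ids in a set rather than on the node objects keeps the traversal state local to one request, which matters once `Person` records live on different servers.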
|
@@ -1,6 +1,6 @@
|
||||
# 设计推特时间轴与搜索功能
|
||||
|
||||
**注意:这个文档中的链接会直接指向[系统设计主题索引](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#系统设计主题的索引)中的有关部分,以避免重复的内容。你可以参考链接的相关内容,来了解其总的要点、方案的权衡取舍以及可选的替代方案。**
|
||||
**注意:这个文档中的链接会直接指向[系统设计主题索引](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#系统设计主题的索引) 中的有关部分,以避免重复的内容。你可以参考链接的相关内容,来了解其总的要点、方案的权衡取舍以及可选的替代方案。**
|
||||
|
||||
**设计 Facebook 的 feed** 与**设计 Facebook 搜索**与此为同一类型问题。
|
||||
|
||||
@@ -74,11 +74,11 @@
|
||||
* 每条推特 10 KB * 每天 5 亿条推特 * 每月 30 天
|
||||
* 3 年产生新推特的内容为 5.4 PB
|
||||
* 每秒需要处理 10 万次读取请求
|
||||
* 每个月需要处理 2500 亿次请求 * (每秒 400 次请求 / 每月 10 亿次请求)
|
||||
* 每个月需要处理 2500 亿次请求 * (每秒 400 次请求 / 每月 10 亿次请求)
|
||||
* 每秒发布 6000 条推特
|
||||
* 每月发布 150 亿条推特 * (每秒 400 次请求 / 每月 10 亿次请求)
|
||||
* 每月发布 150 亿条推特 * (每秒 400 次请求 / 每月 10 亿次请求)
|
||||
* 每秒推送 6 万条推特
|
||||
* 每月推送 1500 亿条推特 * (每秒 400 次请求 / 每月 10 亿次请求)
|
||||
* 每月推送 1500 亿条推特 * (每秒 400 次请求 / 每月 10 亿次请求)
|
||||
* 每秒 4000 次搜索请求
|
||||
|
||||
便利换算指南:
|
||||
@@ -92,7 +92,7 @@
|
||||
|
||||
> 列出所有重要组件以规划概要设计。
|
||||
|
||||

|
||||

|
||||
|
||||
## 第三步:设计核心组件
|
||||
|
||||
@@ -100,13 +100,13 @@
|
||||
|
||||
### 用例:用户发表了一篇推特
|
||||
|
||||
我们可以将用户自己发表的推特存储在[关系数据库](https://github.com/donnemartin/system-design-primer#relational-database-management-system-rdbms)中。我们也可以讨论一下[究竟是用 SQL 还是用 NoSQL](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#sql-还是-nosql)。
|
||||
我们可以将用户自己发表的推特存储在[关系数据库](https://github.com/donnemartin/system-design-primer#relational-database-management-system-rdbms) 中。我们也可以讨论一下[究竟是用 SQL 还是用 NoSQL](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#sql-还是-nosql) 。
|
||||
|
||||
构建用户主页时间轴(查看关注用户的活动)以及推送推特是件麻烦事。将推特传播给所有关注者(每秒约递送 6 万条推特)这一操作有可能会使传统的[关系数据库](https://github.com/donnemartin/system-design-primer#relational-database-management-system-rdbms)超负载。因此,我们可以使用 **NoSQL 数据库**或**内存数据库**之类的更快的数据存储方式。从内存读取 1 MB 连续数据大约要花 250 微秒,而从 SSD 读取同样大小的数据要花费 4 倍的时间,从机械硬盘读取需要花费 80 倍以上的时间。<sup><a href=https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#每个程序员都应该知道的延迟数>1</a></sup>
|
||||
构建用户主页时间轴(查看关注用户的活动) 以及推送推特是件麻烦事。将推特传播给所有关注者(每秒约递送 6 万条推特) 这一操作有可能会使传统的[关系数据库](https://github.com/donnemartin/system-design-primer#relational-database-management-system-rdbms) 超负载。因此,我们可以使用 **NoSQL 数据库**或**内存数据库**之类的更快的数据存储方式。从内存读取 1 MB 连续数据大约要花 250 微秒,而从 SSD 读取同样大小的数据要花费 4 倍的时间,从机械硬盘读取需要花费 80 倍以上的时间。<sup><a href=https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#每个程序员都应该知道的延迟数>1</a></sup>
|
||||
|
||||
我们可以将照片、视频之类的媒体存储于**对象存储**中。
|
||||
|
||||
* **客户端**向应用[反向代理](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)的**Web 服务器**发送一条推特
|
||||
* **客户端**向应用[反向代理](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server) 的**Web 服务器**发送一条推特
|
||||
* **Web 服务器**将请求转发给**写 API**服务器
|
||||
* **写 API**服务器将推特使用 **SQL 数据库**存储于用户时间轴中
|
||||
* **写 API**调用**消息输出服务**,进行以下操作:
|
||||
@@ -130,7 +130,7 @@
|
||||
|
||||
新发布的推特将被存储在对应用户(关注且活跃的用户)的主页时间轴的**内存缓存**中。
|
||||
|
||||
我们可以调用一个公共的 [REST API](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#表述性状态转移rest):
|
||||
我们可以调用一个公共的 [REST API](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#表述性状态转移rest) :
|
||||
|
||||
```
|
||||
$ curl -X POST --data '{ "user_id": "123", "auth_token": "ABC123", \
|
||||
@@ -150,16 +150,16 @@ $ curl -X POST --data '{ "user_id": "123", "auth_token": "ABC123", \
|
||||
}
|
||||
```
|
||||
|
||||
而对于服务器内部的通信,我们可以使用 [RPC](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#远程过程调用协议rpc)。
|
||||
而对于服务器内部的通信,我们可以使用 [RPC](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#远程过程调用协议rpc) 。
|
||||
|
||||
### 用例:用户浏览主页时间轴
|
||||
|
||||
* **客户端**向 **Web 服务器**发起一次读取主页时间轴的请求
|
||||
* **Web 服务器**将请求转发给**读取 API**服务器
|
||||
* **读取 API**服务器调用**时间轴服务**进行以下操作:
|
||||
* 从**内存缓存**读取时间轴数据,其中包括推特 id 与用户 id - O(1)
|
||||
* 通过 [multiget](http://redis.io/commands/mget) 向**推特信息服务**进行查询,以获取相关 id 推特的额外信息 - O(n)
|
||||
* 通过 multiget 向**用户信息服务**进行查询,以获取相关 id 用户的额外信息 - O(n)
|
||||
* 从**内存缓存**读取时间轴数据,其中包括推特 id 与用户 id - O(1)
|
||||
* 通过 [multiget](http://redis.io/commands/mget) 向**推特信息服务**进行查询,以获取相关 id 推特的额外信息 - O(n)
|
||||
* 通过 multiget 向**用户信息服务**进行查询,以获取相关 id 用户的额外信息 - O(n)
|
||||
|
||||
REST API:
|
||||
|
||||
@@ -206,8 +206,8 @@ REST API 与前面的主页时间轴类似,区别只在于取出的推特是
|
||||
* 修正拼写错误
|
||||
* 规范字母大小写
|
||||
* 将查询转换为布尔操作
|
||||
* 查询**搜索集群**(例如[Lucene](https://lucene.apache.org/))检索结果:
|
||||
* 对集群内的所有服务器进行查询,将有结果的查询进行[发散聚合(Scatter gathers)](https://github.com/donnemartin/system-design-primer#under-development)
|
||||
* 查询**搜索集群**(例如[Lucene](https://lucene.apache.org/) )检索结果:
|
||||
* 对集群内的所有服务器进行查询,将有结果的查询进行[发散聚合(Scatter gathers)](https://github.com/donnemartin/system-design-primer#under-development)
|
||||
* 合并取到的条目,进行评分与排序,最终返回结果
|
||||
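The search steps listed above (normalise the query, fan it out to every server in the search cluster, then merge, score and sort) can be sketched in a few lines. This is an illustrative toy under stated assumptions, not how Lucene works: each shard is a plain dict of `doc_id -> text`, the query is treated as an AND of its terms, and the score is a simple term count. The names `search_shard` and `scatter_gather` are hypothetical.

```python
import heapq


def search_shard(shard, terms):
    """Return (score, doc_id) pairs for documents that contain every term."""
    results = []
    for doc_id, text in shard.items():
        words = text.lower().split()
        if all(term in words for term in terms):
            # Toy relevance score: total occurrences of the query terms.
            score = sum(words.count(term) for term in terms)
            results.append((score, doc_id))
    return results


def scatter_gather(shards, query, top_k=3):
    """Query every shard, then merge and rank the partial results."""
    terms = query.lower().split()      # normalise capitalisation, AND of terms
    gathered = []
    for shard in shards:               # scatter (sequential here for simplicity)
        gathered.extend(search_shard(shard, terms))
    best = heapq.nlargest(top_k, gathered)   # gather: highest scores first
    return [doc_id for _score, doc_id in best]


# Usage with two hypothetical shards.
shards = [
    {1: "hello world", 2: "hello hello world"},
    {3: "goodbye world", 4: "Hello World hello hello"},
]
print(scatter_gather(shards, "Hello world"))
```

A real cluster would issue the per-shard queries in parallel and apply a proper ranking function; the sequential loop only shows the shape of the merge step.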
|
||||
REST API:
|
||||
@@ -222,7 +222,7 @@ $ curl https://twitter.com/api/v1/search?query=hello+world
|
||||
|
||||
> 根据限制条件,找到并解决瓶颈。
|
||||
|
||||

|
||||

|
||||
|
||||
**重要提示:不要从最初设计直接跳到最终设计中!**
|
||||
|
||||
@@ -232,19 +232,19 @@ $ curl https://twitter.com/api/v1/search?query=hello+world
|
||||
|
||||
我们将会介绍一些组件来完成设计,并解决架构扩张问题。内置的负载均衡器将不做讨论以节省篇幅。
|
||||
|
||||
**为了避免重复讨论**,请参考[系统设计主题索引](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#系统设计主题的索引)相关部分来了解其要点、方案的权衡取舍以及可选的替代方案。
|
||||
**为了避免重复讨论**,请参考[系统设计主题索引](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#系统设计主题的索引) 相关部分来了解其要点、方案的权衡取舍以及可选的替代方案。
|
||||
|
||||
* [DNS](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#域名系统)
|
||||
* [负载均衡器](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#负载均衡器)
|
||||
* [水平拓展](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#水平扩展)
|
||||
* [反向代理(web 服务器)](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#反向代理web-服务器)
|
||||
* [API 服务(应用层)](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#应用层)
|
||||
* [缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#缓存)
|
||||
* [关系型数据库管理系统 (RDBMS)](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#关系型数据库管理系统rdbms)
|
||||
* [SQL 故障主从切换](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#故障切换)
|
||||
* [主从复制](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#主从复制)
|
||||
* [一致性模式](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#一致性模式)
|
||||
* [可用性模式](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#可用性模式)
|
||||
* [DNS](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#域名系统)
|
||||
* [负载均衡器](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#负载均衡器)
|
||||
* [水平拓展](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#水平扩展)
|
||||
* [反向代理(web 服务器)](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#反向代理web-服务器)
|
||||
* [API 服务(应用层)](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#应用层)
|
||||
* [缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#缓存)
|
||||
* [关系型数据库管理系统 (RDBMS) ](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#关系型数据库管理系统rdbms)
|
||||
* [SQL 故障主从切换](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#故障切换)
|
||||
* [主从复制](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#主从复制)
|
||||
* [一致性模式](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#一致性模式)
|
||||
* [可用性模式](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#可用性模式)
|
||||
|
||||
**消息输出服务**有可能成为性能瓶颈。那些有着百万数量关注者的用户可能发一条推特就需要好几分钟才能完成消息输出进程。这有可能使 @回复 这种推特时出现竞争条件,因此需要在提供服务时对推特进行重排序来降低影响。
|
||||
|
||||
@@ -267,10 +267,10 @@ $ curl https://twitter.com/api/v1/search?query=hello+world
|
||||
|
||||
高容量的写入将淹没单个的 **SQL 写主从**模式,因此需要更多的拓展技术。
|
||||
|
||||
* [联合](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#联合)
|
||||
* [分片](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#分片)
|
||||
* [非规范化](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#非规范化)
|
||||
* [SQL 调优](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#sql-调优)
|
||||
* [联合](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#联合)
|
||||
* [分片](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#分片)
|
||||
* [非规范化](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#非规范化)
|
||||
* [SQL 调优](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#sql-调优)
|
||||
|
||||
我们也可以考虑将一些数据移至 **NoSQL 数据库**。
|
||||
|
||||
@@ -280,50 +280,50 @@ $ curl https://twitter.com/api/v1/search?query=hello+world
|
||||
|
||||
#### NoSQL
|
||||
|
||||
* [键-值存储](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#键-值存储)
|
||||
* [文档类型存储](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#文档类型存储)
|
||||
* [列型存储](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#列型存储)
|
||||
* [图数据库](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#图数据库)
|
||||
* [SQL vs NoSQL](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#sql-还是-nosql)
|
||||
* [键-值存储](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#键-值存储)
|
||||
* [文档类型存储](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#文档类型存储)
|
||||
* [列型存储](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#列型存储)
|
||||
* [图数据库](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#图数据库)
|
||||
* [SQL vs NoSQL](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#sql-还是-nosql)
|
||||
|
||||
### 缓存
|
||||
|
||||
* 在哪缓存
|
||||
* [客户端缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#客户端缓存)
|
||||
* [CDN 缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#cdn-缓存)
|
||||
* [Web 服务器缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#web-服务器缓存)
|
||||
* [数据库缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#数据库缓存)
|
||||
* [应用缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#应用缓存)
|
||||
* [客户端缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#客户端缓存)
|
||||
* [CDN 缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#cdn-缓存)
|
||||
* [Web 服务器缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#web-服务器缓存)
|
||||
* [数据库缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#数据库缓存)
|
||||
* [应用缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#应用缓存)
|
||||
* 什么需要缓存
|
||||
* [数据库查询级别的缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#数据库查询级别的缓存)
|
||||
* [对象级别的缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#对象级别的缓存)
|
||||
* [数据库查询级别的缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#数据库查询级别的缓存)
|
||||
* [对象级别的缓存](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#对象级别的缓存)
|
||||
* 何时更新缓存
|
||||
* [缓存模式](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#缓存模式)
|
||||
* [直写模式](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#直写模式)
|
||||
* [回写模式](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#回写模式)
|
||||
* [刷新](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#刷新)
|
||||
* [缓存模式](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#缓存模式)
|
||||
* [直写模式](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#直写模式)
|
||||
* [回写模式](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#回写模式)
|
||||
* [刷新](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#刷新)
|
||||
|
||||
### 异步与微服务
|
||||
|
||||
* [消息队列](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#消息队列)
|
||||
* [任务队列](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#任务队列)
|
||||
* [背压](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#背压)
|
||||
* [微服务](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#微服务)
|
||||
* [消息队列](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#消息队列)
|
||||
* [任务队列](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#任务队列)
|
||||
* [背压](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#背压)
|
||||
* [微服务](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#微服务)
|
||||
|
||||
### 通信
|
||||
|
||||
* 可权衡选择的方案:
|
||||
* 与客户端的外部通信 - [使用 REST 作为 HTTP API](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#表述性状态转移rest)
|
||||
* 服务器内部通信 - [RPC](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#远程过程调用协议rpc)
|
||||
* [服务发现](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#服务发现)
|
||||
* 与客户端的外部通信 - [使用 REST 作为 HTTP API](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#表述性状态转移rest)
|
||||
* 服务器内部通信 - [RPC](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#远程过程调用协议rpc)
|
||||
* [服务发现](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#服务发现)
|
||||
|
||||
### 安全性
|
||||
|
||||
请参阅[「安全」](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#安全)一章。
|
||||
请参阅[「安全」](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#安全) 一章。
|
||||
|
||||
### 延迟数值
|
||||
|
||||
请参阅[「每个程序员都应该知道的延迟数」](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#每个程序员都应该知道的延迟数)。
|
||||
请参阅[「每个程序员都应该知道的延迟数」](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#每个程序员都应该知道的延迟数) 。
|
||||
|
||||
### 持续探讨
|
||||
|
||||
|
@@ -18,8 +18,8 @@ Without an interviewer to address clarifying questions, we'll define some use ca
|
||||
|
||||
* **User** posts a tweet
|
||||
* **Service** pushes tweets to followers, sending push notifications and emails
|
||||
* **User** views the user timeline (activity from the user)
|
||||
* **User** views the home timeline (activity from people the user is following)
|
||||
* **User** views the user timeline (activity from the user)
|
||||
* **User** views the home timeline (activity from people the user is following)
|
||||
* **User** searches keywords
|
||||
* **Service** has high availability
|
||||
|
||||
@@ -74,13 +74,13 @@ Search
|
||||
* 10 KB per tweet * 500 million tweets per day * 30 days per month
|
||||
* 5.4 PB of new tweet content in 3 years
|
||||
* 100 thousand read requests per second
|
||||
* 250 billion read requests per month * (400 requests per second / 1 billion requests per month)
|
||||
* 250 billion read requests per month * (400 requests per second / 1 billion requests per month)
|
||||
* 6,000 tweets per second
|
||||
* 15 billion tweets per month * (400 requests per second / 1 billion requests per month)
|
||||
* 15 billion tweets per month * (400 requests per second / 1 billion requests per month)
|
||||
* 60 thousand tweets delivered on fanout per second
|
||||
* 150 billion tweets delivered on fanout per month * (400 requests per second / 1 billion requests per month)
|
||||
* 150 billion tweets delivered on fanout per month * (400 requests per second / 1 billion requests per month)
|
||||
* 4,000 search requests per second
|
||||
* 10 billion searches per month * (400 requests per second / 1 billion requests per month)
|
||||
* 10 billion searches per month * (400 requests per second / 1 billion requests per month)
|
||||
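The estimates in the list above can be checked with a few lines of arithmetic. This sketch assumes decimal units (1 PB = 10^12 KB) and uses the conversion ratio quoted in the bullets, 400 requests per second for every 1 billion requests per month. The helper names are hypothetical.

```python
RPS_PER_BILLION_MONTHLY = 400  # ratio quoted in the estimates above


def per_second(monthly_count):
    """Convert a monthly count to an approximate per-second rate."""
    return monthly_count / 1e9 * RPS_PER_BILLION_MONTHLY


def storage_pb(kb_per_tweet, tweets_per_day, months, days_per_month=30):
    """New tweet content in PB over the given number of months (decimal units)."""
    return kb_per_tweet * tweets_per_day * days_per_month * months / 1e12


print("storage over 3 years (PB):", storage_pb(10, 500e6, 36))
print("read requests per second: ", per_second(250e9))
print("tweets per second:        ", per_second(15e9))
print("fanout deliveries per sec:", per_second(150e9))
print("searches per second:      ", per_second(10e9))
```

The 400-per-billion ratio follows from a month being roughly 2.5 million seconds, so the figures are order-of-magnitude estimates rather than exact rates.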
|
||||
Handy conversion guide:
|
||||
|
||||
@@ -93,7 +93,7 @@ Handy conversion guide:
|
||||
|
||||
> Outline a high level design with all important components.
|
||||
|
||||

|
||||

|
||||
|
||||
## Step 3: Design core components
|
||||
|
||||
@@ -101,13 +101,13 @@ Handy conversion guide:
|
||||
|
||||
### Use case: User posts a tweet
|
||||
|
||||
We could store the user's own tweets to populate the user timeline (activity from the user) in a [relational database](https://github.com/donnemartin/system-design-primer#relational-database-management-system-rdbms). We should discuss the [use cases and tradeoffs between choosing SQL or NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql).
|
||||
We could store the user's own tweets to populate the user timeline (activity from the user) in a [relational database](https://github.com/donnemartin/system-design-primer#relational-database-management-system-rdbms) . We should discuss the [use cases and tradeoffs between choosing SQL or NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql) .
|
||||
|
||||
Delivering tweets and building the home timeline (activity from people the user is following) is trickier. Fanning out tweets to all followers (60 thousand tweets delivered on fanout per second) will overload a traditional [relational database](https://github.com/donnemartin/system-design-primer#relational-database-management-system-rdbms). We'll probably want to choose a data store with fast writes such as a **NoSQL database** or **Memory Cache**. Reading 1 MB sequentially from memory takes about 250 microseconds, while reading from SSD takes 4x and from disk takes 80x longer.<sup><a href=https://github.com/donnemartin/system-design-primer#latency-numbers-every-programmer-should-know>1</a></sup>
|
||||
Delivering tweets and building the home timeline (activity from people the user is following) is trickier. Fanning out tweets to all followers (60 thousand tweets delivered on fanout per second) will overload a traditional [relational database](https://github.com/donnemartin/system-design-primer#relational-database-management-system-rdbms) . We'll probably want to choose a data store with fast writes such as a **NoSQL database** or **Memory Cache**. Reading 1 MB sequentially from memory takes about 250 microseconds, while reading from SSD takes 4x and from disk takes 80x longer.<sup><a href=https://github.com/donnemartin/system-design-primer#latency-numbers-every-programmer-should-know>1</a></sup>
|
||||
|
||||
We could store media such as photos or videos on an **Object Store**.
|
||||
|
||||
* The **Client** posts a tweet to the **Web Server**, running as a [reverse proxy](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
|
||||
* The **Client** posts a tweet to the **Web Server**, running as a [reverse proxy](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
|
||||
* The **Web Server** forwards the request to the **Write API** server
|
||||
* The **Write API** stores the tweet in the user's timeline on a **SQL database**
|
||||
* The **Write API** contacts the **Fan Out Service**, which does the following:
|
||||
@@ -129,9 +129,9 @@ If our **Memory Cache** is Redis, we could use a native Redis list with the foll
|
||||
| tweet_id user_id meta | tweet_id user_id meta | tweet_id user_id meta |
|
||||
```
|
||||
|
||||
The new tweet would be placed in the **Memory Cache**, which populates the user's home timeline (activity from people the user is following).
|
||||
The new tweet would be placed in the **Memory Cache**, which populates the user's home timeline (activity from people the user is following) .
|
||||
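As a rough illustration of the fan-out-on-write step, the sketch below pushes each new tweet onto a bounded, newest-first list per follower. It uses an in-process `deque` as a stand-in for the Redis list mentioned above (the equivalent Redis operations would be `LPUSH` followed by `LTRIM`). The class name `FanOutService`, the `followers` mapping and the `timeline_limit` value are assumptions made for the example.

```python
from collections import defaultdict, deque


class FanOutService:
    def __init__(self, followers, timeline_limit=3):
        # followers: key: user_id, value: list of follower ids (assumed input)
        self.followers = followers
        # One bounded list per follower; old entries fall off the far end.
        self.timelines = defaultdict(lambda: deque(maxlen=timeline_limit))

    def deliver(self, tweet_id, user_id, meta=None):
        """Place the tweet at the head of every follower's home timeline."""
        entry = (tweet_id, user_id, meta)
        delivered = 0
        for follower_id in self.followers.get(user_id, []):
            self.timelines[follower_id].appendleft(entry)  # newest first
            delivered += 1
        return delivered


# Usage: user 1 has two followers and posts four tweets.
service = FanOutService({1: [2, 3]})
for tweet_id in (101, 102, 103, 104):
    service.deliver(tweet_id, user_id=1)
print([entry[0] for entry in service.timelines[2]])
```

Bounding each list keeps memory per user predictable; the cost is that older tweets have to be rebuilt from the database when a user scrolls past the cached window.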
|
||||
We'll use a public [**REST API**](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest):
|
||||
We'll use a public [**REST API**](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest) :
|
||||
|
||||
```
|
||||
$ curl -X POST --data '{ "user_id": "123", "auth_token": "ABC123", \
|
||||
@@ -151,16 +151,16 @@ Response:
|
||||
}
|
||||
```
|
||||
|
||||
For internal communications, we could use [Remote Procedure Calls](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc).
|
||||
For internal communications, we could use [Remote Procedure Calls](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc) .
|
||||
|
||||
### Use case: User views the home timeline
|
||||
|
||||
* The **Client** posts a home timeline request to the **Web Server**
|
||||
* The **Web Server** forwards the request to the **Read API** server
|
||||
* The **Read API** server contacts the **Timeline Service**, which does the following:
|
||||
* Gets the timeline data stored in the **Memory Cache**, containing tweet ids and user ids - O(1)
|
||||
* Queries the **Tweet Info Service** with a [multiget](http://redis.io/commands/mget) to obtain additional info about the tweet ids - O(n)
|
||||
* Queries the **User Info Service** with a multiget to obtain additional info about the user ids - O(n)
|
||||
* Gets the timeline data stored in the **Memory Cache**, containing tweet ids and user ids - O(1)
|
||||
* Queries the **Tweet Info Service** with a [multiget](http://redis.io/commands/mget) to obtain additional info about the tweet ids - O(n)
|
||||
* Queries the **User Info Service** with a multiget to obtain additional info about the user ids - O(n)
|
||||
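The read path described in the steps above can be sketched as follows. Plain dicts stand in for the **Memory Cache**, the **Tweet Info Service** and the **User Info Service**, and `multiget` is a hypothetical helper that mimics a batched lookup such as Redis `MGET`, returning `None` for missing keys. All field names are assumptions.

```python
def multiget(store, keys):
    """Stand-in for a batched lookup: one call, many keys, None when missing."""
    return [store.get(key) for key in keys]


def get_home_timeline(user_id, timeline_cache, tweet_info, user_info):
    """Assemble a home timeline from cached ids plus two batched lookups."""
    entries = timeline_cache.get(user_id, [])  # O(1): list of (tweet_id, author_id)
    tweets = multiget(tweet_info, [tweet_id for tweet_id, _ in entries])  # O(n)
    users = multiget(user_info, [author_id for _, author_id in entries])  # O(n)
    timeline = []
    for tweet, user in zip(tweets, users):
        if tweet is None or user is None:
            continue  # skip entries whose details could not be found
        timeline.append({"user": user["name"], "text": tweet["text"]})
    return timeline


# Usage with hypothetical data.
timeline_cache = {7: [(101, 1), (102, 2), (103, 9)]}
tweet_info = {101: {"text": "hello"}, 102: {"text": "world"}, 103: {"text": "orphan"}}
user_info = {1: {"name": "alice"}, 2: {"name": "bob"}}
print(get_home_timeline(7, timeline_cache, tweet_info, user_info))
```

Batching the lookups means the timeline costs two round trips regardless of its length, instead of one round trip per tweet.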
|
||||
REST API:
|
||||
|
||||
@@ -223,7 +223,7 @@ The response would be similar to that of the home timeline, except for tweets ma
|
||||
|
||||
> Identify and address bottlenecks, given the constraints.
|
||||
|
||||

|
||||

|
||||
|
||||
**Important: Do not simply jump right into the final design from the initial design!**
|
||||
|
||||
@@ -235,18 +235,18 @@ We'll introduce some components to complete the design and to address scalabilit
|
||||
|
||||
*To avoid repeating discussions*, refer to the following [system design topics](https://github.com/donnemartin/system-design-primer#index-of-system-design-topics) for main talking points, tradeoffs, and alternatives:
|
||||
|
||||
* [DNS](https://github.com/donnemartin/system-design-primer#domain-name-system)
|
||||
* [CDN](https://github.com/donnemartin/system-design-primer#content-delivery-network)
|
||||
* [Load balancer](https://github.com/donnemartin/system-design-primer#load-balancer)
|
||||
* [Horizontal scaling](https://github.com/donnemartin/system-design-primer#horizontal-scaling)
|
||||
* [Web server (reverse proxy)](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
|
||||
* [API server (application layer)](https://github.com/donnemartin/system-design-primer#application-layer)
|
||||
* [Cache](https://github.com/donnemartin/system-design-primer#cache)
|
||||
* [Relational database management system (RDBMS)](https://github.com/donnemartin/system-design-primer#relational-database-management-system-rdbms)
|
||||
* [SQL write master-slave failover](https://github.com/donnemartin/system-design-primer#fail-over)
|
||||
* [Master-slave replication](https://github.com/donnemartin/system-design-primer#master-slave-replication)
|
||||
* [Consistency patterns](https://github.com/donnemartin/system-design-primer#consistency-patterns)
|
||||
* [Availability patterns](https://github.com/donnemartin/system-design-primer#availability-patterns)
|
||||
* [DNS](https://github.com/donnemartin/system-design-primer#domain-name-system)
|
||||
* [CDN](https://github.com/donnemartin/system-design-primer#content-delivery-network)
|
||||
* [Load balancer](https://github.com/donnemartin/system-design-primer#load-balancer)
|
||||
* [Horizontal scaling](https://github.com/donnemartin/system-design-primer#horizontal-scaling)
|
||||
* [Web server (reverse proxy) ](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
|
||||
* [API server (application layer) ](https://github.com/donnemartin/system-design-primer#application-layer)
|
||||
* [Cache](https://github.com/donnemartin/system-design-primer#cache)
|
||||
* [Relational database management system (RDBMS) ](https://github.com/donnemartin/system-design-primer#relational-database-management-system-rdbms)
|
||||
* [SQL write master-slave failover](https://github.com/donnemartin/system-design-primer#fail-over)
|
||||
* [Master-slave replication](https://github.com/donnemartin/system-design-primer#master-slave-replication)
|
||||
* [Consistency patterns](https://github.com/donnemartin/system-design-primer#consistency-patterns)
|
||||
* [Availability patterns](https://github.com/donnemartin/system-design-primer#availability-patterns)
|
||||
|
||||
The **Fanout Service** is a potential bottleneck. Twitter users with millions of followers could take several minutes to have their tweets go through the fanout process. This could lead to race conditions with @replies to the tweet, which we could mitigate by re-ordering the tweets at serve time.
|
||||
|
||||
@@ -269,10 +269,10 @@ Although the **Memory Cache** should reduce the load on the database, it is unli
|
||||
|
||||
The high volume of writes would overwhelm a single **SQL Write Master-Slave**, also pointing to a need for additional scaling techniques.
|
||||
|
||||
* [Federation](https://github.com/donnemartin/system-design-primer#federation)
|
||||
* [Sharding](https://github.com/donnemartin/system-design-primer#sharding)
|
||||
* [Denormalization](https://github.com/donnemartin/system-design-primer#denormalization)
|
||||
* [SQL Tuning](https://github.com/donnemartin/system-design-primer#sql-tuning)
|
||||
* [Federation](https://github.com/donnemartin/system-design-primer#federation)
|
||||
* [Sharding](https://github.com/donnemartin/system-design-primer#sharding)
|
||||
* [Denormalization](https://github.com/donnemartin/system-design-primer#denormalization)
|
||||
* [SQL Tuning](https://github.com/donnemartin/system-design-primer#sql-tuning)
|
||||
|
||||
We should also consider moving some data to a **NoSQL Database**.
|
||||
|
||||
@@ -282,50 +282,50 @@ We should also consider moving some data to a **NoSQL Database**.
|
||||
|
||||
#### NoSQL
|
||||
|
||||
* [Key-value store](https://github.com/donnemartin/system-design-primer#key-value-store)
|
||||
* [Document store](https://github.com/donnemartin/system-design-primer#document-store)
|
||||
* [Wide column store](https://github.com/donnemartin/system-design-primer#wide-column-store)
|
||||
* [Graph database](https://github.com/donnemartin/system-design-primer#graph-database)
|
||||
* [SQL vs NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql)
|
||||
* [Key-value store](https://github.com/donnemartin/system-design-primer#key-value-store)
|
||||
* [Document store](https://github.com/donnemartin/system-design-primer#document-store)
|
||||
* [Wide column store](https://github.com/donnemartin/system-design-primer#wide-column-store)
|
||||
* [Graph database](https://github.com/donnemartin/system-design-primer#graph-database)
|
||||
* [SQL vs NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql)
|
||||
|
||||
### Caching
|
||||
|
||||
* Where to cache
|
||||
* [Client caching](https://github.com/donnemartin/system-design-primer#client-caching)
|
||||
* [CDN caching](https://github.com/donnemartin/system-design-primer#cdn-caching)
|
||||
* [Web server caching](https://github.com/donnemartin/system-design-primer#web-server-caching)
|
||||
* [Database caching](https://github.com/donnemartin/system-design-primer#database-caching)
|
||||
* [Application caching](https://github.com/donnemartin/system-design-primer#application-caching)
|
||||
* [Client caching](https://github.com/donnemartin/system-design-primer#client-caching)
|
||||
* [CDN caching](https://github.com/donnemartin/system-design-primer#cdn-caching)
|
||||
* [Web server caching](https://github.com/donnemartin/system-design-primer#web-server-caching)
|
||||
* [Database caching](https://github.com/donnemartin/system-design-primer#database-caching)
|
||||
* [Application caching](https://github.com/donnemartin/system-design-primer#application-caching)
|
||||
* What to cache
|
||||
* [Caching at the database query level](https://github.com/donnemartin/system-design-primer#caching-at-the-database-query-level)
|
||||
* [Caching at the object level](https://github.com/donnemartin/system-design-primer#caching-at-the-object-level)
|
||||
* [Caching at the database query level](https://github.com/donnemartin/system-design-primer#caching-at-the-database-query-level)
|
||||
* [Caching at the object level](https://github.com/donnemartin/system-design-primer#caching-at-the-object-level)
|
||||
* When to update the cache
|
||||
* [Cache-aside](https://github.com/donnemartin/system-design-primer#cache-aside)
|
||||
* [Write-through](https://github.com/donnemartin/system-design-primer#write-through)
|
||||
* [Write-behind (write-back)](https://github.com/donnemartin/system-design-primer#write-behind-write-back)
|
||||
* [Refresh ahead](https://github.com/donnemartin/system-design-primer#refresh-ahead)
|
||||
* [Cache-aside](https://github.com/donnemartin/system-design-primer#cache-aside)
|
||||
* [Write-through](https://github.com/donnemartin/system-design-primer#write-through)
|
||||
* [Write-behind (write-back) ](https://github.com/donnemartin/system-design-primer#write-behind-write-back)
|
||||
* [Refresh ahead](https://github.com/donnemartin/system-design-primer#refresh-ahead)
|
||||
|
||||
### Asynchronism and microservices
|
||||
|
||||
* [Message queues](https://github.com/donnemartin/system-design-primer#message-queues)
|
||||
* [Task queues](https://github.com/donnemartin/system-design-primer#task-queues)
|
||||
* [Back pressure](https://github.com/donnemartin/system-design-primer#back-pressure)
|
||||
* [Microservices](https://github.com/donnemartin/system-design-primer#microservices)
|
||||
* [Message queues](https://github.com/donnemartin/system-design-primer#message-queues)
|
||||
* [Task queues](https://github.com/donnemartin/system-design-primer#task-queues)
|
||||
* [Back pressure](https://github.com/donnemartin/system-design-primer#back-pressure)
|
||||
* [Microservices](https://github.com/donnemartin/system-design-primer#microservices)
|
||||
|
||||
### Communications
|
||||
|
||||
* Discuss tradeoffs:
|
||||
* External communication with clients - [HTTP APIs following REST](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest)
|
||||
* Internal communications - [RPC](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc)
|
||||
* [Service discovery](https://github.com/donnemartin/system-design-primer#service-discovery)
|
||||
* External communication with clients - [HTTP APIs following REST](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest)
|
||||
* Internal communications - [RPC](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc)
|
||||
* [Service discovery](https://github.com/donnemartin/system-design-primer#service-discovery)
|
||||
|
||||
### Security
|
||||
|
||||
Refer to the [security section](https://github.com/donnemartin/system-design-primer#security).
|
||||
Refer to the [security section](https://github.com/donnemartin/system-design-primer#security) .
|
||||
|
||||
### Latency numbers
|
||||
|
||||
See [Latency numbers every programmer should know](https://github.com/donnemartin/system-design-primer#latency-numbers-every-programmer-should-know).
|
||||
See [Latency numbers every programmer should know](https://github.com/donnemartin/system-design-primer#latency-numbers-every-programmer-should-know) .
|
||||
|
||||
### Ongoing
|
||||
|
||||
|
@@ -1,6 +1,6 @@
|
||||
# Design a web crawler
|
||||
|
||||
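**Note: To avoid duplication, this document links directly to the relevant sections of the [index of system design topics](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#系统设计主题的索引). Refer to the linked content for general talking points, tradeoffs, and alternatives.**
|
||||
**Note: To avoid duplication, this document links directly to the relevant sections of the [index of system design topics](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#系统设计主题的索引) . Refer to the linked content for general talking points, tradeoffs, and alternatives.**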
|
||||
|
||||
## Step 1: Outline use cases and constraints
|
||||
|
||||
@@ -67,7 +67,7 @@
|
||||
|
||||
> Outline a high level design with all important components.
|
||||
|
||||

|
||||

|
||||
|
||||
## Step 3: Design core components
|
||||
|
||||
@@ -75,11 +75,11 @@
|
||||
|
||||
### Use case: Service crawls a list of urls
|
||||
|
||||
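We'll assume we have an initial list `links_to_crawl`, ranked initially by overall site popularity. If this is not a reasonable assumption, we can seed the crawler with well-known portals such as [Yahoo](https://www.yahoo.com/), [DMOZ](http://www.dmoz.org/) and expand outward from them.
|
||||
We'll assume we have an initial list `links_to_crawl`, ranked initially by overall site popularity. If this is not a reasonable assumption, we can seed the crawler with well-known portals such as [Yahoo](https://www.yahoo.com/) , [DMOZ](http://www.dmoz.org/) and expand outward from them.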
|
||||
|
||||
We'll use a table `crawled_links` to store processed links and their page signatures.
|
||||
|
||||
We could store `links_to_crawl` and `crawled_links` in a key-value **NoSQL Database**. For the ranked links in `links_to_crawl`, we could use [Redis](https://redis.io/) sorted sets to maintain a ranking of page links. We should discuss the [use cases and tradeoffs between choosing SQL or NoSQL](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#sql-还是-nosql).
|
||||
We could store `links_to_crawl` and `crawled_links` in a key-value **NoSQL Database**. For the ranked links in `links_to_crawl`, we could use [Redis](https://redis.io/) sorted sets to maintain a ranking of page links. We should discuss the [use cases and tradeoffs between choosing SQL or NoSQL](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#sql-还是-nosql) .
|
||||
|
||||
* The **Crawler Service** processes each page link by doing the following in a loop:
|
||||
* Takes the top ranked page link to crawl
|
||||
@@ -88,7 +88,7 @@
|
||||
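* This prevents us from getting into a cycle
|
||||
* Continue (move on to the next iteration)
|
||||
* Else, crawls the link
|
||||
* Adds a job to the **Reverse Index Service** queue to generate a [reverse index](https://en.wikipedia.org/wiki/Search_engine_indexing).
|
||||
* Adds a job to the **Reverse Index Service** queue to generate a [reverse index](https://en.wikipedia.org/wiki/Search_engine_indexing) .
|
||||
* Adds a job to the **Document Service** queue to generate a static title and snippet.
|
||||
* Generates the page signature
|
||||
* Removes the link from `links_to_crawl` in the **NoSQL Database**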
|
||||
@@ -99,33 +99,33 @@
|
||||
`PagesDataStore` is an abstraction within the **Crawler Service** that uses the **NoSQL Database** for storage.
|
||||
|
||||
```python
|
||||
class PagesDataStore(object):
|
||||
class PagesDataStore(object) :
|
||||
|
||||
    def __init__(self, db):
|
||||
    def __init__(self, db) :
|
||||
        self.db = db
|
||||
        ...
|
||||
|
||||
    def add_link_to_crawl(self, url):
|
||||
    def add_link_to_crawl(self, url) :
|
||||
        """Add the given link to `links_to_crawl`."""
|
||||
        ...
|
||||
|
||||
    def remove_link_to_crawl(self, url):
|
||||
    def remove_link_to_crawl(self, url) :
|
||||
        """Remove the given link from `links_to_crawl`."""
|
||||
        ...
|
||||
|
||||
    def reduce_priority_link_to_crawl(self, url):
|
||||
    def reduce_priority_link_to_crawl(self, url) :
|
||||
        """Reduce the priority of a link in `links_to_crawl` to avoid cycles."""
|
||||
        ...
|
||||
|
||||
    def extract_max_priority_page(self):
|
||||
    def extract_max_priority_page(self) :
|
||||
        """Return the highest priority link in `links_to_crawl`."""
|
||||
        ...
|
||||
|
||||
    def insert_crawled_link(self, url, signature):
|
||||
    def insert_crawled_link(self, url, signature) :
|
||||
        """Add the given link to `crawled_links`."""
|
||||
        ...
|
||||
|
||||
    def crawled_similar(self, signature):
|
||||
    def crawled_similar(self, signature) :
|
||||
        """Determine whether the signature of the page to crawl is similar to that of an already crawled page."""
|
||||
        ...
|
||||
```
|
||||
@@ -133,9 +133,9 @@ class PagesDataStore(object):
|
||||
`Page` is an abstraction within the **Crawler Service** that encapsulates a page: its url, contents, child urls, and signature.
|
||||
|
||||
```python
|
||||
class Page(object):
|
||||
class Page(object) :
|
||||
|
||||
    def __init__(self, url, contents, child_urls, signature):
|
||||
    def __init__(self, url, contents, child_urls, signature) :
|
||||
        self.url = url
|
||||
        self.contents = contents
|
||||
        self.child_urls = child_urls
|
||||
@@ -145,33 +145,33 @@ class Page(object):
|
||||
`Crawler` is the main class within the **Crawler Service**, composed of `Page` and `PagesDataStore`.
|
||||
|
||||
```python
|
||||
class Crawler(object):
|
||||
class Crawler(object) :
|
||||
|
||||
    def __init__(self, data_store, reverse_index_queue, doc_index_queue):
|
||||
    def __init__(self, data_store, reverse_index_queue, doc_index_queue) :
|
||||
        self.data_store = data_store
|
||||
        self.reverse_index_queue = reverse_index_queue
|
||||
        self.doc_index_queue = doc_index_queue
|
||||
|
||||
    def create_signature(self, page):
|
||||
    def create_signature(self, page) :
|
||||
        """Create signature based on url and contents."""
|
||||
        ...
|
||||
|
||||
    def crawl_page(self, page):
|
||||
    def crawl_page(self, page) :
|
||||
        for url in page.child_urls:
|
||||
            self.data_store.add_link_to_crawl(url)
|
||||
        page.signature = self.create_signature(page)
|
||||
        self.data_store.remove_link_to_crawl(page.url)
|
||||
        self.data_store.insert_crawled_link(page.url, page.signature)
|
||||
            self.data_store.add_link_to_crawl(url)
|
||||
        page.signature = self.create_signature(page)
|
||||
        self.data_store.remove_link_to_crawl(page.url)
|
||||
        self.data_store.insert_crawled_link(page.url, page.signature)
|
||||
|
||||
    def crawl(self):
|
||||
    def crawl(self) :
|
||||
        while True:
|
||||
            page = self.data_store.extract_max_priority_page()
|
||||
            page = self.data_store.extract_max_priority_page()
|
||||
            if page is None:
|
||||
                break
|
||||
            if self.data_store.crawled_similar(page.signature):
|
||||
                self.data_store.reduce_priority_link_to_crawl(page.url)
|
||||
            if self.data_store.crawled_similar(page.signature) :
|
||||
                self.data_store.reduce_priority_link_to_crawl(page.url)
|
||||
            else:
|
||||
                self.crawl_page(page)
|
||||
                self.crawl_page(page)
|
||||
```
|
||||
|
||||
### Handling duplicates
|
||||
@@ -186,18 +186,18 @@ class Crawler(object):
|
||||
* With 1 billion links to crawl, we could use **MapReduce** to output only entries that have a frequency of 1
|
||||
|
||||
```python
|
||||
class RemoveDuplicateUrls(MRJob):
|
||||
class RemoveDuplicateUrls(MRJob) :
|
||||
|
||||
    def mapper(self, _, line):
|
||||
    def mapper(self, _, line) :
|
||||
        yield line, 1
|
||||
|
||||
    def reducer(self, key, values):
|
||||
        total = sum(values)
|
||||
    def reducer(self, key, values) :
|
||||
        total = sum(values)
|
||||
        if total == 1:
|
||||
            yield key, total
|
||||
```
|
||||
|
||||
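Detecting duplicate content is more complex than handling duplicate urls. We could generate a signature based on the contents of the page and compare two signatures for similarity. Some potential algorithms are [Jaccard index](https://en.wikipedia.org/wiki/Jaccard_index) and [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity).
|
||||
Detecting duplicate content is more complex than handling duplicate urls. We could generate a signature based on the contents of the page and compare two signatures for similarity. Some potential algorithms are [Jaccard index](https://en.wikipedia.org/wiki/Jaccard_index) and [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) .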
|
||||
|
||||
### Determining when to update the crawl results
|
||||
|
||||
@@ -209,7 +209,7 @@ class RemoveDuplicateUrls(MRJob):
|
||||
|
||||
### Use case: User inputs a search term and sees a list of relevant results, each with a page title and snippet generated by the crawler
|
||||
|
||||
* The **Client** sends a request to the **Web Server**, running as a [reverse proxy](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#反向代理web-服务器)
|
||||
* The **Client** sends a request to the **Web Server**, running as a [reverse proxy](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#反向代理web-服务器)
|
||||
* The **Web Server** forwards the request to the **Query API** server
|
||||
* The **Query API** server does the following:
|
||||
* Parses the query parameters
|
||||
@@ -248,14 +248,14 @@ $ curl https://search.com/api/v1/search?query=hello+world
|
||||
},
|
||||
```
|
||||
|
||||
For internal communications, we could use [Remote Procedure Calls (RPC)](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#远程过程调用协议rpc)
|
||||
For internal communications, we could use [Remote Procedure Calls (RPC)](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#远程过程调用协议rpc)
|
||||
|
||||
|
||||
## Step 4: Scale the design
|
||||
|
||||
> Identify and address bottlenecks, given the constraints.
|
||||
|
||||

|
||||

|
||||
|
||||
**Important: Do not simply jump right into the final design from the initial design!**
|
||||
|
||||
@@ -265,17 +265,17 @@ $ curl https://search.com/api/v1/search?query=hello+world
|
||||
|
||||
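We'll introduce some components to complete the design and to address scalability issues. Internal load balancers are not discussed, to save space.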
|
||||
|
||||
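**To avoid repeating discussions**, the main talking points, tradeoffs, and alternatives are covered in the relevant sections of the [index of system design topics](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#系统设计主题的索引):
|
||||
**To avoid repeating discussions**, the main talking points, tradeoffs, and alternatives are covered in the relevant sections of the [index of system design topics](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#系统设计主题的索引) :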
|
||||
|
||||
* [DNS](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#域名系统)
|
||||
* [Load balancer](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#负载均衡器)
|
||||
* [Horizontal scaling](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#水平扩展)
|
||||
* [Web server (reverse proxy)](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#反向代理web-服务器)
|
||||
* [API server (application layer)](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#应用层)
|
||||
* [Cache](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#缓存)
|
||||
* [NoSQL](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#nosql)
|
||||
* [Consistency patterns](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#一致性模式)
|
||||
* [Availability patterns](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#可用性模式)
|
||||
* [DNS](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#域名系统)
|
||||
* [Load balancer](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#负载均衡器)
|
||||
* [Horizontal scaling](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#水平扩展)
|
||||
* [Web server (reverse proxy)](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#反向代理web-服务器)
|
||||
* [API server (application layer)](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#应用层)
|
||||
* [Cache](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#缓存)
|
||||
* [NoSQL](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#nosql)
|
||||
* [Consistency patterns](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#一致性模式)
|
||||
* [Availability patterns](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#可用性模式)
|
||||
|
||||
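Some searches are very popular, while others are very rare. Popular queries can be served from a **Memory Cache** such as Redis or Memcached to reduce response times and to avoid overloading the **Reverse Index Service** and **Document Service**. The **Memory Cache** is also useful for handling unevenly distributed traffic and short traffic spikes. Reading 1 MB sequentially from memory takes about 250 microseconds, while reading the same amount from SSD takes 4x and from a spinning disk more than 80x longer.<sup><a href="https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#每个程序员都应该知道的延迟数">1</a></sup>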
|
||||
|
||||
@@ -284,7 +284,7 @@ $ curl https://search.com/api/v1/search?query=hello+world
|
||||
|
||||
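* To handle the data size and request load, the **Reverse Index Service** and **Document Service** will likely need to make heavy use of sharding and replication.
|
||||
* DNS lookup can be a bottleneck; the **Crawler Service** is better off maintaining its own DNS lookup service that is refreshed periodically.
|
||||
* With [connection pooling](https://en.wikipedia.org/wiki/Connection_pool), that is, keeping many network connections open at a time, the **Crawler Service** can improve performance and reduce memory usage.
|
||||
* With [connection pooling](https://en.wikipedia.org/wiki/Connection_pool) , that is, keeping many network connections open at a time, the **Crawler Service** can improve performance and reduce memory usage.
|
||||
* Switching to [UDP](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#用户数据报协议udp) could also boost performance
|
||||
* Web crawling is bandwidth intensive; ensure there is enough bandwidth to sustain high throughput.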
|
||||
|
||||
@@ -294,61 +294,61 @@ $ curl https://search.com/api/v1/search?query=hello+world
|
||||
|
||||
### SQL scaling patterns
|
||||
|
||||
* [Read replicas](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#主从复制)
|
||||
* [Federation](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#联合)
|
||||
* [Sharding](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#分片)
|
||||
* [Denormalization](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#非规范化)
|
||||
* [SQL tuning](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#sql-调优)
|
||||
* [Read replicas](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#主从复制)
|
||||
* [Federation](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#联合)
|
||||
* [Sharding](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#分片)
|
||||
* [Denormalization](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#非规范化)
|
||||
* [SQL tuning](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#sql-调优)
|
||||
|
||||
#### NoSQL
|
||||
|
||||
* [Key-value store](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#键-值存储)
|
||||
* [Document store](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#文档类型存储)
|
||||
* [Wide column store](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#列型存储)
|
||||
* [Graph database](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#图数据库)
|
||||
* [SQL vs NoSQL](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#sql-还是-nosql)
|
||||
* [Key-value store](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#键-值存储)
|
||||
* [Document store](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#文档类型存储)
|
||||
* [Wide column store](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#列型存储)
|
||||
* [Graph database](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#图数据库)
|
||||
* [SQL vs NoSQL](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#sql-还是-nosql)
|
||||
|
||||
|
||||
### Caching
|
||||
|
||||
* Where to cache
|
||||
* [Client caching](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#客户端缓存)
|
||||
* [CDN caching](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#cdn-缓存)
|
||||
* [Web server caching](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#web-服务器缓存)
|
||||
* [Database caching](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#数据库缓存)
|
||||
* [Application caching](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#应用缓存)
|
||||
* [Client caching](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#客户端缓存)
|
||||
* [CDN caching](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#cdn-缓存)
|
||||
* [Web server caching](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#web-服务器缓存)
|
||||
* [Database caching](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#数据库缓存)
|
||||
* [Application caching](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#应用缓存)
|
||||
* What to cache
|
||||
* [Caching at the database query level](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#数据库查询级别的缓存)
|
||||
* [Caching at the object level](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#对象级别的缓存)
|
||||
* [Caching at the database query level](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#数据库查询级别的缓存)
|
||||
* [Caching at the object level](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#对象级别的缓存)
|
||||
* When to update the cache
|
||||
* [Cache-aside](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#缓存模式)
|
||||
* [Write-through](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#直写模式)
|
||||
* [Write-behind (write-back)](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#回写模式)
|
||||
* [Refresh ahead](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#刷新)
|
||||
* [Cache-aside](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#缓存模式)
|
||||
* [Write-through](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#直写模式)
|
||||
* [Write-behind (write-back)](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#回写模式)
|
||||
* [Refresh ahead](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#刷新)
|
||||
|
||||
### Asynchronism and microservices
|
||||
|
||||
* [Message queues](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#消息队列)
|
||||
* [Task queues](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#任务队列)
|
||||
* [Back pressure](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#背压)
|
||||
* [Microservices](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#微服务)
|
||||
* [Message queues](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#消息队列)
|
||||
* [Task queues](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#任务队列)
|
||||
* [Back pressure](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#背压)
|
||||
* [Microservices](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#微服务)
|
||||
|
||||
### Communications
|
||||
|
||||
* Discuss tradeoffs:
|
||||
* External communication with clients - [HTTP APIs following REST](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#表述性状态转移rest)
|
||||
* Internal communications - [RPC](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#远程过程调用协议rpc)
|
||||
* [Service discovery](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#服务发现)
|
||||
* External communication with clients - [HTTP APIs following REST](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#表述性状态转移rest)
|
||||
* Internal communications - [RPC](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#远程过程调用协议rpc)
|
||||
* [Service discovery](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#服务发现)
|
||||
|
||||
|
||||
### Security
|
||||
|
||||
Refer to the [security section](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#安全).
|
||||
Refer to the [security section](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#安全) .
|
||||
|
||||
|
||||
### Latency numbers
|
||||
|
||||
See [Latency numbers every programmer should know](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#每个程序员都应该知道的延迟数).
|
||||
See [Latency numbers every programmer should know](https://github.com/donnemartin/system-design-primer/blob/master/README-zh-Hans.md#每个程序员都应该知道的延迟数) .
|
||||
|
||||
### Ongoing
|
||||
|
||||
|
@@ -46,7 +46,7 @@ Without an interviewer to address clarifying questions, we'll define some use ca
|
||||
* For simplicity, count changes the same as new pages
|
||||
* 100 billion searches per month
|
||||
|
||||
Exercise the use of more traditional systems - don't use existing systems such as [solr](http://lucene.apache.org/solr/) or [nutch](http://nutch.apache.org/).
|
||||
Exercise the use of more traditional systems - don't use existing systems such as [solr](http://lucene.apache.org/solr/) or [nutch](http://nutch.apache.org/) .
|
||||
|
||||
#### Calculate usage
|
||||
|
||||
@@ -69,7 +69,7 @@ Handy conversion guide:
|
||||
|
||||
> Outline a high level design with all important components.
|
||||
|
||||

|
||||

|
||||
|
||||
## Step 3: Design core components
|
||||
|
||||
@@ -77,11 +77,11 @@ Handy conversion guide:
|
||||
|
||||
### Use case: Service crawls a list of urls
|
||||
|
||||
We'll assume we have an initial list of `links_to_crawl` ranked initially based on overall site popularity. If this is not a reasonable assumption, we can seed the crawler with popular sites that link to outside content such as [Yahoo](https://www.yahoo.com/), [DMOZ](http://www.dmoz.org/), etc.
|
||||
We'll assume we have an initial list of `links_to_crawl` ranked initially based on overall site popularity. If this is not a reasonable assumption, we can seed the crawler with popular sites that link to outside content such as [Yahoo](https://www.yahoo.com/) , [DMOZ](http://www.dmoz.org/) , etc.
|
||||
|
||||
We'll use a table `crawled_links` to store processed links and their page signatures.
|
||||
|
||||
We could store `links_to_crawl` and `crawled_links` in a key-value **NoSQL Database**. For the ranked links in `links_to_crawl`, we could use [Redis](https://redis.io/) with sorted sets to maintain a ranking of page links. We should discuss the [use cases and tradeoffs between choosing SQL or NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql).
|
||||
We could store `links_to_crawl` and `crawled_links` in a key-value **NoSQL Database**. For the ranked links in `links_to_crawl`, we could use [Redis](https://redis.io/) with sorted sets to maintain a ranking of page links. We should discuss the [use cases and tradeoffs between choosing SQL or NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql) .
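
The ranked `links_to_crawl` list described above can be sketched as follows. This is a minimal in-memory stand-in, not Redis itself: `RankedLinks` and its method names are hypothetical, and a plain dict plays the role of the sorted set (the member is the url, the score is its priority).

```python
class RankedLinks:
    """In-memory stand-in for a sorted set of urls ordered by score."""

    def __init__(self):
        # url -> priority score; a higher score is crawled first (assumption).
        self._scores = {}

    def add(self, url, score):
        # As with a sorted set, re-adding a member only updates its score.
        self._scores[url] = score

    def reduce_priority(self, url, amount=1.0):
        # Push a link down the ranking, for example when a similar page
        # has already been crawled.
        if url in self._scores:
            self._scores[url] -= amount

    def remove(self, url):
        self._scores.pop(url, None)

    def peek_max(self):
        """Return the highest-scored url without removing it, or None."""
        if not self._scores:
            return None
        return max(self._scores, key=self._scores.get)
```

A linear scan with `max` keeps the sketch short; a real sorted set keeps members ordered so the top entry can be read without scanning every link.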
|
||||
|
||||
* The **Crawler Service** processes each page link by doing the following in a loop:
|
||||
* Takes the top ranked page link to crawl
|
||||
@@ -90,7 +90,7 @@ We could store `links_to_crawl` and `crawled_links` in a key-value **NoSQL Datab
|
||||
* This prevents us from getting into a cycle
|
||||
* Continue
|
||||
* Else, crawls the link
|
||||
* Adds a job to the **Reverse Index Service** queue to generate a [reverse index](https://en.wikipedia.org/wiki/Search_engine_indexing)
|
||||
* Adds a job to the **Reverse Index Service** queue to generate a [reverse index](https://en.wikipedia.org/wiki/Search_engine_indexing)
|
||||
* Adds a job to the **Document Service** queue to generate a static title and snippet
|
||||
* Generates the page signature
|
||||
* Removes the link from `links_to_crawl` in the **NoSQL Database**
|
||||
@@ -101,33 +101,33 @@ We could store `links_to_crawl` and `crawled_links` in a key-value **NoSQL Datab
|
||||
`PagesDataStore` is an abstraction within the **Crawler Service** that uses the **NoSQL Database**:
|
||||
|
||||
```python
|
||||
class PagesDataStore(object):
|
||||
class PagesDataStore(object) :
|
||||
|
||||
    def __init__(self, db):
|
||||
    def __init__(self, db) :
|
||||
        self.db = db
|
||||
        ...
|
||||
|
||||
    def add_link_to_crawl(self, url):
|
||||
    def add_link_to_crawl(self, url) :
|
||||
        """Add the given link to `links_to_crawl`."""
|
||||
        ...
|
||||
|
||||
    def remove_link_to_crawl(self, url):
|
||||
    def remove_link_to_crawl(self, url) :
|
||||
        """Remove the given link from `links_to_crawl`."""
|
||||
        ...
|
||||
|
||||
    def reduce_priority_link_to_crawl(self, url):
|
||||
    def reduce_priority_link_to_crawl(self, url) :
|
||||
        """Reduce the priority of a link in `links_to_crawl` to avoid cycles."""
|
||||
        ...
|
||||
|
||||
    def extract_max_priority_page(self):
|
||||
    def extract_max_priority_page(self) :
|
||||
        """Return the highest priority link in `links_to_crawl`."""
|
||||
        ...
|
||||
|
||||
    def insert_crawled_link(self, url, signature):
|
||||
    def insert_crawled_link(self, url, signature) :
|
||||
        """Add the given link to `crawled_links`."""
|
||||
        ...
|
||||
|
||||
    def crawled_similar(self, signature):
|
||||
    def crawled_similar(self, signature) :
|
||||
        """Determine if we've already crawled a page matching the given signature."""
|
||||
        ...
|
||||
```
|
||||
@@ -135,9 +135,9 @@ class PagesDataStore(object):
|
||||
`Page` is an abstraction within the **Crawler Service** that encapsulates a page, its contents, child urls, and signature:
|
||||
|
||||
```python
|
||||
class Page(object):
|
||||
class Page(object) :
|
||||
|
||||
    def __init__(self, url, contents, child_urls, signature):
|
||||
    def __init__(self, url, contents, child_urls, signature) :
|
||||
        self.url = url
|
||||
        self.contents = contents
|
||||
        self.child_urls = child_urls
|
||||
@@ -147,33 +147,33 @@ class Page(object):
|
||||
`Crawler` is the main class within **Crawler Service**, composed of `Page` and `PagesDataStore`.
|
||||
|
||||
```python
|
||||
class Crawler(object):
|
||||
class Crawler(object) :
|
||||
|
||||
    def __init__(self, data_store, reverse_index_queue, doc_index_queue):
|
||||
    def __init__(self, data_store, reverse_index_queue, doc_index_queue) :
|
||||
        self.data_store = data_store
|
||||
        self.reverse_index_queue = reverse_index_queue
|
||||
        self.doc_index_queue = doc_index_queue
|
||||
|
||||
    def create_signature(self, page):
|
||||
    def create_signature(self, page) :
|
||||
        """Create signature based on url and contents."""
|
||||
        ...
|
||||
|
||||
    def crawl_page(self, page):
|
||||
    def crawl_page(self, page) :
|
||||
        for url in page.child_urls:
|
||||
            self.data_store.add_link_to_crawl(url)
|
||||
        page.signature = self.create_signature(page)
|
||||
        self.data_store.remove_link_to_crawl(page.url)
|
||||
        self.data_store.insert_crawled_link(page.url, page.signature)
|
||||
            self.data_store.add_link_to_crawl(url)
|
||||
        page.signature = self.create_signature(page)
|
||||
        self.data_store.remove_link_to_crawl(page.url)
|
||||
        self.data_store.insert_crawled_link(page.url, page.signature)
|
||||
|
||||
    def crawl(self):
|
||||
    def crawl(self) :
|
||||
        while True:
|
||||
            page = self.data_store.extract_max_priority_page()
|
||||
            page = self.data_store.extract_max_priority_page()
|
||||
            if page is None:
|
||||
                break
|
||||
            if self.data_store.crawled_similar(page.signature):
|
||||
                self.data_store.reduce_priority_link_to_crawl(page.url)
|
||||
            if self.data_store.crawled_similar(page.signature) :
|
||||
                self.data_store.reduce_priority_link_to_crawl(page.url)
|
||||
            else:
|
||||
                self.crawl_page(page)
|
||||
                self.crawl_page(page)
|
||||
```
|
||||
|
||||
### Handling duplicates
|
||||
@@ -188,18 +188,18 @@ We'll want to remove duplicate urls:
|
||||
* With 1 billion links to crawl, we could use **MapReduce** to output only entries that have a frequency of 1
|
||||
|
||||
```python
|
||||
class RemoveDuplicateUrls(MRJob):
|
||||
class RemoveDuplicateUrls(MRJob) :
|
||||
|
||||
    def mapper(self, _, line):
|
||||
    def mapper(self, _, line) :
|
||||
        yield line, 1
|
||||
|
||||
    def reducer(self, key, values):
|
||||
        total = sum(values)
|
||||
    def reducer(self, key, values) :
|
||||
        total = sum(values)
|
||||
        if total == 1:
|
||||
            yield key, total
|
||||
```
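
As a rough local check of what the job above computes, the sketch below runs the same map and reduce steps in plain Python, with no `mrjob` dependency. `run_local` and the free-standing `mapper` and `reducer` functions are hypothetical names introduced for illustration; the grouping dict stands in for the shuffle phase.

```python
from collections import defaultdict


def mapper(line):
    # Emit each url with a count of 1, as in the job above.
    yield line, 1


def reducer(key, values):
    # Keep only urls that were seen exactly once.
    total = sum(values)
    if total == 1:
        yield key, total


def run_local(lines):
    """Run map, shuffle (group by key), and reduce in a single process."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    results = []
    for key in sorted(grouped):
        results.extend(reducer(key, grouped[key]))
    return results
```

Note that a url seen more than once is dropped entirely rather than kept once; a reducer that yields every key regardless of its total would instead keep one copy of each url.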
|
||||
|
||||
Detecting duplicate content is more complex. We could generate a signature based on the contents of the page and compare those two signatures for similarity. Some potential algorithms are [Jaccard index](https://en.wikipedia.org/wiki/Jaccard_index) and [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity).
|
||||
Detecting duplicate content is more complex. We could generate a signature based on the contents of the page and compare those two signatures for similarity. Some potential algorithms are [Jaccard index](https://en.wikipedia.org/wiki/Jaccard_index) and [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) .
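
A minimal sketch of the two similarity measures named above, assuming pages are compared as plain text. The helper names (`shingles`, `jaccard`, `cosine`), the word-level tokenization, and the shingle size `k=3` are illustrative choices, not part of the original design.

```python
import math
import re
from collections import Counter


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard index of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cosine(text_a, text_b):
    """Cosine similarity of the word-count vectors of two texts."""
    counts_a = Counter(re.findall(r"\w+", text_a.lower()))
    counts_b = Counter(re.findall(r"\w+", text_b.lower()))
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0
```

Two pages could then be treated as near-duplicates when, for example, `jaccard(shingles(a), shingles(b))` exceeds a chosen threshold; the threshold itself is a tuning decision the source does not specify.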
|
||||
|
||||
### Determining when to update the crawl results
|
||||
|
||||
@@ -211,7 +211,7 @@ We might also choose to support a `Robots.txt` file that gives webmasters contro
|
||||
|
||||
### Use case: User inputs a search term and sees a list of relevant pages with titles and snippets
|
||||
|
||||
* The **Client** sends a request to the **Web Server**, running as a [reverse proxy](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
|
||||
* The **Client** sends a request to the **Web Server**, running as a [reverse proxy](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
|
||||
* The **Web Server** forwards the request to the **Query API** server
|
||||
* The **Query API** server does the following:
|
||||
* Parses the query
|
||||
@@ -224,7 +224,7 @@ We might also choose to support a `Robots.txt` file that gives webmasters contro
|
||||
* The **Reverse Index Service** ranks the matching results and returns the top ones
|
||||
* Uses the **Document Service** to return titles and snippets
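
To make the lookup step above more concrete, here is a small sketch of a reverse index and a query against it. The function names, the whitespace tokenization, and the ranking by total term occurrences are assumptions made for illustration; the source does not specify how the **Reverse Index Service** ranks results.

```python
from collections import defaultdict


def build_reverse_index(documents):
    """Map each lowercase term to {doc_id: number of occurrences}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def search(index, query, limit=10):
    """Return ids of documents containing every query term, best match first."""
    terms = query.lower().split()
    if not terms:
        return []
    postings = [index.get(term, {}) for term in terms]
    # Documents must appear in the posting list of every term.
    matching = set(postings[0]).intersection(*postings[1:])
    # Rank by total occurrences of the query terms (illustrative score only),
    # breaking ties by document id for a stable order.
    ranked = sorted(matching, key=lambda d: (-sum(p[d] for p in postings), d))
    return ranked[:limit]
```

The returned ids would then be passed to the **Document Service** to fetch titles and snippets.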
|
||||
|
||||
We'll use a public [**REST API**](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest):
|
||||
We'll use a public [**REST API**](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest) :
|
||||
|
||||
```
|
||||
$ curl https://search.com/api/v1/search?query=hello+world
|
||||
@@ -250,13 +250,13 @@ Response:
|
||||
},
|
||||
```
|
||||
|
||||
For internal communications, we could use [Remote Procedure Calls](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc).
|
||||
For internal communications, we could use [Remote Procedure Calls](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc) .
|
||||
|
||||
## Step 4: Scale the design
|
||||
|
||||
> Identify and address bottlenecks, given the constraints.
|
||||
|
||||

|
||||

|
||||
|
||||
**Important: Do not simply jump right into the final design from the initial design!**
|
||||
|
||||
@@ -268,15 +268,15 @@ We'll introduce some components to complete the design and to address scalabilit
|
||||
|
||||
*To avoid repeating discussions*, refer to the following [system design topics](https://github.com/donnemartin/system-design-primer#index-of-system-design-topics) for main talking points, tradeoffs, and alternatives:
|
||||
|
||||
* [DNS](https://github.com/donnemartin/system-design-primer#domain-name-system)
|
||||
* [Load balancer](https://github.com/donnemartin/system-design-primer#load-balancer)
|
||||
* [Horizontal scaling](https://github.com/donnemartin/system-design-primer#horizontal-scaling)
|
||||
* [Web server (reverse proxy)](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
|
||||
* [API server (application layer)](https://github.com/donnemartin/system-design-primer#application-layer)
|
||||
* [Cache](https://github.com/donnemartin/system-design-primer#cache)
|
||||
* [NoSQL](https://github.com/donnemartin/system-design-primer#nosql)
|
||||
* [Consistency patterns](https://github.com/donnemartin/system-design-primer#consistency-patterns)
|
||||
* [Availability patterns](https://github.com/donnemartin/system-design-primer#availability-patterns)
|
||||
* [DNS](https://github.com/donnemartin/system-design-primer#domain-name-system)
|
||||
* [Load balancer](https://github.com/donnemartin/system-design-primer#load-balancer)
|
||||
* [Horizontal scaling](https://github.com/donnemartin/system-design-primer#horizontal-scaling)
|
||||
* [Web server (reverse proxy) ](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
|
||||
* [API server (application layer) ](https://github.com/donnemartin/system-design-primer#application-layer)
|
||||
* [Cache](https://github.com/donnemartin/system-design-primer#cache)
|
||||
* [NoSQL](https://github.com/donnemartin/system-design-primer#nosql)
|
||||
* [Consistency patterns](https://github.com/donnemartin/system-design-primer#consistency-patterns)
|
||||
* [Availability patterns](https://github.com/donnemartin/system-design-primer#availability-patterns)
|
||||
|
||||
Some searches are very popular, while others are only executed once. Popular queries can be served from a **Memory Cache** such as Redis or Memcached to reduce response times and to avoid overloading the **Reverse Index Service** and **Document Service**. The **Memory Cache** is also useful for handling the unevenly distributed traffic and traffic spikes. Reading 1 MB sequentially from memory takes about 250 microseconds, while reading from SSD takes 4x and from disk takes 80x longer.<sup><a href=https://github.com/donnemartin/system-design-primer#latency-numbers-every-programmer-should-know>1</a></sup>
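
The idea of serving popular queries from memory can be sketched with a small least-recently-used cache placed in front of the slower lookup. `QueryCache` and `get_or_compute` are hypothetical names, and the in-process `OrderedDict` is only a stand-in for an external **Memory Cache** such as Redis or Memcached.

```python
from collections import OrderedDict


class QueryCache:
    """Tiny LRU cache: popular queries stay in memory, rare ones are evicted."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._entries = OrderedDict()  # query -> cached results

    def get_or_compute(self, query, compute):
        if query in self._entries:
            # Cache hit: mark the query as most recently used.
            self._entries.move_to_end(query)
            return self._entries[query]
        # Cache miss: fall through to the slower backing services.
        results = compute(query)
        self._entries[query] = results
        if len(self._entries) > self.max_size:
            # Evict the least recently used query.
            self._entries.popitem(last=False)
        return results
```

Here `compute` would be whatever calls the **Reverse Index Service** and **Document Service**; it only runs when the query is not already cached.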
|
||||
|
||||
@@ -284,7 +284,7 @@ Below are a few other optimizations to the **Crawling Service**:
|
||||
|
||||
* To handle the data size and request load, the **Reverse Index Service** and **Document Service** will likely need to make heavy use of sharding and federation.
|
||||
* DNS lookup can be a bottleneck; the **Crawler Service** can keep its own DNS lookup cache that is refreshed periodically
|
||||
* The **Crawler Service** can improve performance and reduce memory usage by keeping many open connections at a time, referred to as [connection pooling](https://en.wikipedia.org/wiki/Connection_pool)
|
||||
* The **Crawler Service** can improve performance and reduce memory usage by keeping many open connections at a time, referred to as [connection pooling](https://en.wikipedia.org/wiki/Connection_pool)
|
||||
* Switching to [UDP](https://github.com/donnemartin/system-design-primer#user-datagram-protocol-udp) could also boost performance
|
||||
* Web crawling is bandwidth intensive, ensure there is enough bandwidth to sustain high throughput
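
The DNS optimization in the list above can be sketched as a small cache with periodic refresh. This is a minimal illustration under assumptions: the `DnsCache` class, its `ttl_seconds` default, and the injectable `resolver` and `clock` parameters are hypothetical names introduced here, and `socket.gethostbyname` is only one possible resolver (it returns a single IPv4 address).

```python
import socket
import time


class DnsCache:
    """Hypothetical per-crawler DNS cache that re-resolves a host after a TTL."""

    def __init__(self, ttl_seconds=300.0, resolver=socket.gethostbyname,
                 clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.resolver = resolver
        self.clock = clock
        self._entries = {}  # hostname -> (expires_at, address)

    def resolve(self, hostname):
        now = self.clock()
        entry = self._entries.get(hostname)
        if entry is not None and now < entry[0]:
            return entry[1]  # Fresh entry: skip the network lookup.
        address = self.resolver(hostname)  # Miss or stale entry: look up again.
        self._entries[hostname] = (now + self.ttl_seconds, address)
        return address
```

A crawler that fetches many pages from the same host would call `resolve` before each request and pay for the lookup only once per TTL window. A production version would likely also respect the TTL on the DNS record itself and handle lookup failures.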

@@ -294,58 +294,58 @@ Below are a few other optimizations to the **Crawling Service**:

 ### SQL scaling patterns

-* [Read replicas](https://github.com/donnemartin/system-design-primer#master-slave-replication)
-* [Federation](https://github.com/donnemartin/system-design-primer#federation)
-* [Sharding](https://github.com/donnemartin/system-design-primer#sharding)
-* [Denormalization](https://github.com/donnemartin/system-design-primer#denormalization)
-* [SQL Tuning](https://github.com/donnemartin/system-design-primer#sql-tuning)
+* [Read replicas](https://github.com/donnemartin/system-design-primer#master-slave-replication)
+* [Federation](https://github.com/donnemartin/system-design-primer#federation)
+* [Sharding](https://github.com/donnemartin/system-design-primer#sharding)
+* [Denormalization](https://github.com/donnemartin/system-design-primer#denormalization)
+* [SQL Tuning](https://github.com/donnemartin/system-design-primer#sql-tuning)

 #### NoSQL

-* [Key-value store](https://github.com/donnemartin/system-design-primer#key-value-store)
-* [Document store](https://github.com/donnemartin/system-design-primer#document-store)
-* [Wide column store](https://github.com/donnemartin/system-design-primer#wide-column-store)
-* [Graph database](https://github.com/donnemartin/system-design-primer#graph-database)
-* [SQL vs NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql)
+* [Key-value store](https://github.com/donnemartin/system-design-primer#key-value-store)
+* [Document store](https://github.com/donnemartin/system-design-primer#document-store)
+* [Wide column store](https://github.com/donnemartin/system-design-primer#wide-column-store)
+* [Graph database](https://github.com/donnemartin/system-design-primer#graph-database)
+* [SQL vs NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql)

 ### Caching

 * Where to cache
-    * [Client caching](https://github.com/donnemartin/system-design-primer#client-caching)
-    * [CDN caching](https://github.com/donnemartin/system-design-primer#cdn-caching)
-    * [Web server caching](https://github.com/donnemartin/system-design-primer#web-server-caching)
-    * [Database caching](https://github.com/donnemartin/system-design-primer#database-caching)
-    * [Application caching](https://github.com/donnemartin/system-design-primer#application-caching)
+    * [Client caching](https://github.com/donnemartin/system-design-primer#client-caching)
+    * [CDN caching](https://github.com/donnemartin/system-design-primer#cdn-caching)
+    * [Web server caching](https://github.com/donnemartin/system-design-primer#web-server-caching)
+    * [Database caching](https://github.com/donnemartin/system-design-primer#database-caching)
+    * [Application caching](https://github.com/donnemartin/system-design-primer#application-caching)
 * What to cache
-    * [Caching at the database query level](https://github.com/donnemartin/system-design-primer#caching-at-the-database-query-level)
-    * [Caching at the object level](https://github.com/donnemartin/system-design-primer#caching-at-the-object-level)
+    * [Caching at the database query level](https://github.com/donnemartin/system-design-primer#caching-at-the-database-query-level)
+    * [Caching at the object level](https://github.com/donnemartin/system-design-primer#caching-at-the-object-level)
 * When to update the cache
-    * [Cache-aside](https://github.com/donnemartin/system-design-primer#cache-aside)
-    * [Write-through](https://github.com/donnemartin/system-design-primer#write-through)
-    * [Write-behind (write-back)](https://github.com/donnemartin/system-design-primer#write-behind-write-back)
-    * [Refresh ahead](https://github.com/donnemartin/system-design-primer#refresh-ahead)
+    * [Cache-aside](https://github.com/donnemartin/system-design-primer#cache-aside)
+    * [Write-through](https://github.com/donnemartin/system-design-primer#write-through)
+    * [Write-behind (write-back) ](https://github.com/donnemartin/system-design-primer#write-behind-write-back)
+    * [Refresh ahead](https://github.com/donnemartin/system-design-primer#refresh-ahead)

 ### Asynchronism and microservices

-* [Message queues](https://github.com/donnemartin/system-design-primer#message-queues)
-* [Task queues](https://github.com/donnemartin/system-design-primer#task-queues)
-* [Back pressure](https://github.com/donnemartin/system-design-primer#back-pressure)
-* [Microservices](https://github.com/donnemartin/system-design-primer#microservices)
+* [Message queues](https://github.com/donnemartin/system-design-primer#message-queues)
+* [Task queues](https://github.com/donnemartin/system-design-primer#task-queues)
+* [Back pressure](https://github.com/donnemartin/system-design-primer#back-pressure)
+* [Microservices](https://github.com/donnemartin/system-design-primer#microservices)

 ### Communications

 * Discuss tradeoffs:
-    * External communication with clients - [HTTP APIs following REST](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest)
-    * Internal communications - [RPC](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc)
-* [Service discovery](https://github.com/donnemartin/system-design-primer#service-discovery)
+    * External communication with clients - [HTTP APIs following REST](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest)
+    * Internal communications - [RPC](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc)
+* [Service discovery](https://github.com/donnemartin/system-design-primer#service-discovery)

 ### Security

-Refer to the [security section](https://github.com/donnemartin/system-design-primer#security).
+Refer to the [security section](https://github.com/donnemartin/system-design-primer#security) .

 ### Latency numbers

-See [Latency numbers every programmer should know](https://github.com/donnemartin/system-design-primer#latency-numbers-every-programmer-should-know).
+See [Latency numbers every programmer should know](https://github.com/donnemartin/system-design-primer#latency-numbers-every-programmer-should-know) .

 ### Ongoing

@@ -3,23 +3,23 @@
 from mrjob.job import MRJob


-class RemoveDuplicateUrls(MRJob):
+class RemoveDuplicateUrls(MRJob) :

-    def mapper(self, _, line):
+    def mapper(self, _, line) :
         yield line, 1

-    def reducer(self, key, values):
-        total = sum(values)
+    def reducer(self, key, values) :
+        total = sum(values)
         if total == 1:
             yield key, total

-    def steps(self):
+    def steps(self) :
         """Run the map and reduce steps."""
         return [
             self.mr(mapper=self.mapper,
-                    reducer=self.reducer)
+                    reducer=self.reducer)
         ]


 if __name__ == '__main__':
-    RemoveDuplicateUrls.run()
+    RemoveDuplicateUrls.run()

@@ -1,73 +1,73 @@
 # -*- coding: utf-8 -*-


-class PagesDataStore(object):
+class PagesDataStore(object) :

-    def __init__(self, db):
+    def __init__(self, db) :
         self.db = db
         pass

-    def add_link_to_crawl(self, url):
+    def add_link_to_crawl(self, url) :
         """Add the given link to `links_to_crawl`."""
         pass

-    def remove_link_to_crawl(self, url):
+    def remove_link_to_crawl(self, url) :
         """Remove the given link from `links_to_crawl`."""
         pass

-    def reduce_priority_link_to_crawl(self, url):
+    def reduce_priority_link_to_crawl(self, url) :
         """Reduce the priority of a link in `links_to_crawl` to avoid cycles."""
         pass

-    def extract_max_priority_page(self):
+    def extract_max_priority_page(self) :
         """Return the highest priority link in `links_to_crawl`."""
         pass

-    def insert_crawled_link(self, url, signature):
+    def insert_crawled_link(self, url, signature) :
         """Add the given link to `crawled_links`."""
         pass

-    def crawled_similar(self, signature):
+    def crawled_similar(self, signature) :
         """Determine if we've already crawled a page matching the given signature"""
         pass


-class Page(object):
+class Page(object) :

-    def __init__(self, url, contents, child_urls):
+    def __init__(self, url, contents, child_urls) :
         self.url = url
         self.contents = contents
         self.child_urls = child_urls
-        self.signature = self.create_signature()
+        self.signature = self.create_signature()

-    def create_signature(self):
+    def create_signature(self) :
         # Create signature based on url and contents
         pass


-class Crawler(object):
+class Crawler(object) :

-    def __init__(self, pages, data_store, reverse_index_queue, doc_index_queue):
+    def __init__(self, pages, data_store, reverse_index_queue, doc_index_queue) :
         self.pages = pages
         self.data_store = data_store
         self.reverse_index_queue = reverse_index_queue
         self.doc_index_queue = doc_index_queue

-    def crawl_page(self, page):
+    def crawl_page(self, page) :
         for url in page.child_urls:
-            self.data_store.add_link_to_crawl(url)
-        self.reverse_index_queue.generate(page)
-        self.doc_index_queue.generate(page)
-        self.data_store.remove_link_to_crawl(page.url)
-        self.data_store.insert_crawled_link(page.url, page.signature)
+            self.data_store.add_link_to_crawl(url)
+        self.reverse_index_queue.generate(page)
+        self.doc_index_queue.generate(page)
+        self.data_store.remove_link_to_crawl(page.url)
+        self.data_store.insert_crawled_link(page.url, page.signature)

-    def crawl(self):
+    def crawl(self) :
         while True:
-            page = self.data_store.extract_max_priority_page()
+            page = self.data_store.extract_max_priority_page()
             if page is None:
                 break
-            if self.data_store.crawled_similar(page.signature):
-                self.data_store.reduce_priority_link_to_crawl(page.url)
+            if self.data_store.crawled_similar(page.signature) :
+                self.data_store.reduce_priority_link_to_crawl(page.url)
             else:
-                self.crawl_page(page)
-            page = self.data_store.extract_max_priority_page()
+                self.crawl_page(page)
+            page = self.data_store.extract_max_priority_page()
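
The `create_signature` stub in the snippet above is left unimplemented. One possible way to fill it in, shown here only as a sketch, is an exact-match hash over the URL and the page contents. The standalone function form and the choice of SHA-256 are assumptions made for illustration, not something the source specifies. A hash like this only detects identical pages; detecting merely similar pages would need a different technique.

```python
import hashlib


def create_signature(url, contents):
    """Return a hex digest derived from a page's URL and contents (illustrative only)."""
    digest = hashlib.sha256()
    digest.update(url.strip().encode("utf-8"))
    digest.update(b"\x00")  # Separator so ("ab", "c") and ("a", "bc") hash differently.
    digest.update(contents.encode("utf-8"))
    return digest.hexdigest()
```

Inside `Page`, the method body could then delegate with `return create_signature(self.url, self.contents)`.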