Merge remote-tracking branch 'upstream/master'

commit ae1e452a9e

README-ja.md (22 changes)
@@ -45,7 +45,7 @@ lang: ja
 * [学習指針](#学習指針)
 * [システム設計面接課題にどのように準備するか](#システム設計面接にどのようにして臨めばいいか)
 * [システム設計課題例 **とその解答**](#システム設計課題例とその解答)
-* [オブジェクト思考設計課題例、 **とその解答**](#オブジェクト志向設計問題と解答)
+* [オブジェクト指向設計課題例、 **とその解答**](#オブジェクト指向設計問題と解答)
 * [その他のシステム設計面接課題例](#他のシステム設計面接例題)

 ## 暗記カード
@@ -59,7 +59,7 @@ lang: ja

 * [システム設計デッキ](resources/flash_cards/System%20Design.apkg)
 * [システム設計練習課題デッキ](resources/flash_cards/System%20Design%20Exercises.apkg)
-* [オブジェクト思考練習課題デッキ](resources/flash_cards/OO%20Design.apkg)
+* [オブジェクト指向練習課題デッキ](resources/flash_cards/OO%20Design.apkg)

 外出先や移動中の勉強に役立つでしょう。

@@ -216,7 +216,7 @@ lang: ja
 | 次のリンク先のいくつかのページを読む [実世界でのアーキテクチャ](#実世界のアーキテクチャ) | :+1: | :+1: | :+1: |
 | 復習する [システム設計面接課題にどのように準備するか](#システム設計面接にどのようにして臨めばいいか) | :+1: | :+1: | :+1: |
 | とりあえず一周する [システム設計課題例](#システム設計課題例とその解答) | Some | Many | Most |
-| とりあえず一周する [オブジェクト志向設計問題と解答](#オブジェクト志向設計問題と解答) | Some | Many | Most |
+| とりあえず一周する [オブジェクト指向設計問題と解答](#オブジェクト指向設計問題と解答) | Some | Many | Most |
 | 復習する [その他システム設計面接での質問例](#他のシステム設計面接例題) | Some | Many | Most |

 ## システム設計面接にどのようにして臨めばいいか
@@ -353,9 +353,9 @@ lang: ja

 

-## オブジェクト志向設計問題と解答
+## オブジェクト指向設計問題と解答

-> 頻出のオブジェクト志向システム設計面接課題と参考解答、コード及びダイアグラム
+> 頻出のオブジェクト指向システム設計面接課題と参考解答、コード及びダイアグラム
 >
 > 解答は `solutions/` フォルダ以下にリンクが貼られている

@@ -370,7 +370,7 @@ lang: ja
 | 駐車場の設計 | [解答](solutions/object_oriented_design/parking_lot/parking_lot.ipynb) |
 | チャットサーバーの設計 | [解答](solutions/object_oriented_design/online_chat/online_chat.ipynb) |
 | 円形配列の設計 | [Contribute](#contributing) |
-| オブジェクト志向システム設計問題を追加する | [Contribute](#contributing) |
+| オブジェクト指向システム設計問題を追加する | [Contribute](#contributing) |

 ## システム設計トピックス: まずはここから

@@ -392,7 +392,7 @@ lang: ja

 ### ステップ 2: スケーラビリティに関する資料を読んで復習する

-[スケーラビリティ](http://www.lecloud.net/tagged/scalability)
+[スケーラビリティ](http://www.lecloud.net/tagged/scalability/chrono)

 * ここで触れられているトピックス:
     * [クローン](http://www.lecloud.net/post/7295452622/scalability-for-dummies-part-1-clones)
@@ -989,7 +989,7 @@ NoSQL は **key-value store**、 **document-store**、 **wide column store**、

 ##### その他の参考資料、ページ: ドキュメントストア

-* [ドキュメント志向 データベース](https://en.wikipedia.org/wiki/Document-oriented_database)
+* [ドキュメント指向 データベース](https://en.wikipedia.org/wiki/Document-oriented_database)
 * [MongoDB アーキテクチャ](https://www.mongodb.com/mongodb-architecture)
 * [CouchDB アーキテクチャ](https://blog.couchdb.org/2016/08/01/couchdb-2-0-architecture/)
 * [Elasticsearch アーキテクチャ](https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up)
@@ -1173,7 +1173,7 @@ Redisはさらに以下のような機能を備えています:
 * エントリをキャッシュに追加します
 * エントリを返します

-```
+```python
 def get_user(self, user_id):
     user = cache.get("user.{0}", user_id)
     if user is None:
@@ -1216,7 +1216,7 @@ set_user(12345, {"foo":"bar"})

 キャッシュコード:

-```
+```python
 def set_user(user_id, values):
     user = db.query("UPDATE Users WHERE id = {0}", user_id, values)
     cache.set(user_id, user)
@@ -1562,7 +1562,7 @@ Latency Comparison Numbers
 L1 cache reference 0.5 ns
 Branch mispredict 5 ns
 L2 cache reference 7 ns 14x L1 cache
-Mutex lock/unlock 100 ns
+Mutex lock/unlock 25 ns
 Main memory reference 100 ns 20x L2 cache, 200x L1 cache
 Compress 1K bytes with Zippy 10,000 ns 10 us
 Send 1 KB bytes over 1 Gbps network 10,000 ns 10 us

README-zh-Hans.md
@@ -218,7 +218,7 @@ lang: zh-Hans

 * **短期** - 以系统设计主题的**广度**为目标。通过解决**一些**面试题来练习。
 * **中期** - 以系统设计主题的**广度**和**初级深度**为目标。通过解决**很多**面试题来练习。
-* **长期** - 以系统设计主题的**广度**和**高级深度**为目标。通过解决**大部分**面试题来联系。
+* **长期** - 以系统设计主题的**广度**和**高级深度**为目标。通过解决**大部分**面试题来练习。

 | | 短期 | 中期 | 长期 |
 | ---------------------------------------- | ---- | ---- | ---- |
@@ -269,20 +269,20 @@ lang: zh-Hans
 * 数据库查找
 * API 和面向对象设计

-### 第四步:度量设计
+### 第四步:扩展设计

-确认和处理瓶颈以及一些限制。举例来说就是你需要下面的这些来完成拓展性的议题吗?
+确认和处理瓶颈以及一些限制。举例来说就是你需要下面的这些来完成扩展性的议题吗?

 * 负载均衡
-* 水平拓展
+* 水平扩展
 * 缓存
 * 数据库分片

-论述可能的解决办法和代价。每件事情需要取舍。可以使用[可拓展系统的设计原则](#系统设计主题的索引)来处理瓶颈。
+论述可能的解决办法和代价。每件事情需要取舍。可以使用[可扩展系统的设计原则](#系统设计主题的索引)来处理瓶颈。

 ### 预估计算量

-你或许会被要求通过手算进行一些估算。涉及到的[附录](#附录)涉及到的是下面的这些资源:
+你或许会被要求通过手算进行一些估算。[附录](#附录)涉及到的是下面的这些资源:

 * [使用预估计算量](http://highscalability.com/blog/2011/1/26/google-pro-tip-use-back-of-the-envelope-calculations-to-choo.html)
 * [2 的次方表](#2-的次方表)
@@ -402,7 +402,7 @@ lang: zh-Hans

 ### 第二步:回顾可扩展性文章

-[可扩展性](http://www.lecloud.net/tagged/scalability)
+[可扩展性](http://www.lecloud.net/tagged/scalability/chrono)

 * 主题涵盖:
     * [Clones](http://www.lecloud.net/post/7295452622/scalability-for-dummies-part-1-clones)
@@ -1187,7 +1187,7 @@ Redis 有下列附加功能:
 - 将查找到的结果存储到缓存中
 - 返回所需内容

-```
+```python
 def get_user(self, user_id):
     user = cache.get("user.{0}", user_id)
     if user is None:
@@ -1230,7 +1230,7 @@ set_user(12345, {"foo":"bar"})

 缓存代码:

-```
+```python
 def set_user(user_id, values):
     user = db.query("UPDATE Users WHERE id = {0}", user_id, values)
     cache.set(user_id, user)
@@ -1579,7 +1579,7 @@ Latency Comparison Numbers
 L1 cache reference 0.5 ns
 Branch mispredict 5 ns
 L2 cache reference 7 ns 14x L1 cache
-Mutex lock/unlock 100 ns
+Mutex lock/unlock 25 ns
 Main memory reference 100 ns 20x L2 cache, 200x L1 cache
 Compress 1K bytes with Zippy 10,000 ns 10 us
 Send 1 KB bytes over 1 Gbps network 10,000 ns 10 us
@@ -1628,7 +1628,7 @@ Notes
 | 设计类似于 Google 的搜索引擎 | [queue.acm.org](http://queue.acm.org/detail.cfm?id=988407)<br/>[stackexchange.com](http://programmers.stackexchange.com/questions/38324/interview-question-how-would-you-implement-google-search)<br/>[ardendertat.com](http://www.ardendertat.com/2012/01/11/implementing-search-engines/)<br/>[stanford.edu](http://infolab.stanford.edu/~backrub/google.html) |
 | 设计类似于 Google 的可扩展网络爬虫 | [quora.com](https://www.quora.com/How-can-I-build-a-web-crawler-from-scratch) |
 | 设计 Google 文档 | [code.google.com](https://code.google.com/p/google-mobwrite/)<br/>[neil.fraser.name](https://neil.fraser.name/writing/sync/) |
-| 设计类似 Redis 的建值存储 | [slideshare.net](http://www.slideshare.net/dvirsky/introduction-to-redis) |
+| 设计类似 Redis 的键值存储 | [slideshare.net](http://www.slideshare.net/dvirsky/introduction-to-redis) |
 | 设计类似 Memcached 的缓存系统 | [slideshare.net](http://www.slideshare.net/oemebamo/introduction-to-memcached) |
 | 设计类似亚马逊的推荐系统 | [hulu.com](http://tech.hulu.com/blog/2011/09/19/recommendation-system.html)<br/>[ijcai13.org](http://ijcai13.org/files/tutorial_slides/td3.pdf) |
 | 设计类似 Bitly 的短链接系统 | [n00tc0d3r.blogspot.com](http://n00tc0d3r.blogspot.com/) |

README-zh-TW.md
@@ -102,16 +102,16 @@ lang: zh-TW
 <br/>
 </p>

-* [系統設計主題:從這裡開始](#系統設計主題:從這裡開始)
-* [第一步:複習關於可擴展性的影片講座](#第一步:複習關於可擴展性的影片講座)
-* [第二步:複習關於可擴展性的文章](#第二步:複習關於可擴展性的文章)
+* [系統設計主題:從這裡開始](#系統設計主題從這裡開始)
+* [第一步:複習關於可擴展性的影片講座](#第一步複習關於可擴展性的影片講座)
+* [第二步:複習關於可擴展性的文章](#第二步複習關於可擴展性的文章)
 * [下一步](#下一步)
 * [效能與可擴展性](#效能與可擴展性)
 * [延遲與吞吐量](#延遲與吞吐量)
 * [可用性與一致性](#可用性與一致性)
-* [CAP 理論](#CAP-理論)
-* [CP - 一致性與部分容錯性](#CP-一致性與部分容錯性)
-* [AP - 可用性與部分容錯性](#AP-可用性與部分容錯性)
+* [CAP 理論](#cap-理論)
+* [CP-一致性與部分容錯性](#cp-一致性與部分容錯性)
+* [AP-可用性與部分容錯性](#ap-可用性與部分容錯性)
 * [一致性模式](#一致性模式)
 * [弱一致性](#弱一致性)
 * [最終一致性](#最終一致性)
@@ -120,37 +120,37 @@ lang: zh-TW
 * [容錯轉移](#容錯轉移)
 * [複寫機制](#複寫機制)
 * [域名系統](#域名系統)
-* [內容傳遞網路(CDN)](#內容傳遞網路(CDN))
-* [推送式 CDNs](#推送式-CDNs)
-* [拉取式 CDNs](#拉取式-CDNs)
+* [內容傳遞網路(CDN)](#內容傳遞網路cdn)
+* [推送式 CDNs](#推送式-cdns)
+* [拉取式 CDNs](#拉取式-cdns)
 * [負載平衡器](#負載平衡器)
-* [主動到備用切換模式(AP Mode)](#主動到備用切換模式-(AP-Mode)-)
-* [雙主動切換模式(AA Mode)](#雙主動切換模式-(AA-Mode)-)
+* [主動到備用切換模式(AP Mode)](#主動到備用切換模式ap-mode)
+* [雙主動切換模式(AA Mode)](#雙主動切換模式aa-mode)
 * [第四層負載平衡](#第四層負載平衡)
 * [第七層負載平衡](#第七層負載平衡)
 * [水平擴展](#水平擴展)
-* [反向代理(網頁伺服器)](#反向代理(網頁伺服器))
+* [反向代理(網頁伺服器)](#反向代理網頁伺服器)
 * [負載平衡器與反向代理伺服器](#負載平衡器與反向代理伺服器)
 * [應用層](#應用層)
 * [微服務](#微服務)
 * [服務發現](#服務發現)
 * [資料庫](#資料庫)
-* [關連式資料庫管理系統(RDBMS)](#關連式資料庫管理系統(RDBMS))
+* [關連式資料庫管理系統(RDBMS)](#關連式資料庫管理系統rdbms)
 * [主從複寫](#主從複寫)
 * [主動模式複寫](#主動模式複寫)
 * [聯邦式資料庫](#聯邦式資料庫)
 * [分片](#分片)
 * [反正規化](#反正規化)
-* [SQL 優化](#SQL-優化)
-* [NoSQL](#NoSQL)
+* [SQL 優化](#sql-優化)
+* [NoSQL](#nosql)
 * [鍵-值對的資料庫](#鍵-值對的資料庫)
 * [文件類型資料庫](#文件類型資料庫)
 * [列儲存型資料庫](#列儲存型資料庫)
 * [圖形資料庫](#圖形資料庫)
-* [SQL 或 NoSQL](#SQL-或-NoSQL)
+* [SQL 或 NoSQL](#sql-或-nosql)
 * [快取](#快取)
 * [客戶端快取](#客戶端快取)
-* [CDN 快取](#CDN-快取)
+* [CDN 快取](#cdn-快取)
 * [網站伺服器快取](#網站伺服器快取)
 * [資料庫快取](#資料庫快取)
 * [應用程式快取](#應用程式快取)
@@ -159,21 +159,21 @@ lang: zh-TW
 * [什麼時候要更新快取](#什麼時候要更新快取)
 * [快取模式](#快取模式)
 * [寫入模式](#寫入模式)
-* [事後寫入(回寫)](#事後寫入(回寫))
+* [事後寫入(回寫)](#事後寫入回寫)
 * [更新式快取](#更新式快取)
 * [非同步機制](#非同步機制)
 * [訊息佇列](#訊息佇列)
 * [工作佇列](#工作佇列)
 * [背壓機制](#背壓機制)
 * [通訊](#通訊)
-* [傳輸控制通訊協定(TCP)](#傳輸控制通訊協定(TCP))
+* [傳輸控制通訊協定(TCP)](#傳輸控制通訊協定tcp)
 * [使用者資料流通訊協定 (UDP)](#使用者資料流通訊協定-udp)
 * [遠端程式呼叫 (RPC)](#遠端程式呼叫-rpc)
 * [具象狀態轉移 (REST)](#具象狀態轉移-rest)
 * [資訊安全](#資訊安全)
 * [附錄](#附錄)
 * [2 的次方表](#2-的次方表)
-* [每個開發者都應該知道的延遲數量](#每個開發者都應該知道的延遲數量)
+* [每個開發者都應該知道的延遲數量級](#每個開發者都應該知道的延遲數量級)
 * [其他的系統設計面試問題](#其他的系統設計面試問題)
 * [真實世界的架構](#真實世界的架構)
 * [公司的系統架構](#公司的系統架構)
@@ -391,7 +391,7 @@ lang: zh-TW

 ### 第二步:複習關於可擴展性的文章

-[可擴展性](http://www.lecloud.net/tagged/scalability)
+[可擴展性](http://www.lecloud.net/tagged/scalability/chrono)

 * 包含以下主題:
     * [複製](http://www.lecloud.net/post/7295452622/scalability-for-dummies-part-1-clones)
@@ -1174,7 +1174,7 @@ Redis 還有以下額外的功能:
 * 將該筆記錄儲存到快取
 * 將資料返回

-```
+```python
 def get_user(self, user_id):
     user = cache.get("user.{0}", user_id)
     if user is None:
@@ -1217,7 +1217,7 @@ set_user(12345, {"foo":"bar"})

 快取程式碼:

-```
+```python
 def set_user(user_id, values):
     user = db.query("UPDATE Users WHERE id = {0}", user_id, values)
     cache.set(user_id, user)
@@ -1563,7 +1563,7 @@ REST 關注於揭露資料,減少客戶端/伺服器之間耦合的程度,
 L1 快取參考數量級 0.5 ns
 Branch mispredict 5 ns
 L2 快取參考數量級 7 ns 14x L1 cache
-Mutex lock/unlock 100 ns
+Mutex lock/unlock 25 ns
 主記憶體參考數量級 100 ns 20x L2 cache, 200x L1 cache
 Compress 1K bytes with Zippy 10,000 ns 10 us
 Send 1 KB bytes over 1 Gbps network 10,000 ns 10 us

README.md (38 changes)
@@ -5,7 +5,7 @@ date: 2018
 lang: en
 ---

-*[English](README.md) ∙ [日本語](README-ja.md) ∙ [简体中文](README-zh-Hans.md) ∙ [繁體中文](README-zh-TW.md) | [Arabic](https://github.com/donnemartin/system-design-primer/issues/170) ∙ [Brazilian Portuguese](https://github.com/donnemartin/system-design-primer/issues/40) ∙ [German](https://github.com/donnemartin/system-design-primer/issues/186) ∙ [Greek](https://github.com/donnemartin/system-design-primer/issues/130) ∙ [Italian](https://github.com/donnemartin/system-design-primer/issues/104) ∙ [Korean](https://github.com/donnemartin/system-design-primer/issues/102) ∙ [Persian](https://github.com/donnemartin/system-design-primer/issues/110) ∙ [Polish](https://github.com/donnemartin/system-design-primer/issues/68) ∙ [Russian](https://github.com/donnemartin/system-design-primer/issues/87) ∙ [Spanish](https://github.com/donnemartin/system-design-primer/issues/136) ∙ [Thai](https://github.com/donnemartin/system-design-primer/issues/187) ∙ [Turkish](https://github.com/donnemartin/system-design-primer/issues/39) ∙ [Vietnamese](https://github.com/donnemartin/system-design-primer/issues/127) | [Add Translation](https://github.com/donnemartin/system-design-primer/issues/28)*
+*[English](README.md) ∙ [日本語](README-ja.md) ∙ [简体中文](README-zh-Hans.md) ∙ [繁體中文](README-zh-TW.md) | [العَرَبِيَّة](https://github.com/donnemartin/system-design-primer/issues/170) ∙ [বাংলা](https://github.com/donnemartin/system-design-primer/issues/220) ∙ [Português do Brasil](https://github.com/donnemartin/system-design-primer/issues/40) ∙ [Deutsch](https://github.com/donnemartin/system-design-primer/issues/186) ∙ [ελληνικά](https://github.com/donnemartin/system-design-primer/issues/130) ∙ [Italiano](https://github.com/donnemartin/system-design-primer/issues/104) ∙ [韓國語](https://github.com/donnemartin/system-design-primer/issues/102) ∙ [فارسی](https://github.com/donnemartin/system-design-primer/issues/110) ∙ [Polski](https://github.com/donnemartin/system-design-primer/issues/68) ∙ [русский язык](https://github.com/donnemartin/system-design-primer/issues/87) ∙ [Español](https://github.com/donnemartin/system-design-primer/issues/136) ∙ [ภาษาไทย](https://github.com/donnemartin/system-design-primer/issues/187) ∙ [Türkçe](https://github.com/donnemartin/system-design-primer/issues/39) ∙ [tiếng Việt](https://github.com/donnemartin/system-design-primer/issues/127) ∙ [Français](https://github.com/donnemartin/system-design-primer/issues/250) | [Add Translation](https://github.com/donnemartin/system-design-primer/issues/28)*

 # The System Design Primer

@@ -30,7 +30,7 @@ This repo is an **organized collection** of resources to help you learn how to b

 ### Learn from the open source community

-This is an early draft of a continually updated, open source project.
+This is a continually updated, open source project.

 [Contributions](#contributing) are welcome!

@@ -392,7 +392,7 @@ First, you'll need a basic understanding of common principles, learning about wh

 ### Step 2: Review the scalability article

-[Scalability](http://www.lecloud.net/tagged/scalability)
+[Scalability](http://www.lecloud.net/tagged/scalability/chrono)

 * Topics covered:
     * [Clones](http://www.lecloud.net/post/7295452622/scalability-for-dummies-part-1-clones)
@@ -946,7 +946,7 @@ Benchmarking and profiling might point you to the following optimizations.

 ### NoSQL

-NoSQL is a collection of data items represented in a **key-value store**, **document-store**, **wide column store**, or a **graph database**. Data is denormalized, and joins are generally done in the application code. Most NoSQL stores lack true ACID transactions and favor [eventual consistency](#eventual-consistency).
+NoSQL is a collection of data items represented in a **key-value store**, **document store**, **wide column store**, or a **graph database**. Data is denormalized, and joins are generally done in the application code. Most NoSQL stores lack true ACID transactions and favor [eventual consistency](#eventual-consistency).

 **BASE** is often used to describe the properties of NoSQL databases. In comparison with the [CAP Theorem](#cap-theorem), BASE chooses availability over consistency.
@@ -954,7 +954,7 @@ NoSQL is a collection of data items represented in a **key-value store**, **docu
 * **Soft state** - the state of the system may change over time, even without input.
 * **Eventual consistency** - the system will become consistent over a period of time, given that the system doesn't receive input during that period.

-In addition to choosing between [SQL or NoSQL](#sql-or-nosql), it is helpful to understand which type of NoSQL database best fits your use case(s). We'll review **key-value stores**, **document-stores**, **wide column stores**, and **graph databases** in the next section.
+In addition to choosing between [SQL or NoSQL](#sql-or-nosql), it is helpful to understand which type of NoSQL database best fits your use case(s). We'll review **key-value stores**, **document stores**, **wide column stores**, and **graph databases** in the next section.

 #### Key-value store
@@ -979,7 +979,7 @@ A key-value store is the basis for more complex systems such as a document store

 A document store is centered around documents (XML, JSON, binary, etc), where a document stores all information for a given object. Document stores provide APIs or a query language to query based on the internal structure of the document itself. *Note, many key-value stores include features for working with a value's metadata, blurring the lines between these two storage types.*

-Based on the underlying implementation, documents are organized in either collections, tags, metadata, or directories. Although documents can be organized or grouped together, documents may have fields that are completely different from each other.
+Based on the underlying implementation, documents are organized by collections, tags, metadata, or directories. Although documents can be organized or grouped together, documents may have fields that are completely different from each other.

 Some document stores like [MongoDB](https://www.mongodb.com/mongodb-architecture) and [CouchDB](https://blog.couchdb.org/2016/08/01/couchdb-2-0-architecture/) also provide a SQL-like language to perform complex queries. [DynamoDB](http://www.read.seas.harvard.edu/~kohler/class/cs239-w08/decandia07dynamo.pdf) supports both key-values and documents.
@@ -1004,7 +1004,7 @@ Document stores provide high flexibility and are often used for working with occ

 A wide column store's basic unit of data is a column (name/value pair). A column can be grouped in column families (analogous to a SQL table). Super column families further group column families. You can access each column independently with a row key, and columns with the same row key form a row. Each value contains a timestamp for versioning and for conflict resolution.

-Google introduced [Bigtable](http://www.read.seas.harvard.edu/~kohler/class/cs239-w08/chang06bigtable.pdf) as the first wide column store, which influenced the open-source [HBase](https://www.mapr.com/blog/in-depth-look-hbase-architecture) often-used in the Hadoop ecosystem, and [Cassandra](http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureIntro_c.html) from Facebook. Stores such as BigTable, HBase, and Cassandra maintain keys in lexicographic order, allowing efficient retrieval of selective key ranges.
+Google introduced [Bigtable](http://www.read.seas.harvard.edu/~kohler/class/cs239-w08/chang06bigtable.pdf) as the first wide column store, which influenced the open-source [HBase](https://www.mapr.com/blog/in-depth-look-hbase-architecture) often-used in the Hadoop ecosystem, and [Cassandra](http://docs.datastax.com/en/cassandra/3.0/cassandra/architecture/archIntro.html) from Facebook. Stores such as BigTable, HBase, and Cassandra maintain keys in lexicographic order, allowing efficient retrieval of selective key ranges.

 Wide column stores offer high availability and high scalability. They are often used for very large data sets.
@@ -1013,7 +1013,7 @@ Wide column stores offer high availability and high scalability. They are often
 * [SQL & NoSQL, a brief history](http://blog.grio.com/2015/11/sql-nosql-a-brief-history.html)
 * [Bigtable architecture](http://www.read.seas.harvard.edu/~kohler/class/cs239-w08/chang06bigtable.pdf)
 * [HBase architecture](https://www.mapr.com/blog/in-depth-look-hbase-architecture)
-* [Cassandra architecture](http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureIntro_c.html)
+* [Cassandra architecture](http://docs.datastax.com/en/cassandra/3.0/cassandra/architecture/archIntro.html)

 #### Graph database
@@ -1171,7 +1171,7 @@ The application is responsible for reading and writing from storage. The cache
 * Add entry to cache
 * Return entry

-```
+```python
 def get_user(self, user_id):
     user = cache.get("user.{0}", user_id)
     if user is None:
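The hunk above cuts `get_user` off at the cache miss. A minimal, runnable sketch of the full cache-aside flow it illustrates, using hypothetical in-memory stand-ins for the `cache` and `db` clients:

```python
# Hypothetical in-memory stand-ins for the cache and database clients.
cache = {}
db_users = {12345: {"name": "foo"}}

def get_user(user_id):
    """Cache-aside: look for an entry in the cache, fall back to the database."""
    key = "user.{0}".format(user_id)
    user = cache.get(key)               # look for entry in cache
    if user is None:
        user = db_users.get(user_id)    # cache miss: read entry from database
        if user is not None:
            cache[key] = user           # add entry to cache
    return user                         # return entry
```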
@@ -1208,13 +1208,13 @@ The application uses the cache as the main data store, reading and writing data

 Application code:

-```
+```python
 set_user(12345, {"foo":"bar"})
 ```

 Cache code:

-```
+```python
 def set_user(user_id, values):
     user = db.query("UPDATE Users WHERE id = {0}", user_id, values)
     cache.set(user_id, user)
@@ -1225,7 +1225,7 @@ Write-through is a slow overall operation due to the write operation, but subseq
 ##### Disadvantage(s): write through

 * When a new node is created due to failure or scaling, the new node will not cache entries until the entry is updated in the database. Cache-aside in conjunction with write through can mitigate this issue.
-* Most data written might never read, which can be minimized with a TTL.
+* Most data written might never be read, which can be minimized with a TTL.

 #### Write-behind (write-back)
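To make the TTL mitigation concrete, here is a sketch of the write-through `set_user` above with an expiring cache entry; it assumes redis-py's `setex` (write a value with an expiry) and reuses the hypothetical `db` client from the snippet above:

```python
import json
import redis  # assumes the redis-py client and a running Redis server

cache = redis.Redis(host="localhost", port=6379)

def set_user(user_id, values):
    user = db.query("UPDATE Users WHERE id = {0}", user_id, values)  # hypothetical db client, as above
    # Write through with a one-hour TTL so entries that are never read expire.
    cache.setex("user.{0}".format(user_id), 3600, json.dumps(user))
```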
@@ -1296,11 +1296,11 @@ Message queues receive, hold, and deliver messages. If an operation is too slow

 The user is not blocked and the job is processed in the background. During this time, the client might optionally do a small amount of processing to make it seem like the task has completed. For example, if posting a tweet, the tweet could be instantly posted to your timeline, but it could take some time before your tweet is actually delivered to all of your followers.

-**Redis** is useful as a simple message broker but messages can be lost.
+**[Redis](https://redis.io/)** is useful as a simple message broker but messages can be lost.

-**RabbitMQ** is popular but requires you to adapt to the 'AMQP' protocol and manage your own nodes.
+**[RabbitMQ](https://www.rabbitmq.com/)** is popular but requires you to adapt to the 'AMQP' protocol and manage your own nodes.

-**Amazon SQS**, is hosted but can have high latency and has the possibility of messages being delivered twice.
+**[Amazon SQS](https://aws.amazon.com/sqs/)** is hosted but can have high latency and has the possibility of messages being delivered twice.

 ### Task queues
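As a sketch of why Redis works as a bare-bones broker yet can lose messages, the following uses redis-py list operations (the queue name and job format are assumptions); a worker crash after `brpop` drops the message:

```python
import json
import redis  # assumes redis-py and a local Redis server

r = redis.Redis(host="localhost", port=6379)

def enqueue(job):
    r.lpush("jobs", json.dumps(job))   # producer: push onto the left of the list

def run_worker():
    while True:
        _key, raw = r.brpop("jobs")    # consumer: block, pop from the right (FIFO)
        job = json.loads(raw)
        # A crash here loses the message: it is already off the queue
        # but not yet processed -- the caveat noted above.
        handle(job)

def handle(job):
    print("processing", job)
```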
@@ -1560,7 +1560,7 @@ Latency Comparison Numbers
 L1 cache reference 0.5 ns
 Branch mispredict 5 ns
 L2 cache reference 7 ns 14x L1 cache
-Mutex lock/unlock 100 ns
+Mutex lock/unlock 25 ns
 Main memory reference 100 ns 20x L2 cache, 200x L1 cache
 Compress 1K bytes with Zippy 10,000 ns 10 us
 Send 1 KB bytes over 1 Gbps network 10,000 ns 10 us
@@ -1662,7 +1662,7 @@ Handy metrics based on numbers above:
 | Data store | **Redis** - Distributed memory caching system with persistence and value types | [slideshare.net](http://www.slideshare.net/dvirsky/introduction-to-redis) |
 | | | |
 | File system | **Google File System (GFS)** - Distributed file system | [research.google.com](http://static.googleusercontent.com/media/research.google.com/zh-CN/us/archive/gfs-sosp2003.pdf) |
-| File system | **Hadoop File System (HDFS)** - Open source implementation of GFS | [apache.org](https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html) |
+| File system | **Hadoop File System (HDFS)** - Open source implementation of GFS | [apache.org](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html) |
 | | | |
 | Misc | **Chubby** - Lock service for loosely-coupled distributed systems from Google | [research.google.com](http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/chubby-osdi06.pdf) |
 | Misc | **Dapper** - Distributed systems tracing infrastructure | [research.google.com](http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36356.pdf)
@@ -1685,7 +1685,7 @@ Handy metrics based on numbers above:
 | Facebook | [Scaling memcached at Facebook](https://cs.uwaterloo.ca/~brecht/courses/854-Emerging-2014/readings/key-value/fb-memcached-nsdi-2013.pdf)<br/>[TAO: Facebook’s distributed data store for the social graph](https://cs.uwaterloo.ca/~brecht/courses/854-Emerging-2014/readings/data-store/tao-facebook-distributed-datastore-atc-2013.pdf)<br/>[Facebook’s photo storage](https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf)<br/>[How Facebook Live Streams To 800,000 Simultaneous Viewers](http://highscalability.com/blog/2016/6/27/how-facebook-live-streams-to-800000-simultaneous-viewers.html) |
 | Flickr | [Flickr architecture](http://highscalability.com/flickr-architecture) |
 | Mailbox | [From 0 to one million users in 6 weeks](http://highscalability.com/blog/2013/6/18/scaling-mailbox-from-0-to-one-million-users-in-6-weeks-and-1.html) |
-| Netflix | [Netflix: What Happens When You Press Play?](http://highscalability.com/blog/2017/12/11/netflix-what-happens-when-you-press-play.html) |
+| Netflix | [A 360 Degree View Of The Entire Netflix Stack](http://highscalability.com/blog/2015/11/9/a-360-degree-view-of-the-entire-netflix-stack.html)<br/>[Netflix: What Happens When You Press Play?](http://highscalability.com/blog/2017/12/11/netflix-what-happens-when-you-press-play.html) |
 | Pinterest | [From 0 To 10s of billions of page views a month](http://highscalability.com/blog/2013/4/15/scaling-pinterest-from-0-to-10s-of-billions-of-page-views-a.html)<br/>[18 million visitors, 10x growth, 12 employees](http://highscalability.com/blog/2012/5/21/pinterest-architecture-update-18-million-visitors-10x-growth.html) |
 | Playfish | [50 million monthly users and growing](http://highscalability.com/blog/2010/9/21/playfishs-social-gaming-architecture-50-million-monthly-user.html) |
 | PlentyOfFish | [PlentyOfFish architecture](http://highscalability.com/plentyoffish-architecture) |
@@ -1693,7 +1693,7 @@ Handy metrics based on numbers above:
 | Stack Overflow | [Stack Overflow architecture](http://highscalability.com/blog/2009/8/5/stack-overflow-architecture.html) |
 | TripAdvisor | [40M visitors, 200M dynamic page views, 30TB data](http://highscalability.com/blog/2011/6/27/tripadvisor-architecture-40m-visitors-200m-dynamic-page-view.html) |
 | Tumblr | [15 billion page views a month](http://highscalability.com/blog/2012/2/13/tumblr-architecture-15-billion-page-views-a-month-and-harder.html) |
-| Twitter | [Making Twitter 10000 percent faster](http://highscalability.com/scaling-twitter-making-twitter-10000-percent-faster)<br/>[Storing 250 million tweets a day using MySQL](http://highscalability.com/blog/2011/12/19/how-twitter-stores-250-million-tweets-a-day-using-mysql.html)<br/>[150M active users, 300K QPS, a 22 MB/S firehose](http://highscalability.com/blog/2013/7/8/the-architecture-twitter-uses-to-deal-with-150m-active-users.html)<br/>[Timelines at scale](https://www.infoq.com/presentations/Twitter-Timeline-Scalability)<br/>[Big and small data at Twitter](https://www.youtube.com/watch?v=5cKTP36HVgI)<br/>[Operations at Twitter: scaling beyond 100 million users](https://www.youtube.com/watch?v=z8LU0Cj6BOU) |
+| Twitter | [Making Twitter 10000 percent faster](http://highscalability.com/scaling-twitter-making-twitter-10000-percent-faster)<br/>[Storing 250 million tweets a day using MySQL](http://highscalability.com/blog/2011/12/19/how-twitter-stores-250-million-tweets-a-day-using-mysql.html)<br/>[150M active users, 300K QPS, a 22 MB/S firehose](http://highscalability.com/blog/2013/7/8/the-architecture-twitter-uses-to-deal-with-150m-active-users.html)<br/>[Timelines at scale](https://www.infoq.com/presentations/Twitter-Timeline-Scalability)<br/>[Big and small data at Twitter](https://www.youtube.com/watch?v=5cKTP36HVgI)<br/>[Operations at Twitter: scaling beyond 100 million users](https://www.youtube.com/watch?v=z8LU0Cj6BOU)<br/>[How Twitter Handles 3,000 Images Per Second](http://highscalability.com/blog/2016/4/20/how-twitter-handles-3000-images-per-second.html) |
 | Uber | [How Uber scales their real-time market platform](http://highscalability.com/blog/2015/9/14/how-uber-scales-their-real-time-market-platform.html)<br/>[Lessons Learned From Scaling Uber To 2000 Engineers, 1000 Services, And 8000 Git Repositories](http://highscalability.com/blog/2016/10/12/lessons-learned-from-scaling-uber-to-2000-engineers-1000-ser.html) |
 | WhatsApp | [The WhatsApp architecture Facebook bought for $19 billion](http://highscalability.com/blog/2014/2/26/the-whatsapp-architecture-facebook-bought-for-19-billion.html) |
 | YouTube | [YouTube scalability](https://www.youtube.com/watch?v=w5WVu624fY8)<br/>[YouTube architecture](http://highscalability.com/youtube-architecture) |

solutions/system_design/mint/README.md
@@ -136,7 +136,7 @@ Data flow:

 * The **Client** sends a request to the **Web Server**
 * The **Web Server** forwards the request to the **Accounts API** server
-* The **Accounts API** server places a job on a **Queue** such as Amazon SQS or [RabbitMQ](https://www.rabbitmq.com/)
+* The **Accounts API** server places a job on a **Queue** such as [Amazon SQS](https://aws.amazon.com/sqs/) or [RabbitMQ](https://www.rabbitmq.com/)
 * Extracting transactions could take awhile, we'd probably want to do this [asynchronously with a queue](https://github.com/donnemartin/system-design-primer#asynchronism), although this introduces additional complexity
 * The **Transaction Extraction Service** does the following:
     * Pulls from the **Queue** and extracts transactions for the given account from the financial institution, storing the results as raw log files in the **Object Store**
@@ -182,7 +182,7 @@ For the **Category Service**, we can seed a seller-to-category dictionary with t

 **Clarify with your interviewer how much code you are expected to write**.

-```
+```python
 class DefaultCategories(Enum):

     HOUSING = 0
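The `DefaultCategories` enum above is cut off after its first member. A self-contained sketch consistent with the surrounding context; the category names beyond `HOUSING` are assumptions, while the `Target` seeding appears in the next hunk header:

```python
from enum import Enum

class DefaultCategories(Enum):

    HOUSING = 0
    FOOD = 1      # assumed additional categories
    GAS = 2
    SHOPPING = 3

# Seed the seller-to-category dictionary with the most popular sellers.
seller_category_map = {}
seller_category_map['Exxon'] = DefaultCategories.GAS
seller_category_map['Target'] = DefaultCategories.SHOPPING
```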
@@ -199,7 +199,7 @@ seller_category_map['Target'] = DefaultCategories.SHOPPING

 For sellers not initially seeded in the map, we could use a crowdsourcing effort by evaluating the manual category overrides our users provide. We could use a heap to quickly lookup the top manual override per seller in O(1) time.

-```
+```python
 class Categorizer(object):

     def __init__(self, seller_category_map, self.seller_category_crowd_overrides_map):
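The paragraph above proposes a heap for an O(1) lookup of the top manual override per seller. One way to sketch that idea with `heapq` (the helper names are hypothetical):

```python
import heapq
from collections import defaultdict

# Hypothetical crowd-override store: seller -> heap of (-votes, category),
# so the most-voted category sits at heap[0] for an O(1) peek.
seller_category_crowd_overrides_map = defaultdict(list)

def add_override(seller, category, votes):
    heapq.heappush(seller_category_crowd_overrides_map[seller], (-votes, category))

def top_override(seller):
    heap = seller_category_crowd_overrides_map[seller]
    return heap[0][1] if heap else None  # O(1) peek at the top manual override

add_override('Exxon', 'gas', 42)
add_override('Exxon', 'travel', 3)
assert top_override('Exxon') == 'gas'
```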
@@ -219,7 +219,7 @@ class Categorizer(object):

 Transaction implementation:

-```
+```python
 class Transaction(object):

     def __init__(self, created_at, seller, amount):
@@ -232,7 +232,7 @@ class Transaction(object):

 To start, we could use a generic budget template that allocates category amounts based on income tiers. Using this approach, we would not have to store the 100 million budget items identified in the constraints, only those that the user overrides. If a user overrides a budget category, which we could store the override in the `TABLE budget_overrides`.

-```
+```python
 class Budget(object):

     def __init__(self, income):
@@ -273,7 +273,7 @@ user_id timestamp seller amount

 **MapReduce** implementation:

-```
+```python
 class SpendingByCategory(MRJob):

     def __init__(self, categorizer):

solutions/system_design/pastebin/README.md
@@ -126,11 +126,11 @@ To generate the unique url, we could:
 * Alternatively, we could also take the MD5 hash of randomly-generated data
 * [**Base 62**](https://www.kerstner.at/2012/07/shortening-strings-using-base-62-encoding/) encode the MD5 hash
     * Base 62 encodes to `[a-zA-Z0-9]` which works well for urls, eliminating the need for escaping special characters
-    * There is only one hash result for the original input and and Base 62 is deterministic (no randomness involved)
+    * There is only one hash result for the original input and Base 62 is deterministic (no randomness involved)
     * Base 64 is another popular encoding but provides issues for urls because of the additional `+` and `/` characters
     * The following [Base 62 pseudocode](http://stackoverflow.com/questions/742013/how-to-code-a-url-shortener) runs in O(k) time where k is the number of digits = 7:

-```
+```python
 def base_encode(num, base=62):
     digits = []
     while num > 0
@@ -142,7 +142,7 @@ def base_encode(num, base=62):

 * Take the first 7 characters of the output, which results in 62^7 possible values and should be sufficient to handle our constraint of 360 million shortlinks in 3 years:

-```
+```python
 url = base_encode(md5(ip_address+timestamp))[:URL_LENGTH]
 ```
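The `base_encode` pseudocode above omits the loop body and the digit alphabet. A runnable sketch consistent with it, where the alphabet ordering and the sample request data are assumptions:

```python
import hashlib

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
URL_LENGTH = 7  # 62^7 values covers the 360 million shortlink constraint

def base_encode(num, base=62):
    """Runs in O(k) time where k is the number of digits."""
    digits = []
    while num > 0:
        digits.append(ALPHABET[num % base])
        num //= base
    return "".join(reversed(digits))

# Hypothetical request data standing in for ip_address and timestamp.
digest = hashlib.md5(b"198.51.100.7" + b"1514764800").hexdigest()
url = base_encode(int(digest, 16))[:URL_LENGTH]
```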
@@ -194,7 +194,7 @@ Since realtime analytics are not a requirement, we could simply **MapReduce** th

 **Clarify with your interviewer how much code you are expected to write**.

-```
+```python
 class HitCounts(MRJob):

     def extract_url(self, line):

solutions/system_design/query_cache/README.md
@@ -97,7 +97,7 @@ The cache can use a doubly-linked list: new items will be added to the head whil

 **Query API Server** implementation:

-```
+```python
 class QueryApi(object):

     def __init__(self, memory_cache, reverse_index_service):
@@ -121,7 +121,7 @@ class QueryApi(object):

 **Node** implementation:

-```
+```python
 class Node(object):

     def __init__(self, query, results):
@@ -131,7 +131,7 @@ class Node(object):

 **LinkedList** implementation:

-```
+```python
 class LinkedList(object):

     def __init__(self):
@@ -150,7 +150,7 @@ class LinkedList(object):

 **Cache** implementation:

-```
+```python
 class Cache(object):

     def __init__(self, MAX_SIZE):
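The `Cache` hunk stops at the constructor. The doubly-linked-list-plus-hash-map design described above can be sketched with `collections.OrderedDict`, which gives the same O(1) recency updates; this is a simplification, not the solution's exact implementation:

```python
from collections import OrderedDict

class Cache(object):
    """LRU cache sketch: OrderedDict stands in for the hand-rolled
    doubly-linked list + hash map, with the same asymptotics."""

    def __init__(self, MAX_SIZE):
        self.MAX_SIZE = MAX_SIZE
        self.lookup = OrderedDict()          # query -> results, in recency order

    def get(self, query):
        if query not in self.lookup:
            return None
        self.lookup.move_to_end(query)       # mark as most recently used
        return self.lookup[query]

    def set(self, query, results):
        self.lookup[query] = results
        self.lookup.move_to_end(query)
        if len(self.lookup) > self.MAX_SIZE:
            self.lookup.popitem(last=False)  # evict the least recently used entry
```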

solutions/system_design/sales_rank/README.md
@@ -102,7 +102,7 @@ We'll use a multi-step **MapReduce**:
 * **Step 1** - Transform the data to `(category, product_id), sum(quantity)`
 * **Step 2** - Perform a distributed sort

-```
+```python
 class SalesRanker(MRJob):

     def within_past_week(self, timestamp):

solutions/system_design/scaling_aws/README.md
@@ -83,7 +83,7 @@ Handy conversion guide:

 * **Web server** on EC2
 * Storage for user data
-    * [**MySQL Database**](https://github.com/donnemartin/system-design-primer#sql)
+    * [**MySQL Database**](https://github.com/donnemartin/system-design-primer#relational-database-management-system-rdbms)

 Use **Vertical Scaling**:

solutions/system_design/social_graph/README.md
@@ -62,7 +62,7 @@ Handy conversion guide:

 Without the constraint of millions of users (vertices) and billions of friend relationships (edges), we could solve this unweighted shortest path task with a general BFS approach:

-```
+```python
 class Graph(Graph):

     def shortest_path(self, source, dest):
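The hunk truncates `shortest_path`. A runnable BFS sketch of the general approach described above, where the adjacency-list representation is an assumption:

```python
from collections import deque

class Graph(object):

    def __init__(self):
        self.adjacency = {}                 # node -> list of neighbors (assumed representation)

    def shortest_path(self, source, dest):
        """Unweighted shortest path via breadth-first search."""
        if source is None or dest is None:
            return None
        prev = {source: None}               # predecessor map, doubles as the visited set
        queue = deque([source])
        while queue:
            node = queue.popleft()
            if node == dest:
                path = []                   # walk predecessors back to the source
                while node is not None:
                    path.append(node)
                    node = prev[node]
                return path[::-1]
            for neighbor in self.adjacency.get(node, []):
                if neighbor not in prev:
                    prev[neighbor] = node
                    queue.append(neighbor)
        return None                         # no path found
```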
@@ -117,7 +117,7 @@ We won't be able to fit all users on the same machine, we'll need to [shard](htt

 **Lookup Service** implementation:

-```
+```python
 class LookupService(object):

     def __init__(self):
@@ -132,7 +132,7 @@ class LookupService(object):

 **Person Server** implementation:

-```
+```python
 class PersonServer(object):

     def __init__(self):
@@ -151,7 +151,7 @@ class PersonServer(object):

 **Person** implementation:

-```
+```python
 class Person(object):

     def __init__(self, id, name, friend_ids):
@@ -162,7 +162,7 @@ class Person(object):

 **User Graph Service** implementation:

-```
+```python
 class UserGraphService(object):

     def __init__(self, lookup_service):

solutions/system_design/twitter/README.md
@@ -249,7 +249,7 @@ We'll introduce some components to complete the design and to address scalabilit

 The **Fanout Service** is a potential bottleneck. Twitter users with millions of followers could take several minutes to have their tweets go through the fanout process. This could lead to race conditions with @replies to the tweet, which we could mitigate by re-ordering the tweets at serve time.

-We could also avoid fanning out tweets from highly-followed users. Instead, we could search to find tweets for high-followed users, merge the search results with the user's home timeline results, then re-order the tweets at serve time.
+We could also avoid fanning out tweets from highly-followed users. Instead, we could search to find tweets for highly-followed users, merge the search results with the user's home timeline results, then re-order the tweets at serve time.

 Additional optimizations include:

solutions/system_design/web_crawler/README.md
@@ -100,7 +100,7 @@ We could store `links_to_crawl` and `crawled_links` in a key-value **NoSQL Datab

 `PagesDataStore` is an abstraction within the **Crawler Service** that uses the **NoSQL Database**:

-```
+```python
 class PagesDataStore(object):

     def __init__(self, db);
@@ -134,7 +134,7 @@ class PagesDataStore(object):

 `Page` is an abstraction within the **Crawler Service** that encapsulates a page, its contents, child urls, and signature:

-```
+```python
 class Page(object):

     def __init__(self, url, contents, child_urls, signature):
@@ -146,7 +146,7 @@ class Page(object):

 `Crawler` is the main class within **Crawler Service**, composed of `Page` and `PagesDataStore`.

-```
+```python
 class Crawler(object):

     def __init__(self, data_store, reverse_index_queue, doc_index_queue):
@@ -187,7 +187,7 @@ We'll want to remove duplicate urls:
 * For smaller lists we could use something like `sort | unique`
 * With 1 billion links to crawl, we could use **MapReduce** to output only entries that have a frequency of 1

-```
+```python
 class RemoveDuplicateUrls(MRJob):

     def mapper(self, _, line):
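The `RemoveDuplicateUrls` job is truncated at the mapper. A complete sketch of the frequency-of-1 idea with `mrjob`; the reducer logic is an assumption consistent with the bullet above:

```python
from mrjob.job import MRJob  # assumes the mrjob library is installed

class RemoveDuplicateUrls(MRJob):

    def mapper(self, _, line):
        yield line, 1                  # emit each url with a count of 1

    def reducer(self, key, values):
        total = sum(values)
        if total == 1:
            yield key, total           # keep only urls seen exactly once

if __name__ == '__main__':
    RemoveDuplicateUrls.run()
```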