2020-03-09 21:46:02 -04:00 
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
# Design a web crawler

*Note: This document links directly to relevant areas found in the [system design topics](https://github.com/donnemartin/system-design-primer#index-of-system-design-topics) to avoid duplication. Refer to the linked content for general talking points, tradeoffs, and alternatives.*

## Step 1: Outline use cases and constraints

> Gather requirements and scope the problem.
> Ask questions to clarify use cases and constraints.
> Discuss assumptions.

Without an interviewer to address clarifying questions, we'll define some use cases and constraints.
							 
						 
					
						
							
								
									
										
										
										
### Use cases

#### We'll scope the problem to handle only the following use cases

* **Service** crawls a list of urls:
    * Generates reverse index of words to pages containing the search terms
    * Generates titles and snippets for pages
        * Titles and snippets are static; they do not change based on the search query
* **User** inputs a search term and sees a list of relevant pages with titles and snippets the crawler generated
    * Only sketch high level components and interactions for this use case, no need to go into depth
* **Service** has high availability

#### Out of scope

* Search analytics
* Personalized search results
* Page rank
							 
						 
					
						
							
								
									
										
										
										
### Constraints and assumptions

#### State assumptions

* Traffic is not evenly distributed
    * Some searches are very popular, while others are only executed once
* Support only anonymous users
* Generating search results should be fast
* The web crawler should not get stuck in an infinite loop
    * We get stuck in an infinite loop if the graph contains a cycle
* 1 billion links to crawl
    * Pages need to be crawled regularly to ensure freshness
    * Average refresh rate of about once per week, more frequent for popular sites
        * 4 billion links crawled each month
    * Average stored size per web page: 500 KB
        * For simplicity, count changes the same as new pages
* 100 billion searches per month
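
To make the infinite-loop assumption concrete, here is a minimal sketch of a traversal over a toy link graph that contains a cycle. The graph, the page names, and the `crawl_order` helper are all hypothetical and exist only for illustration. This sketch uses a plain visited set, which is the simplest way to break a cycle; it is not the signature-based check this document uses for `crawled_links`.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
# "a" links to "c" and "c" links back to "a", so the graph has a cycle.
LINK_GRAPH = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["a"],
    "d": [],
}


def crawl_order(graph, seed):
    """Breadth-first traversal that visits each page at most once."""
    visited = {seed}
    queue = deque([seed])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for child in graph.get(page, []):
            # Skipping pages we have already seen is what breaks the cycle.
            if child not in visited:
                visited.add(child)
                queue.append(child)
    return order


print(crawl_order(LINK_GRAPH, "a"))
```

Without the `visited` check, the `a -> c -> a` edge would keep re-queueing the same pages and the loop would never terminate.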
							 
						 
					
						
							
								
									
										
										
										
Exercise the use of more traditional systems; don't use existing systems such as [Solr](http://lucene.apache.org/solr/) or [Nutch](http://nutch.apache.org/).

#### Calculate usage

**Clarify with your interviewer if you should run back-of-the-envelope usage calculations.**

* 2 PB of stored page content per month
    * 500 KB per page * 4 billion links crawled per month
    * 72 PB of stored page content in 3 years
* 1,600 write requests per second
* 40,000 search requests per second
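
The figures above can be reproduced with a few lines of arithmetic. This is a rough sketch that assumes decimal units (1 KB = 1,000 bytes, 1 PB = 10^15 bytes) and the rounded value of 2.5 million seconds per month; the variable names are illustrative only.

```python
# Assumption: decimal units, which is what the round numbers above imply.
KB = 10**3
PB = 10**15
SECONDS_PER_MONTH = 2.5e6  # rounded value used for quick estimates

page_size_bytes = 500 * KB
links_crawled_per_month = 4 * 10**9
searches_per_month = 100 * 10**9

storage_per_month_pb = page_size_bytes * links_crawled_per_month / PB
storage_3_years_pb = storage_per_month_pb * 36  # 36 months in 3 years
writes_per_second = links_crawled_per_month / SECONDS_PER_MONTH
searches_per_second = searches_per_month / SECONDS_PER_MONTH

print(storage_per_month_pb, "PB per month")
print(storage_3_years_pb, "PB in 3 years")
print(writes_per_second, "write requests per second")
print(searches_per_second, "search requests per second")
```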
							 
						 
					
						
							
								
									
										
										
										
Handy conversion guide:

* 2.5 million seconds per month
* 1 request per second = 2.5 million requests per month
* 40 requests per second = 100 million requests per month
* 400 requests per second = 1 billion requests per month
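
As a sanity check on where the 2.5 million figure comes from, the sketch below assumes a 30-day month, which gives roughly 2.59 million seconds, and then applies the rounded value to the request rates in the guide. The `requests_per_month` helper is a hypothetical name used only here.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_MONTH = 30  # assumption: a 30-day month

# Exact value for a 30-day month; the guide rounds this down to 2.5 million.
seconds_per_month = SECONDS_PER_DAY * DAYS_PER_MONTH


def requests_per_month(requests_per_second, seconds=2_500_000):
    """Monthly request volume using the rounded seconds-per-month figure."""
    return requests_per_second * seconds


print(seconds_per_month)
for rps in (1, 40, 400):
    print(rps, "requests per second ->", requests_per_month(rps), "per month")
```

Rounding down to 2.5 million keeps the mental arithmetic simple at the cost of underestimating monthly volume by a few percent.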
							 
						 
					
						
							
								
									
										
										
										
## Step 2: Create a high level design

> Outline a high level design with all important components.

## Step 3: Design core components

> Dive into details for each core component.

### Use case: Service crawls a list of urls

We'll assume we have an initial list of `links_to_crawl` ranked by overall site popularity. If this is not a reasonable assumption, we can seed the crawler with popular sites that link to outside content such as [Yahoo](https://www.yahoo.com/), [DMOZ](http://www.dmoz.org/), etc.

We'll use a table `crawled_links` to store processed links and their page signatures.

We could store `links_to_crawl` and `crawled_links` in a key-value **NoSQL Database**. For the ranked links in `links_to_crawl`, we could use [Redis](https://redis.io/) with sorted sets to maintain a ranking of page links. We should discuss the [use cases and tradeoffs between choosing SQL or NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql).
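
To illustrate what a sorted set gives us for `links_to_crawl`, here is a small in-memory stand-in. It is only a sketch: the `RankedLinks` class, its method names, the scores, and the example URLs are assumptions made for illustration, and it does not talk to Redis. The comments name the Redis sorted-set commands that offer comparable behavior.

```python
class RankedLinks:
    """In-memory stand-in for a sorted set: url -> score, highest score first."""

    def __init__(self):
        self._scores = {}

    def add(self, url, score):
        # Comparable to ZADD: insert a member or update its score.
        self._scores[url] = score

    def reduce_priority(self, url, amount=1.0):
        # Comparable to ZINCRBY with a negative increment.
        self._scores[url] -= amount

    def pop_max(self):
        # Comparable to ZPOPMAX: remove and return the highest-scored member.
        # This scan is O(n); a real sorted set keeps members ordered instead.
        if not self._scores:
            return None
        url = max(self._scores, key=self._scores.get)
        return url, self._scores.pop(url)


links_to_crawl = RankedLinks()
links_to_crawl.add("https://example.com/popular", 0.9)
links_to_crawl.add("https://example.com/niche", 0.2)
links_to_crawl.reduce_priority("https://example.com/popular", 0.8)
print(links_to_crawl.pop_max())
```

After its priority is reduced, the formerly top-ranked link falls below the other entry, so the next pop returns the other link first.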
							 
						 
					
						
							
								
									
										
										
										
* The **Crawler Service** processes each page link by doing the following in a loop:
    * Takes the top ranked page link to crawl
        * Checks `crawled_links` in the **NoSQL Database** for an entry with a similar page signature
            * If we have a similar page, reduces the priority of the page link
                * This prevents us from getting into a cycle
                * Continue
            * Else, crawls the link
                * Adds a job to the **Reverse Index Service** queue to generate a [reverse index](https://en.wikipedia.org/wiki/Search_engine_indexing)
                * Adds a job to the **Document Service** queue to generate a static title and snippet
                * Generates the page signature
                * Removes the link from `links_to_crawl` in the **NoSQL Database**
                * Inserts the page link and signature to `crawled_links` in the **NoSQL Database**
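
The steps above do not say how a page signature is generated. One simple possibility, shown here purely as an assumption, is to hash the page contents after light normalization. The `page_signature` function name and the normalization rules are hypothetical, and an exact-match hash like this only catches pages that are identical after normalization; detecting pages that are merely similar would need a fuzzier technique.

```python
import hashlib
import re


def page_signature(contents):
    """Return a hex digest of the page contents after light normalization."""
    # Collapse runs of whitespace and lowercase the text so that trivial
    # formatting differences do not change the signature.
    normalized = re.sub(r"\s+", " ", contents).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


first = page_signature("Hello   World\n")
second = page_signature("hello world")
third = page_signature("hello there")
print(first == second)  # same contents after normalization
print(first == third)   # different contents
```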
							 
						 
					
						
							
								
									
										
										
										
**Clarify with your interviewer how much code you are expected to write**.

`PagesDataStore` is an abstraction within the **Crawler Service** that uses the **NoSQL Database**:
							 
						 
					
						
							
								
									
										
										
										
											2017-03-04 21:06:58 -08:00 
										
									 
								 
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
											2019-05-07 06:24:41 -04:00 
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
								```python
							 
						 
					
						
							
								
									
										
										
										
											2017-03-04 21:06:58 -08:00 
										
									 
								 
							 
							
								
							 
							
								 
							 
							
							
								class PagesDataStore(object):
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								    def __init__ (self, db);
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								        self.db = db
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								        ...
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								    def add_link_to_crawl(self, url):
							 
						 
					
						
							
								
									
										
										
										
											2020-03-09 21:46:02 -04:00 
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
								        """Add the given link to `links_to_crawl` ."""
							 
						 
					
						
							
								
									
										
										
										
											2017-03-04 21:06:58 -08:00 
										
									 
								 
							 
							
								
							 
							
								 
							 
							
							
								        ...
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								    def remove_link_to_crawl(self, url):
							 
						 
					
						
							
								
									
										
										
										
											2020-03-09 21:46:02 -04:00 
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
								        """Remove the given link from `links_to_crawl` ."""
							 
						 
					
						
							
								
									
										
										
										
											2017-03-04 21:06:58 -08:00 
										
									 
								 
							 
							
								
							 
							
								 
							 
							
							
								        ...
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								    def reduce_priority_link_to_crawl(self, url)
							 
						 
					
						
							
								
									
										
										
										
											2020-03-09 21:46:02 -04:00 
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
								        """Reduce the priority of a link in `links_to_crawl`  to avoid cycles."""
							 
						 
					
						
							
								
									
										
										
										
											2017-03-04 21:06:58 -08:00 
										
									 
								 
							 
							
								
							 
							
								 
							 
							
							
								        ...
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								    def extract_max_priority_page(self):
							 
						 
					
						
							
								
									
										
										
										
											2020-03-09 21:46:02 -04:00 
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
								        """Return the highest priority link in `links_to_crawl` ."""
							 
						 
					
						
							
								
									
										
										
										
											2017-03-04 21:06:58 -08:00 
										
									 
								 
							 
							
								
							 
							
								 
							 
							
							
								        ...
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								    def insert_crawled_link(self, url, signature):
							 
						 
					
						
							
								
									
										
										
										
											2020-03-09 21:46:02 -04:00 
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
								        """Add the given link to `crawled_links` ."""
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
							 
							
								 
							 
							
							
								        ...
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								    def crawled_similar(self, signature):
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
								        """Determine if we've already crawled a page matching the given signature"""
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
							 
							
								 
							 
							
							
								        ...
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								```
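
To make the interface above more concrete, here is a minimal in-memory sketch. It is only an illustration under stated assumptions: the class name `InMemoryPagesDataStore`, the `priority` and `penalty` parameters, and the choice to return a URL rather than a `Page` are all hypothetical, and a production store would be backed by a database rather than Python dicts.

```python
class InMemoryPagesDataStore(object):
    """Toy stand-in for PagesDataStore; a real store would be a database."""

    def __init__(self):
        self.links_to_crawl = {}   # url -> priority (higher is crawled sooner)
        self.crawled_links = {}    # url -> signature

    def add_link_to_crawl(self, url, priority=1.0):
        # Keep the existing priority if the link is already queued.
        self.links_to_crawl.setdefault(url, priority)

    def remove_link_to_crawl(self, url):
        self.links_to_crawl.pop(url, None)

    def reduce_priority_link_to_crawl(self, url, penalty=0.5):
        # Halving the priority is an arbitrary choice for this sketch.
        if url in self.links_to_crawl:
            self.links_to_crawl[url] *= penalty

    def extract_max_priority_page(self):
        """Return the highest priority url, or None when nothing is queued."""
        if not self.links_to_crawl:
            return None
        return max(self.links_to_crawl, key=self.links_to_crawl.get)

    def insert_crawled_link(self, url, signature):
        self.crawled_links[url] = signature

    def crawled_similar(self, signature):
        # Exact match only; near-duplicate detection needs a similarity measure.
        return signature in self.crawled_links.values()
```

The linear scan in `extract_max_priority_page` keeps the sketch short; a heap or a sorted set would be a more natural fit once the queue grows.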
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
								`Page` is an abstraction within the **Crawler Service** that encapsulates a page, its contents, child URLs, and signature:
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
								```python
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
							 
							
								 
							 
							
							
								class Page(object):
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								    def __init__(self, url, contents, child_urls, signature):
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								        self.url = url
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								        self.contents = contents
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								        self.child_urls = child_urls
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								        self.signature = signature
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								```
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
								`Crawler` is the main class within the **Crawler Service**, composed of `Page` and `PagesDataStore`.
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
								```python
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
							 
							
								 
							 
							
							
								class Crawler(object):
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								    def __init__(self, data_store, reverse_index_queue, doc_index_queue):
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								        self.data_store = data_store
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								        self.reverse_index_queue = reverse_index_queue
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								        self.doc_index_queue = doc_index_queue
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								    def create_signature(self, page):
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
								        """Create signature based on url and contents."""
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
							 
							
								 
							 
							
							
								        ...
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								    def crawl_page(self, page):
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								        for url in page.child_urls:
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								            self.data_store.add_link_to_crawl(url)
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								        page.signature = self.create_signature(page)
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								        self.data_store.remove_link_to_crawl(page.url)
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								        self.data_store.insert_crawled_link(page.url, page.signature)
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								    def crawl(self):
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								        while True:
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								            page = self.data_store.extract_max_priority_page()
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								            if page is None:
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								                break
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								            if self.data_store.crawled_similar(page.signature):
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								                self.data_store.reduce_priority_link_to_crawl(page.url)
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								            else:
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								                self.crawl_page(page)
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								```
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
								### Handling duplicates
 
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
								We need to be careful that the web crawler doesn't get stuck in an infinite loop, which can happen when the link graph contains a cycle.
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
								**Clarify with your interviewer how much code you are expected to write**.
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
								We'll want to remove duplicate URLs:
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
								*  For smaller lists we could use something like `sort | uniq`
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								*  With 1 billion links to crawl, we could use **MapReduce** to output only entries that have a frequency of 1:
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
								```python
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
							 
							
								 
							 
							
							
								from mrjob.job import MRJob


								class RemoveDuplicateUrls(MRJob):
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								    def mapper(self, _, line):
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								        yield line, 1
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								    def reducer(self, key, values):
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
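								        # Count how often this url was seen.  Only urls seen exactly once
								        # are emitted below; urls with repeats are dropped entirely.
								        total = sum(values)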
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								        if total == 1:
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								            yield key, total
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								```
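
For a list that fits in memory, the same counting idea can be sketched in plain Python. The helper names `urls_seen_once` and `unique_urls` are made up for this illustration. The two helpers differ in an important way: keeping only entries with a frequency of 1 discards every URL that was seen more than once, whereas deduplicating keeps one copy of each URL.

```python
from collections import Counter


def urls_seen_once(urls):
    """Mirror the reducer above: keep only URLs whose count is exactly 1."""
    counts = Counter(urls)
    return sorted(url for url, total in counts.items() if total == 1)


def unique_urls(urls):
    """Keep one copy of every URL, similar in spirit to `sort | uniq`."""
    return sorted(set(urls))


urls = ["https://a.example", "https://b.example", "https://a.example"]
print(urls_seen_once(urls))
print(unique_urls(urls))
```

Which behavior is wanted depends on what the list represents, so it is worth confirming with the interviewer before choosing one.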
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
								Detecting duplicate content is more complex.  We could generate a signature based on the contents of each page and compare the signatures of two pages for similarity.  Some potential algorithms are [Jaccard index](https://en.wikipedia.org/wiki/Jaccard_index) and [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity).
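
As a rough sketch of the Jaccard approach, the snippet below treats each page's signature as a set of overlapping word triples ("shingles") and compares two sets. The function names, the shingle size of 3, and the 0.8 threshold are assumptions chosen for illustration, not values taken from this design.

```python
def shingles(text, size=3):
    """Return the set of overlapping word n-grams ("shingles") in text."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard_index(a, b):
    """Size of the intersection over size of the union (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


page_a = "the quick brown fox jumps over the lazy dog"
page_b = "the quick brown fox leaps over the lazy dog"
similarity = jaccard_index(shingles(page_a), shingles(page_b))
is_near_duplicate = similarity >= 0.8   # threshold is an arbitrary assumption
```

Comparing every pair of pages this way does not scale to billions of documents; at that size the sets are usually compressed into compact fingerprints first, which is beyond the scope of this sketch.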
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
								### Determining when to update the crawl results
 
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
								Pages need to be crawled regularly to ensure freshness.  Crawl results could have a `timestamp` field that indicates the last time a page was crawled.  After a default time period, say one week, all pages should be refreshed.  Frequently updated or more popular sites could be refreshed at shorter intervals.
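
A minimal sketch of the timestamp check might look like the following. The function name `needs_refresh` and the per-page `refresh_interval` argument are hypothetical; only the one-week default comes from the text above.

```python
from datetime import datetime, timedelta

DEFAULT_REFRESH = timedelta(weeks=1)


def needs_refresh(last_crawled, now, refresh_interval=DEFAULT_REFRESH):
    """Return True once at least refresh_interval has passed since the last crawl."""
    return now - last_crawled >= refresh_interval


now = datetime(2020, 3, 9)
print(needs_refresh(datetime(2020, 3, 1), now))                      # 8 days old
print(needs_refresh(datetime(2020, 3, 5), now))                      # 4 days old
print(needs_refresh(datetime(2020, 3, 5), now, timedelta(days=1)))   # popular site
```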
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
								Although we won't dive into details on analytics, we could do some data mining to determine the mean time before a particular page is updated, and use that statistic to determine how often to re-crawl the page.
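
One simple way to turn that statistic into a schedule is sketched below, assuming we have recorded the times at which a page's content was observed to change. The function names and the one-hour lower bound are assumptions for illustration only.

```python
from datetime import datetime, timedelta


def mean_update_interval(change_times):
    """Average gap between consecutive observed changes, or None if fewer than two."""
    ordered = sorted(change_times)
    if len(ordered) < 2:
        return None
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


def recrawl_interval(change_times, default=timedelta(weeks=1),
                     minimum=timedelta(hours=1)):
    """Re-crawl about as often as the page changes, clamped to [minimum, default]."""
    mean_gap = mean_update_interval(change_times)
    if mean_gap is None:
        return default
    return max(minimum, min(mean_gap, default))


changes = [datetime(2020, 3, 1), datetime(2020, 3, 3), datetime(2020, 3, 7)]
print(recrawl_interval(changes))
```

Clamping to the default keeps rarely changing pages on the weekly refresh, and the lower bound avoids hammering pages that change constantly.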
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
								We might also choose to support a `robots.txt` file that gives webmasters control over which pages are crawled and how frequently.
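
Python's standard library ships a parser for this file, so a crawler written in Python could consult it before fetching a page. The sketch below parses an inline example instead of fetching a real file; the user agent name `ExampleCrawler` and the file contents are made up. Note that `Crawl-delay` is a common but non-standard directive, so not every site provides it.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content; a real crawler would fetch it from the site.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

allowed = parser.can_fetch("ExampleCrawler", "https://example.com/index.html")
blocked = parser.can_fetch("ExampleCrawler", "https://example.com/private/data")
delay = parser.crawl_delay("ExampleCrawler")   # seconds between requests, or None
```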
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
								### Use case: User inputs a search term and sees a list of relevant pages with titles and snippets
 
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
								*  The **Client** sends a request to the **Web Server**, running as a [reverse proxy](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								*  The **Web Server**  forwards the request to the **Query API**  server
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								*  The **Query API**  server does the following:
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								    *  Parses the query
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								        *  Removes markup
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								        *  Breaks up the text into terms
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								        *  Fixes typos
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								        *  Normalizes capitalization
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								        *  Converts the query to use boolean operations
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								    *  Uses the **Reverse Index Service**  to find documents matching the query
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								        *  The **Reverse Index Service**  ranks the matching results and returns the top ones
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								    *  Uses the **Document Service**  to return titles and snippets
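
The query-parsing steps listed above can be sketched as a small function. This is an illustration only: `parse_query` and `to_boolean_query` are hypothetical names, typo correction is left out because it needs a dictionary or language model, and joining every term with `AND` is just one possible boolean interpretation.

```python
import re


def parse_query(raw_query):
    """Return a list of normalized terms extracted from a raw query string."""
    text = re.sub(r"<[^>]+>", " ", raw_query)    # remove markup
    text = text.lower()                          # normalize capitalization
    return re.findall(r"[a-z0-9]+", text)        # break the text into terms


def to_boolean_query(terms):
    """Join terms with AND; a real engine would build a query tree instead."""
    return " AND ".join(terms)


terms = parse_query("<b>Hello</b>, World!")
print(to_boolean_query(terms))
```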
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
								We'll use a public [**REST API**](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest):
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								```
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								$ curl "https://search.com/api/v1/search?query=hello+world"
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								```
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
								Response:
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								```
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								[
								    {
								        "title": "foo's title",
								        "snippet": "foo's snippet",
								        "link": "https://foo.com"
								    },
								    {
								        "title": "bar's title",
								        "snippet": "bar's snippet",
								        "link": "https://bar.com"
								    },
								    {
								        "title": "baz's title",
								        "snippet": "baz's snippet",
								        "link": "https://baz.com"
								    }
								]
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								```
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
								For internal communications, we could use [Remote Procedure Calls](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc).
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
								## Step 4: Scale the design
 
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
								>  Identify and address bottlenecks, given the constraints.
 
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
								**Important: Do not simply jump right into the final design from the initial design!**
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
								State that you would 1) **Benchmark/Load Test**, 2) **Profile** for bottlenecks, 3) address bottlenecks while evaluating alternatives and trade-offs, and 4) repeat.  See [Design a system that scales to millions of users on AWS](../scaling_aws/README.md) for an example of how to iteratively scale the initial design.
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
								It's important to discuss what bottlenecks you might encounter with the initial design and how you might address each of them.  For example, what issues are addressed by adding a **Load Balancer** with multiple **Web Servers**?  **CDN**?  **Master-Slave Replicas**?  What are the alternatives and **Trade-Offs** for each?
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
								We'll introduce some components to complete the design and to address scalability issues.  Internal load balancers are not shown to reduce clutter.
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
*To avoid repeating discussions*, refer to the following [system design topics](https://github.com/donnemartin/system-design-primer#index-of-system-design-topics) for main talking points, tradeoffs, and alternatives:
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
* [DNS](https://github.com/donnemartin/system-design-primer#domain-name-system)
* [Load balancer](https://github.com/donnemartin/system-design-primer#load-balancer)
* [Horizontal scaling](https://github.com/donnemartin/system-design-primer#horizontal-scaling)
* [Web server (reverse proxy)](https://github.com/donnemartin/system-design-primer#reverse-proxy-web-server)
* [API server (application layer)](https://github.com/donnemartin/system-design-primer#application-layer)
* [Cache](https://github.com/donnemartin/system-design-primer#cache)
* [NoSQL](https://github.com/donnemartin/system-design-primer#nosql)
* [Consistency patterns](https://github.com/donnemartin/system-design-primer#consistency-patterns)
* [Availability patterns](https://github.com/donnemartin/system-design-primer#availability-patterns)
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
Some searches are very popular, while others are only executed once.  Popular queries can be served from a **Memory Cache** such as Redis or Memcached to reduce response times and to avoid overloading the **Reverse Index Service** and **Document Service**.  The **Memory Cache** is also useful for handling unevenly distributed traffic and traffic spikes.  Reading 1 MB sequentially from memory takes about 250 microseconds, while reading from SSD takes 4x and from disk takes 80x longer.<sup><a href=https://github.com/donnemartin/system-design-primer#latency-numbers-every-programmer-should-know>1</a></sup>
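
As a rough illustration of serving popular queries from memory, the sketch below uses a small in-process LRU cache as a stand-in for Redis or Memcached. The names `LRUCache`, `cached_lookup`, and the `backend` callable are hypothetical and chosen only for this example; a real deployment would talk to an external cache and to the **Reverse Index Service** and **Document Service** instead.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable, Optional


class LRUCache:
    """Tiny in-process stand-in for a Memory Cache (assumption: not Redis itself)."""

    def __init__(self, max_size: int) -> None:
        self.max_size = max_size
        self._entries: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def set(self, key: Hashable, value: Any) -> None:
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict the least recently used entry


def cached_lookup(cache: LRUCache, query: str, backend: Callable[[str], Any]) -> Any:
    """Cache-aside read: try the cache first, fall back to the backend on a miss.

    Note: a backend result of None is treated as a miss and is looked up again.
    """
    result = cache.get(query)
    if result is None:
        result = backend(query)  # hypothetical call into the slower services
        cache.set(query, result)
    return result
```

With this shape, a repeated popular query reaches the backend only once until its entry is evicted, which is the effect the paragraph above relies on.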
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
Below are a few other optimizations to the **Crawler Service**:
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
* To handle the data size and request load, the **Reverse Index Service** and **Document Service** will likely need to make heavy use of sharding and federation.
* DNS lookup can be a bottleneck; the **Crawler Service** can keep its own DNS lookup cache that is refreshed periodically
* The **Crawler Service** can improve performance and reduce memory usage by keeping many connections open at a time and reusing them, referred to as [connection pooling](https://en.wikipedia.org/wiki/Connection_pool)
    * Switching to [UDP](https://github.com/donnemartin/system-design-primer#user-datagram-protocol-udp) could also boost performance
* Web crawling is bandwidth intensive; ensure there is enough bandwidth to sustain high throughput
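
The DNS point above can be sketched as a small cache with a time-to-live. This is a minimal illustration, not the crawler's actual implementation: the class name `DnsCache`, the 300-second default, and the lazy refresh-on-expiry policy are all assumptions made for the example (the text only says the cache is refreshed periodically). The resolver and clock are injectable so the behavior can be exercised without network access.

```python
import socket
import time
from typing import Callable, Dict, Tuple


class DnsCache:
    """Caches hostname -> address results and re-resolves them after a TTL."""

    def __init__(
        self,
        ttl_seconds: float = 300.0,  # assumed value, tune per workload
        resolver: Callable[[str], str] = socket.gethostbyname,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._resolver = resolver
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}  # host -> (address, expiry)

    def resolve(self, hostname: str) -> str:
        now = self._clock()
        entry = self._entries.get(hostname)
        if entry is not None and entry[1] > now:
            return entry[0]  # still fresh, skip the DNS round trip
        address = self._resolver(hostname)  # expired or unseen: look it up again
        self._entries[hostname] = (address, now + self._ttl)
        return address
```

A crawler worker would call `DnsCache().resolve(host)` before opening a connection, so repeated fetches from the same host within the TTL avoid a resolver round trip.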
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
## Additional talking points

> Additional topics to dive into, depending on the problem scope and time remaining.
 
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
### SQL scaling patterns

* [Read replicas](https://github.com/donnemartin/system-design-primer#master-slave-replication)
* [Federation](https://github.com/donnemartin/system-design-primer#federation)
* [Sharding](https://github.com/donnemartin/system-design-primer#sharding)
* [Denormalization](https://github.com/donnemartin/system-design-primer#denormalization)
* [SQL Tuning](https://github.com/donnemartin/system-design-primer#sql-tuning)
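
To make the sharding item concrete, here is a minimal sketch of hash-based shard routing. The function name `shard_for` and the choice of MD5 as the hash are assumptions for illustration only; the linked material does not prescribe a particular scheme.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key (for example a URL or page signature) to a shard index.

    A stable hash is used instead of Python's built-in hash(), which is
    randomized per process for strings and so would not route consistently.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Plain modulo routing remaps most keys when `num_shards` changes, which is one of the trade-offs worth raising when discussing sharding.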
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
							 
							
								 
							 
							
							
								
							 
						 
					
						
							
								
							 
							
								
							 
							
								 
							 
							
							
#### NoSQL

* [Key-value store](https://github.com/donnemartin/system-design-primer#key-value-store)
* [Document store](https://github.com/donnemartin/system-design-primer#document-store)
* [Wide column store](https://github.com/donnemartin/system-design-primer#wide-column-store)
* [Graph database](https://github.com/donnemartin/system-design-primer#graph-database)
* [SQL vs NoSQL](https://github.com/donnemartin/system-design-primer#sql-or-nosql)
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
### Caching

* Where to cache
    * [Client caching](https://github.com/donnemartin/system-design-primer#client-caching)
    * [CDN caching](https://github.com/donnemartin/system-design-primer#cdn-caching)
    * [Web server caching](https://github.com/donnemartin/system-design-primer#web-server-caching)
    * [Database caching](https://github.com/donnemartin/system-design-primer#database-caching)
    * [Application caching](https://github.com/donnemartin/system-design-primer#application-caching)
* What to cache
    * [Caching at the database query level](https://github.com/donnemartin/system-design-primer#caching-at-the-database-query-level)
    * [Caching at the object level](https://github.com/donnemartin/system-design-primer#caching-at-the-object-level)
* When to update the cache
    * [Cache-aside](https://github.com/donnemartin/system-design-primer#cache-aside)
    * [Write-through](https://github.com/donnemartin/system-design-primer#write-through)
    * [Write-behind (write-back)](https://github.com/donnemartin/system-design-primer#write-behind-write-back)
    * [Refresh ahead](https://github.com/donnemartin/system-design-primer#refresh-ahead)
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
### Asynchronism and microservices

* [Message queues](https://github.com/donnemartin/system-design-primer#message-queues)
* [Task queues](https://github.com/donnemartin/system-design-primer#task-queues)
* [Back pressure](https://github.com/donnemartin/system-design-primer#back-pressure)
* [Microservices](https://github.com/donnemartin/system-design-primer#microservices)
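
As one way to picture back pressure from the list above, the sketch below uses a bounded in-process queue that rejects new work once it is full. The helper name `try_enqueue` is hypothetical, and a production crawler would more likely sit behind a dedicated message or task queue; this only shows the principle of refusing work instead of buffering without limit.

```python
import queue
from typing import Any


def try_enqueue(work_queue: "queue.Queue[Any]", item: Any) -> bool:
    """Try to add an item without blocking; report whether it was accepted.

    A False result is the back-pressure signal: the caller can retry later
    or tell its own client that the service is busy.
    """
    try:
        work_queue.put_nowait(item)
        return True
    except queue.Full:
        return False
```

For example, with `queue.Queue(maxsize=2)` the third call is rejected until a consumer takes an item off the queue.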
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
### Communications

* Discuss tradeoffs:
    * External communication with clients - [HTTP APIs following REST](https://github.com/donnemartin/system-design-primer#representational-state-transfer-rest)
    * Internal communications - [RPC](https://github.com/donnemartin/system-design-primer#remote-procedure-call-rpc)
* [Service discovery](https://github.com/donnemartin/system-design-primer#service-discovery)
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
### Security

Refer to the [security section](https://github.com/donnemartin/system-design-primer#security).
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
### Latency numbers

See [Latency numbers every programmer should know](https://github.com/donnemartin/system-design-primer#latency-numbers-every-programmer-should-know).
							 
						 
					
						
							
								
									
										
										
										
										
									 
								 
							 
							
								
									
										 
									 
								
							 
							
								 
							 
							
							
### Ongoing

* Continue benchmarking and monitoring your system to address bottlenecks as they come up
* Scaling is an iterative process