*Note: This document links directly to relevant areas found in the [system design topics](https://github.com/ido777/system-design-primer-update.git#index-of-system-design-topics) to avoid duplication. Refer to the linked content for general talking points, tradeoffs, and alternatives.*
[Design A URL shortener](../url_shortener/README.md) - e.g. [TinyURL](https://tinyurl.com/), [bit.ly](https://bit.ly/) - is a related question. The key difference is that Pastebin requires storing the paste contents rather than just the original unshortened URL. The URL shortener question focuses on shortlink generation and redirection, while the Pastebin question focuses on the storage and retrieval of the paste contents.
## Step 1: Investigate the problem, use cases, and constraints, and establish the design scope
> Gather main functional requirements and scope the problem.
> Ask questions to clarify use cases and constraints.
> Discuss assumptions.
Asking clarifying questions is the first step in the process.
Remember your goal is to understand the problem and establish the design scope.
### What questions should you ask to clarify the problem?
Here is an example of the dialog you could have with the **Interviewer**:
### 🔍 Example Dialogue (Pastebin)
**Interviewer**: Design Pastebin.com.
**Candidate**: To clarify and understand the scope, may I start with a few quick questions?
**Interviewer**: Yes, please.
**Candidate**: What are the **main features, users, and use cases** of the system?
**Interviewer**: Pastebin is a simple site for sharing plain text. Users paste content and get a short link to share. Pastebin serves anonymous users as well as logged-in users.
**Candidate**: Great. So **can we scope the problem to 2 main flows**:
1. User creates a paste and gets a link to share.
2. User accesses a paste using the link.
**Interviewer**: Yes, that's a good start.
**Candidate**: What are the **other** important topics we need to consider for the basic **MVP** functionality?
**Interviewer**: Pastebin supports two types of users:
* Anonymous users - who can create and share content without an account
* Authenticated users - who can create content and customize their sharing links (e.g., custom URLs, expiration dates, access controls).
**Candidate**: Understood. For this phase, can we **focus** on Anonymous users?
**Interviewer**: You mean that we should ignore the requirements for authenticated users?
**Candidate**: No, let me **clarify**. I am suggesting, and **asking** for your confirmation, that we start with the main flows of the system to stay effective. We will keep the authenticated-user requirements in mind and **deal with them later on**.
**Interviewer**: Later on in the interview, or later on in the product life cycle?
**Candidate**: Both. Let me explain. Let's assume for now that we will use a REST API to write and read the content. We can have two different API endpoints for anonymous and authenticated users, with shared functions used by both. This way we can **focus** on the main flows of the system. On the other hand, we can talk about the authenticated-user requirements now, such as [OAuth 2.0](https://oauth.net/2/) or [JWT](https://jwt.io/).
**Interviewer**: Ok.
**Candidate**: So for now, is it ok to focus on the anonymous users?
**Interviewer**: Yes.
**Candidate**: What are the *other* important topics we need to consider? What about traffic assumptions / load?
**Interviewer**: 10M writes per month, 100M reads.
**Candidate**: Got it. High read-to-write ratio. Any *other* specific requirements, assumptions or constraints, data flows?
**Interviewer**: We do track monthly stats. Links can have optional expiration. Expired pastes are auto-deleted.
**Candidate**: Is there **anything more** we should discuss in terms of latency, availability, or other non-functional constraints?
**Interviewer**: Reads should be low-latency. High availability is expected.
**Candidate**: Cost efficiency, scaling, and security matter, but I **suggest** we digest those in the next phases.
**Interviewer**: Ok.
**Candidate**: Thanks, that's clear and helps scope my design. Let me summarize the scope as I understood it, along with the assumptions, to make sure we are on the **same page**.
**Interviewer**: Ok.
### 🔍 Example Breakdown
**Candidate**: OK, here is a reflection of what I understood from the requirements. I will write it down:
> ### Use cases
> * **User** enters a block of text and gets a randomly generated link
> * **User** enters a paste's url and views the contents
> * **User** is anonymous
>
> ### Background tasks
> * **Service** tracks analytics of pages
> * Monthly visit stats
> * **Service** deletes expired pastes
>
> ### Non functional requirements
> * **Service** has high availability
>
> ### Out of scope
> * All authenticated users features
> * Any other requirements
>
> ### Other considerations we need to think about (Later on)
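Before sketching components, some quick back-of-the-envelope math turns the interviewer's monthly numbers into the per-second rates used below (assuming traffic is spread evenly over a 30-day month):

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60                 # ~2.6 million seconds

writes_per_second = 10_000_000 / SECONDS_PER_MONTH    # ~4 paste writes per second
reads_per_second = 100_000_000 / SECONDS_PER_MONTH    # ~40 paste reads per second
pastes_in_3_years = 10_000_000 * 12 * 3               # 360 million shortlinks
```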
Rather than diving into implementation, this diagram tells a story:

* It reflects usage patterns (a 10:1 read/write ratio), which is why writes and reads are handled by different components.
* It separates latency-sensitive work from async processing: analytics is async, so it gets its own component.
* It shows readiness for growth without premature optimization: writes go through a load balancer, reads through a cache.
* It creates a solid skeleton that supports further discussion on reverse proxies, caching, sharding, CDN integration, or even queueing systems for analytics, all while staying grounded in the problem as scoped.
You should ask for feedback after you present the diagram, and use it to get buy-in and some initial ideas about which areas to dive into.
We could use a [relational database](https://github.com/ido777/system-design-primer-update.git#relational-database-management-system-rdbms) as a large hash table, mapping the generated url to a file server and path containing the paste file.
Instead of managing a file server, we could use a managed **Object Store** such as Amazon S3 or a [NoSQL document store](https://github.com/ido777/system-design-primer-update.git#document-store).
An alternative to a relational database acting as a large hash table, we could use a [NoSQL key-value store](https://github.com/ido777/system-design-primer-update.git#key-value-store). We should discuss the [tradeoffs between choosing SQL or NoSQL](https://github.com/ido777/system-design-primer-update.git#sql-or-nosql). The following discussion uses the relational database approach.
* The **Client** sends a create paste request to the **Web Server**, running as a [reverse proxy](https://github.com/ido777/system-design-primer-update.git#reverse-proxy-web-server)
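For illustration, a create-paste call might look like this from the client side (the endpoint path and JSON field names here are assumptions, not a confirmed API):

```python
import json
import urllib.request

payload = json.dumps({
    "expiration_length_in_minutes": 60,   # optional expiration
    "paste_contents": "Hello World!",
}).encode()

request = urllib.request.Request(
    "https://pastebin.com/api/v1/paste",  # hypothetical endpoint
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(request) as response:
    shortlink = json.loads(response.read())["shortlink"]  # e.g. "abc1234"
```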
Setting the primary key to be based on the `shortlink` column creates an [index](https://github.com/ido777/system-design-primer-update.git#use-good-indices) that the database uses to enforce uniqueness. We create an additional index on `created_at` so the database can locate pastes created within a time range without full table scans. Because indexes are typically implemented as B-trees, an index lookup is O(log n) instead of O(n). Frequently accessed indexes (such as those keyed by recent timestamps) are often cached automatically in RAM by the database's internal cache, and because indexes are much smaller than the table, they are likely to stay in memory. Reading 1 MB sequentially from memory takes about 250 microseconds; reading from SSD takes 4x longer, and from disk 80x longer.<sup><a href=https://github.com/ido777/system-design-primer-update.git#latency-numbers-every-programmer-should-know>1</a></sup>
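To make this concrete, here is a minimal SQLite sketch of such a table and its indices (the `paste_path` and `expiration_at` columns are assumptions for this example, not a confirmed schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pastes (
        shortlink     TEXT PRIMARY KEY,  -- primary key doubles as a unique index
        created_at    TEXT NOT NULL,     -- ISO 8601 timestamp
        expiration_at TEXT,              -- NULL means the paste never expires (assumed column)
        paste_path    TEXT NOT NULL      -- location of the contents on the file/object store (assumed column)
    );
    CREATE INDEX idx_pastes_created_at ON pastes (created_at);
""")

# The created_at index serves time-range lookups without a full table scan
recent = conn.execute(
    "SELECT shortlink FROM pastes WHERE created_at BETWEEN ? AND ?",
    ("2016-01-01T00:00:00", "2016-02-01T00:00:00"),
).fetchall()
```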
* Alternatively, we could take the MD5 hash of randomly generated data
* [**Base 62**](https://www.kerstner.at/2012/07/shortening-strings-using-base-62-encoding/) encode the MD5 hash
* Base 62 encodes to `[a-zA-Z0-9]`, which works well for URLs, eliminating the need to escape special characters
* There is only one hash result for the original input, and Base 62 is deterministic (no randomness involved)
* Base 64 is another popular encoding, but it poses issues for URLs because of the additional `+` and `/` characters
* The following [Base 62 pseudocode](http://stackoverflow.com/questions/742013/how-to-code-a-url-shortener) runs in O(k) time, where k is the number of digits (7 in our case). Here is a minimal Python sketch of the idea (the alphabet ordering is a convention, not a requirement):
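    ```python
    def base62_encode(num):
        """Encode a non-negative integer as a Base 62 string."""
        alphabet = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
        if num == 0:
            return alphabet[0]
        digits = []
        while num > 0:
            num, remainder = divmod(num, 62)
            digits.append(alphabet[remainder])
        return "".join(reversed(digits))  # most significant digit first
    ```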
* Take the first 7 characters of the output, which results in 62^7 possible values and should be sufficient to handle our constraint of 360 million shortlinks in 3 years:
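Putting the pieces together, a hypothetical `generate_shortlink` helper might look like this (using `os.urandom` as the source of random data is an assumption for illustration):

```python
import hashlib
import os

def generate_shortlink():
    """MD5-hash random data, Base 62 encode it, keep the first 7 characters."""
    digest = hashlib.md5(os.urandom(16)).hexdigest()
    return base62_encode(int(digest, 16))[:7]  # 62^7 ~ 3.5 trillion values
```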
While traditional MapReduce jobs are rarely written manually today, the underlying pattern — mapping, grouping, and reducing data — is still everywhere. For website analytics, we typically use SQL engines like BigQuery or Athena for batch queries, or streaming frameworks like Flink for real-time aggregation, depending on data freshness needs and scale.
For educational purposes and small local testing, we can simulate MapReduce logic using Python. This is **not how production systems work today**, but it is useful for **understanding the concepts** and explaining them to the interviewer.
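A minimal sketch of that simulation, assuming access-log lines of the form `<ISO timestamp> <paste url>` (the log format is an assumption for illustration):

```python
from collections import defaultdict

log_lines = [
    "2016-01-01T00:01:00 /abc1234",
    "2016-01-01T00:02:00 /abc1234",
    "2016-01-02T09:00:00 /zzz9876",
]

def map_phase(lines):
    """Map: emit ((year_month, url), 1) for every hit."""
    for line in lines:
        timestamp, url = line.split()
        yield (timestamp[:7], url), 1  # "2016-01-01T..." -> "2016-01"

def reduce_phase(pairs):
    """Reduce: group by key and sum the per-hit counts."""
    counts = defaultdict(int)
    for key, count in pairs:
        counts[key] += count
    return dict(counts)

monthly_hits = reduce_phase(map_phase(log_lines))
# {("2016-01", "/abc1234"): 2, ("2016-01", "/zzz9876"): 1}
```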
To delete expired pastes, we could simply scan the **SQL Database** for all entries whose expiration timestamp is older than the current timestamp. All expired entries would then be deleted (or marked as expired) in the table.
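With the SQLite table sketched earlier, the cleanup job could be as simple as the following (the `expiration_at` column and ISO 8601 timestamps are assumptions carried over from that sketch):

```python
import sqlite3
from datetime import datetime, timezone

def delete_expired_pastes(conn: sqlite3.Connection) -> int:
    """Delete every paste whose expiration timestamp has passed."""
    now = datetime.now(timezone.utc).isoformat()
    cursor = conn.execute(
        "DELETE FROM pastes WHERE expiration_at IS NOT NULL AND expiration_at < ?",
        (now,),
    )
    conn.commit()
    return cursor.rowcount  # number of pastes removed
```

In production we would run this on a schedule and delete in batches to avoid long locks.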
To summarize, we've designed a text snippet sharing system that meets the core requirements. We've discussed the high-level design, identified potential bottlenecks, and proposed solutions to address scalability issues. Now it is time to align again with the interviewer's expectations.
See if she has any feedback or questions, suggest next steps, improvements, error handling, and monitoring if appropriate.
*To avoid repeating discussions*, refer to the following [system design topics](https://github.com/ido777/system-design-primer-update.git#index-of-system-design-topics) for main talking points, tradeoffs, and alternatives:
To address the 40 *average* read requests per second (higher at peak), traffic for popular content should be handled by the **Memory Cache** instead of the database. The **Memory Cache** is also useful for handling the unevenly distributed traffic and traffic spikes. The **SQL Read Replicas** should be able to handle the cache misses, as long as the replicas are not bogged down with replicating writes.
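One way to realize this is a cache-aside read path; here is a minimal sketch (the `cache` and `db` objects are hypothetical stand-ins for a Memcached/Redis client and a read-replica connection):

```python
CACHE_TTL_SECONDS = 60 * 60  # keep hot pastes cached for an hour (assumed TTL)

def read_paste(shortlink, cache, db):
    """Cache-aside read: serve popular pastes from the Memory Cache,
    falling back to a SQL read replica on a miss."""
    contents = cache.get(shortlink)
    if contents is None:
        contents = db.fetch_paste(shortlink)  # hypothetical replica query
        if contents is not None:
            cache.set(shortlink, contents, CACHE_TTL_SECONDS)
    return contents
```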
4 *average* paste writes per second (higher at peak) should be doable for a single **SQL Write Master-Slave**. Otherwise, we'll need to employ additional SQL scaling patterns:
* External communication with clients - [HTTP APIs following REST](https://github.com/ido777/system-design-primer-update.git#representational-state-transfer-rest)
See [Latency numbers every programmer should know](https://github.com/ido777/system-design-primer-update.git#latency-numbers-every-programmer-should-know).