The goal of the project is to design a Url shortener service like bit.ly or tinyurl.com that is realtime scalable.
1. Url Shortener
Url Shortener is a service that creates a short alias URL for a corresponding long URL. So, now whenever the user visits the short URL, he will be redirected to the original URL.
Example: Let the original URL be: Long URL: https://intmain.co/horizontal-scaling-and-vertical-scaling/ Short URL is: https://tinyurl.com/tepyk8t Whenever you will click the second short url, you will automatically be redirected to the page referred by the long URL. Try Now.
2. Design Goals
The URL shortening service should have the following features:
- Mandatory Features
- Given a long URL it should be able to generate Unique Short URL.
- Given a short URL it should redirect to the original URL.
- The URL should expire after a timespan.
- Optional Features
- The service should be REST API accessible.
- It should provide analytics features i.e, how many times the URL is visited.
- A user should be able to pick a custom URL.
3. Capacity Estimation
The important thing to note is that the number of reading requests will be 1000 times more than the number of write requests.
Suppose, we are going to receive 1B(1 Billion) new URL’s shorting requests per month, with a 100:1 read/write ratio. So, the total number of redirections:
100*1B = 100B redirections
If we are going to store, each URL(long and corresponding short URL) for almost 10 years, then the numbers of URLs we are going to store are:
1B*120 = 120B URLs
Now if on average every URL is going to be of 200 bytes, then total storage required:
200B*120byte = 24TB storages 😱
4. Algorithm REST Endpoints
Let’s starts by making two functions accessible through REST API:
- create_shortURL( long_url, api_key, custom_url, expiry_date)
- long_url: A long URL that needs to be shortened.
- api_key: A unique API key provided to each user, to protect from the spammers, access, and resource control for the user, etc.
- custom_url(optional): The custom short link URL, user want to use.
- expiry_date(optional): The date after which the short URL becomes invalid.
- Return Value: The short Url generated, or error code in case of the inappropriate parameter.
- redirect_shortURL( short_url)
- short_url: The short URL generated from the above function.
- Return Value: The original long URL, or invalid URL error code.
5. Shortening Algorithm
The URL that we have to generate should be a short URL. But how much short? Also, what type of encoding should be done?🤔
URL Encoding – The encoding can be Base36( characters allowed [a-z][0-9]) or Base62( characters allowed [A-Z][a-z][0-9]) or Base64(characters allowed [A-Z][a-z][0-9],’+’,’/’). Since all of the characters in Base64 are URL safe characters, we will choose Base64 as our encoding technique.
Url Length – Using Base64 encoding if we choose the
URL with length 6, will give 64^6 = ~68.7 Billion URLs URL with length 7, will give 64^7 = ~5 Trillion URLs URL with length 8, will give 64^8 = ~281 Trillion URLs
We have already estimated that we would be storing 120B URLs, we can safely choose the short URL to be 7 characters long.
Now, to generate a unique short URL, we can calculate the MD5 hash of the long URL, which will produce a 128-bit hash value. Now when we encode the MD5 result to Base64 encoding the resultant string will be 22 characters long.
(MD5 result = 32character output = 32*4bit = 128 bit output. Base64 encoding will use 6 bit to represent each character, MD5 -> Base64 give (128/6) ~ 22 character output)
For choosing the short URL, first, we can randomly swap some character of the Base64 encoding result and then pick any 7 characters randomly from the result.
Potential Issues – Using the above method we can come across 2 potential issues:
- If multiple users enter the same Long URL, in the worst case they can get the same short URL.
- If the user enters the URL-encoded URL i.e, https://intmain.co/system-design?rel=source (UTF-8 format) and https%3A%2F%2Fintmain.co%2Fsystem-design%3Frel%3Dsource (URL-encoded format). Both URLs are the same but encoding is different. Check here.
Let’s solve the first issue:
We can append the user IP address to a the long URL and then do the shortening porcess, in this way we won’t get redundency. Correct?🤔
No, two users can have same public IP addresses if they are connected via same router.
One solution is to add user-id or API key to the long URL and then do the shortening. This will work fine, but user have to be logged in to create short URL. Let’s stick to this for now.
Let’s solve the second issue:
For every request recieved we will use UTF-8 encoding only. If a URL is URL-encoded, we will first convert it into UTF-8 format. For this we can have our own service or we can use some 3rd party services.
6. Database Design & Choice
Let’s see the data we need to store:
Data Related to user
- User ID: A unique user id or API key to make user globally distinguishable.
- Name: The name of the user.
- Email: The email id of the user
- Password: Password of the user to facilitate login feature.
- Creation Date: The date on which the user was registered.
Data Related to ShortLink
- Short Url: 7 character long unique short URL.
- Original Url: The original long URL.
- UserId: The unique user id or API key of the user who created the short URL.
- Expiration Date: The date after which this short URL should become invalid.
Data Storage Observation
- We will be storing Billions of links.
- The short link redirection should be fast.
- Our service is going to be read heavily.
- The only relation that is going to exist is which user created which URL and that too is going to accessed very less.
We have two different choices of databases: 1) Relational Databases(MySQL) 2) NoSQL Databases( Cassandra).
In general, Relational Databases are good if we have lots of complex queires involving joins, but they are slow. NoSQL databases are pathetic at handling the relationship queries but they are faster.
Now, we don’t really need lots of relationship among data, but we do need the fast read and write speed. Hence we will choose NoSQL Database. The key for each row can be the shorturl, because it is going to be globally unique.
Database Sharding
To scale out our database, we need to partition it into several machines or nodes, so that it can store information about billions of URLs. Hence now we can store more data in memory because of more machines or nodes. For database sharding we will use hash based partition technique.
In this technique, we will find the hash of the shorturl we are going to store, and determine the machine/shard in which we are going to store this particular URL. The hash function will randomly distribute the URLs into different partation or shard. We can decide the number of shards we are going to make and then we can choose appropriate hash function that random number representing the partation/shard number.( Ex if we have chosen 512 shards, then hash function will generate a number between [1-512] or [0-511] ).
7. Speeding Up the Read Operation
We know that our database is going to be read heavily. Till now we have find the way to speed up the writing process, but the reads are still slow. So we have to find some way to speed up the reading process.
Caching is the solution. We can cache the URLs that are going to be accessed frequently. For example, a url that appears on the trending page of any social networking website. Hence many people are going to visit the url. We can use the caching servies like memcached.
Things we need to consider after adding the caching layer:
- When the cache is full, we need to replace the URLs in cache with the treanding ones. For this we will use LRU(Least Recently Used) policy. The URL in the cace which have been refered the least number of times will be removed.
- Synchronizing the cache with the original URL. If the user updates or deletes the original link, the corresponding changes has to be reflected in the cache too.
- We can shared the cache too. This will help us store more data in memory because of the more machines. For deciding which thing go to which shard can be done using “Hashing” or “Consistant Hashing”.
Now this is now we have speed up our read and write request, but still our system is prone to network bandwidth bottleneck and single point of failure.
8. Load Balancing
To overcome the problem of limited network bandwidth and single point of failue in url shortener service, we will use Load Balancers. Load Balancer does its magic by diving the traffic among a group of server thus resulting in improved response and availability of a website or application. Read more here…
To distriubte the load among server we will use the Least Bandwidth Method. This algorithm will choose the server currently serving the least amount of traffic, measured in megabits per second (Mbps).
We can place the Load Balancers between:
- The client and the server.
- The server and the database.
This is how we will design a highly scalable URL Shortener that can work in realtime. 👨💻
Do you know System Design is a very important topic in product based company interviews and almost every tech giant asks it? At Nlogn we have a dedicated section for system design to help you prepare. Read here.
Helpful information. Fortunate me I found your web site by chance,
and I am stunned why this twist of fate did not happen earlier!
I bookmarked it.
Indeed, simple and impressive sharing. These little things will definitely help you out, in your future.
Here is a tool, I would like to recommend, you also check it out
https://url-decode.com/
There’s definitely a great deal to learn about this issue.
I like all of the points you made.
This web site really has all the info I wanted concerning
this subject and didn’t know who to ask.
how to meet the custom URL requirement?
With billions of requests, how will you avoid generation of duplicate short urls? It may not be efficient to check if an url exists in db before adding due to concurrent writes.
7. Speeding Up the Read Operation
how to use LRU when using memcached.?