Rabin-Karp is a pattern-matching algorithm that works by calculating the hash of the pattern to be searched(say Length M) and the hash of M characters from the given text. If the hash values are the same, then it matches the individual M-character sequence. If the hash values are not equal then it will calculate the hash value for the next M-character sequence of a given text.
Hence the Rabin-Karp algorithm involves the calculation of the hash value for –
- Pattern (to be searched) of Length M
- All substrings of the given text of Length M.
Naive Approach
Let’s say we are given a text of length N. The naive approach would be to calculate the hash of 0 to M characters (hash(x)=hash(text[0……M]) and if the match doesn’t occur, we will calculate the next M characters ranging from 1 to M+1 and so on.
Naive_Rabin_Karp(text, pattern){
n = length(text)
m = length(pattern)
//iterative algorithm to calculate hash in O(M) time
hash_patt = hash(pattern, 0, m);
for(i=0; i<N; i++){
hash_text = hash(text, i, m+i)
if(hash_text == hash_patt)
{
for(j=0; i<M; j++){
if(text[j+i] != pattern[i])
break;
}
if(j == M)
print("Pattern found at" +i)
}
}
}
Optimized Approach
We can choose the hash function in such a way that rehashing a string will take only O(1) time, but calculating hash for the first substring will still take O(M) time.
The hash function we will use will be the rolling hash. We will first start by calculating the integer value for each character. Say we have a string containing combinations of characters in range (a-z) or (A-Z), we can assign each character is converted to value equivalent to its position in range 1 to 26, such a way that a = 1, b=2, c=3 …… z = 26. Example – “aab” = 112, “bde”= 245 etc.
Next, we will choose a number to multiply each character in a way explained below –
Hash(“c1c2…cm“) = (cm*pm-1 + cm-1*pm-2 + ……………c1*pm-1)
Since the hash value generated will be very big, so we can perform a modulus with a prime number to get the final result in a given range. The prime number chosen for modulus should not be very small as it will be repeated for lots of text substrings and it should not be too big that the purpose of using modulus is lost.
Hash(“c1c2…cm“) = (cm*pm-1 + cm-1*pm-2 + ……………c1*pm-1 ) mod q
Now, have chosen a hash function calculating first hash will take O(M) time but the rehashing will take only O(1) time.
/*note here he modulo q has been excluded readability and simplicity, but the calculations still hold when modulo q will be present */ Example -> text = "abcde", M = 3, p = 26 hash(abc) = c*26²+ b*26¹+ a*26° = 3*26²+ 2*26¹ + 1*26° = 2028 + 52 + 1 = 2081 hash(bcd) = hash(d)*26² + (hash(abc) - hash(a))/26 = 4*26² +( 3*26²+ 2*26¹ + 1 - 1)/26 = 4*26² + 3*26¹ + 2 = 2784 hash(cde) = hash(e)*26² + (hash(bcd) - hash(b))/26 = 5*26² +( 2784 - 2)/26 = 4*26² + 107 = 2811
Let’s see how this optimized algorithm can be implemented.
Implementation of the above algorithm
//Rabin-Karp Algorithm implementation in CPP #include <bits/stdc++.h> using namespace std; #define p 26 void Rabin_Karp(char pat[], char txt[], int q) { int M = strlen(pat); int N = strlen(txt); int patt_hash = 0; //hash value of pattern int text_hash = 0; //hash value of text int hash = 1; for(int i=0; i<M-1; i++) hash = (hash*p) % q; for (int i = 0; i < M; i++) { patt_hash = (p * patt_hash + pat[i]) % q; text_hash = (p * text_hash + txt[i]) % q; } for (int i=0; i<=N-M; i++) { if (patt_hash == text_hash) { int j; for (j = 0; j < M; j++) { if (txt[i+j] != pat[j]) break; } if (j == M) cout<<"Pattern found at index "<< i<<endl; } if ( i < N-M ) { text_hash = (p*(text_hash - txt[i]*hash) + txt[i+M])%q; if (text_hash < 0) text_hash = (text_hash + q); } } } int main() { char text[] = "nlognbestportalforcsisnlogn"; char pattern[] = "nlogn"; int q = 99; Rabin_Karp(pattern, text, q); return 0; }
Time complexity
The average and best-case time complexity of the Rabin-Karp algorithm are O(N+M), where M is the length of pattern we have to search, and N is the length of the given text. But the worst-case time complexity is O(NM). The worst case of the Rabin-Karp algorithm occurs when all characters of pattern and text are the same as the hash values of all the substrings of text matches with the hash value of pattern.
For example pat[] = “AAA” and txt[] = “AAAAAAA”.