Background:
Problem:
Consider a very large collection of documents, say web pages crawled from the entire internet. The problem is to determine the frequency (i.e., total number of occurrences) of each word in this collection. Thus, if there are n documents and m distinct words, we wish to determine m frequencies, one for each word.
Method followed:
Now we compare two approaches to compute these frequencies in parallel using p processors:
(a) let each processor compute the frequencies for m/p words and
(b) let each processor compute the frequencies of m words across n/p documents, followed by all the processors summing their results.
Assumptions:
We assume a distributed-memory model with a shared disk, so that each processor can access any document from disk in parallel with no contention. Further, we assume that the time c spent reading a word from a document is the same as the time spent sending it to another processor via interprocessor communication. On the other hand, the time to add to a running total of frequencies is negligible compared to the time spent on a disk read or on interprocessor communication, so we ignore the time taken for arithmetic additions in our analysis. Finally, assume that each word occurs f times in a document, on average.
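To keep this cost model straight in the calculations below, here is a minimal Python sketch that just encodes the assumptions (the numeric value of c is made up for illustration; the book keeps it symbolic):

```python
# Cost model from the assumptions above (illustrative value only).
C = 1e-6   # time to read one word from disk == time to send one word (assumption)

def read_time(num_words: int) -> float:
    """Time to read num_words words from the shared disk."""
    return num_words * C

def send_time(num_words: int) -> float:
    """Time to send num_words words to another processor (same cost c as a read)."""
    return num_words * C

def add_time(num_additions: int) -> float:
    """Additions to running totals are treated as free in this analysis."""
    return 0.0
```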
Solution:
Let us take 4 documents:
D1: apple is good.
D2: doctor likes apple.
D3: apple keeps doctor away.
D4: doctor sorry apple.
So there are 4 documents, giving n = 4.
There are 8 distinct words (apple, is, good, doctor, likes, keeps, away, sorry), giving m = 8.
Assume one processor per distinct word, so p = 8.
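To make the two approaches concrete before doing the timing analysis, here is a rough Python sketch of how they would split the work on these four documents. The partitioning scheme (round-robin slices) and the detail that approach (a) scans every document for its assigned words are my own reading of the description, not code from the book:

```python
from collections import Counter

docs = [
    "apple is good",
    "doctor likes apple",
    "apple keeps doctor away",
    "doctor sorry apple",
]
words = sorted({w for d in docs for w in d.split()})  # the m = 8 distinct words
p = len(words)                                        # one processor per word

# Approach (a): each processor counts only its m/p assigned words,
# scanning every document to find them.
def approach_a(proc_id):
    assigned = set(words[proc_id::p])   # this processor's share of the words
    counts = Counter()
    for d in docs:                      # every processor reads all n documents
        for w in d.split():
            if w in assigned:
                counts[w] += 1
    return counts

# Approach (b): each processor counts all m words, but only in its n/p documents;
# the per-processor partial counts are then summed.
def approach_b(proc_id):
    my_docs = docs[proc_id::p]          # this processor's share of the documents
    counts = Counter()
    for d in my_docs:
        counts.update(d.split())
    return counts

totals_a = sum((approach_a(i) for i in range(p)), Counter())
totals_b = sum((approach_b(i) for i in range(p)), Counter())
assert totals_a == totals_b             # both approaches give the same frequencies
print(totals_b)
```

Both approaches produce the same frequencies; the question is only how much reading and communication time each one costs.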
What is the serial time for computing the frequency of each word?
My answer:
Serial time = n documents × m words × f words/document × c read_time/word
= nmfc, with units of words × read_time(?)
Why are the units like this? I think the result should just have units of read_time (i.e., time).
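Here is how I would sanity-check that arithmetic numerically (a sketch only; the values of f and c are placeholders just to exercise the formula, not measured from the four toy documents, and the unit bookkeeping in the comments is my own reading of the assumptions):

```python
n = 4       # documents
m = 8       # distinct words
f = 10      # average occurrences of each distinct word per document (made-up value)
c = 1e-6    # time to read one word from disk (made-up value)

total_occurrences = n * m * f            # every word occurrence read from disk
serial_time = total_occurrences * c      # occurrences * (time per occurrence) = time

print(serial_time)                       # n*m*f*c
```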
Book's answer:
Now, to calculate efficiency, we also need to calculate the parallel time for case (a).
My answer:
Efficiency = serial time / (p × parallel time)
= nmfc / (p × parallel time)
Parallel time = ?
= (m/p) words × n documents × c read_time/word × f words/document
= nmfc/p (again with units of words × read_time)
So efficiency = nmfc / (p × nmfc/p) = 1 .......... (WRONG as per the book)
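Restating my own case (a) calculation as code, to make sure I have the formula right (this reproduces my numbers above, which the book marks as wrong, not the book's analysis; f and c are the same made-up values as before):

```python
def efficiency(t_serial, t_parallel, p):
    """Standard definition: E = T_serial / (p * T_parallel)."""
    return t_serial / (p * t_parallel)

n, m, f, c, p = 4, 8, 10, 1e-6, 8       # made-up f and c, one processor per word

t_serial = n * m * f * c                # serial time from above
t_parallel_a = (m / p) * n * f * c      # my parallel time for case (a)

print(efficiency(t_serial, t_parallel_a, p))   # -> 1.0, which the book says is wrong
```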
Book's answer:
I don't understand anything after this point; it all went over my head.
Why are we adding two different terms here (instead of just nmfc, unlike case (a) above)? I don't get it.
Determining whether it is scalable or not is easy: you can just apply the guidelines given earlier and read off the answer.