Explain how MapReduce frameworks like Hadoop exploit locality to achieve scalable parallelism.
MapReduce is a general framework for writing applications that process large amounts of data in parallel. It is a distributed programming model. Hadoop's implementation is written in Java, but Hadoop can also run MapReduce programs written in other languages such as Ruby, Python, and C++.
The MapReduce model includes two data processing primitives, called mappers and reducers. Its main advantage is that it allows data processing to be spread over multiple computing nodes.
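To make the two primitives concrete, here is a minimal single-process sketch of a word count in plain Python. It is not Hadoop code: the function names (map_words, shuffle, reduce_counts, word_count) are made up for illustration, and the grouping step that a real framework performs across nodes is simulated with an in-memory dictionary.

```python
from collections import defaultdict


def map_words(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, counts):
    """Reducer: combine all counts for one word into a single pair."""
    return (word, sum(counts))


def word_count(lines):
    # In a real cluster each line (or block of lines) could be mapped on a
    # different node; here everything runs sequentially in one process.
    mapped = [pair for line in lines for pair in map_words(line)]
    grouped = shuffle(mapped)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog", "The end"]))
```

Because each call to map_words depends only on its own line, the map step can be split across as many nodes as there are input pieces. Only the grouping step needs data to move between them.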
How does MapReduce work?
1. It divides the work into two tasks:
a) Map function - It breaks a set of data down into
tuples (key/value pairs), converting it into another set of data.
b) Reduce function - This task is always performed after the map has
completed its work. It takes the map's output as its input and combines
those tuples into a smaller set of tuples.
2. The execution of MapReduce is controlled by the following:
a) Job tracker - also known as the master
b) Multiple task trackers - also known as slaves
3. A single job is divided into multiple tasks that run on multiple
computer nodes. The job tracker's role is to schedule these tasks
across the nodes and coordinate them. Where possible, it assigns a
map task to a node that already stores that task's input data, so the
computation moves to the data instead of the data moving across the
network. The task trackers execute the tasks and send progress
reports to the job tracker, which monitors the overall progress of
the job. If a task fails, the job tracker can reschedule a new task
on a different task tracker.
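The coordination described above can be sketched as a toy scheduler. This is a simplified illustration, not Hadoop's actual scheduling code. All names (pick_tracker, schedule, reschedule, the trackers "A", "B", "C") are hypothetical, and it assumes that each task's input is replicated on more than one node. It prefers a tracker that already holds the task's input, which is the locality the question asks about, and falls back to any live tracker otherwise.

```python
def pick_tracker(task, replicas, live, load):
    """Prefer a live tracker that stores the task's input; else any live one."""
    if not live:
        raise RuntimeError("no live task trackers")
    local = [t for t in live if t in replicas.get(task, ())]
    candidates = local or live
    # Least-loaded candidate wins; the name breaks ties deterministically.
    return min(candidates, key=lambda t: (load[t], t))


def schedule(tasks, replicas, trackers):
    """Job tracker role: assign every task to a tracker, data-local if possible."""
    load = {t: 0 for t in trackers}
    assignment = {}
    for task in tasks:
        chosen = pick_tracker(task, replicas, trackers, load)
        assignment[task] = chosen
        load[chosen] += 1
    return assignment


def reschedule(assignment, replicas, trackers, failed):
    """Move only the tasks that were placed on failed trackers."""
    live = [t for t in trackers if t not in failed]
    load = {t: 0 for t in live}
    for tracker in assignment.values():
        if tracker in load:
            load[tracker] += 1
    updated = dict(assignment)
    for task, tracker in assignment.items():
        if tracker in failed:
            chosen = pick_tracker(task, replicas, live, load)
            updated[task] = chosen
            load[chosen] += 1
    return updated


if __name__ == "__main__":
    # Assumed layout: every input block is stored on two of the three nodes.
    replicas = {"m1": {"A", "B"}, "m2": {"B", "C"}, "m3": {"A", "C"}}
    plan = schedule(["m1", "m2", "m3"], replicas, ["A", "B", "C"])
    print(plan)
    print(reschedule(plan, replicas, ["A", "B", "C"], failed={"A"}))
```

In this layout, when tracker "A" fails its task can still be placed on another node that holds a copy of the same input. Adding nodes adds both storage and compute without pushing more data over the network.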