Job Scheduling :

Early versions of Hadoop had a very simple approach to scheduling users’ jobs: they ran in order of submission, using a FIFO scheduler. Typically, each job would use the whole cluster, so jobs had to wait their turn. Although a shared cluster offers great potential for providing large resources to many users, the problem of sharing resources fairly between users requires a better scheduler: production jobs need to complete in a timely manner, while still allowing users who are making smaller ad hoc queries to get results back in a reasonable time.

The ability to set a job’s priority was added, via the mapred.job.priority property or the setJobPriority() method on JobClient (both of which take one of the values VERY_HIGH, HIGH, NORMAL, LOW, or VERY_LOW). When the job scheduler is choosing the next job to run, it selects the one with the highest priority.
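As a minimal sketch using the old org.apache.hadoop.mapred API (the job name and the input/output paths taken from the command line are placeholders), a job’s priority can be raised in its JobConf before submission:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobPriority;

    public class HighPriorityJob {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(HighPriorityJob.class);
        conf.setJobName("ad-hoc-query");                    // placeholder name

        // Placeholder input/output paths; the default identity mapper and
        // reducer are used, since the job body is not the point here.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Raise the priority so the scheduler picks this job ahead of
        // NORMAL-priority jobs that are still waiting in the queue.
        conf.setJobPriority(JobPriority.HIGH);
        // Equivalent property form:
        // conf.set("mapred.job.priority", "HIGH");

        JobClient.runJob(conf);
      }
    }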

However, with the FIFO scheduler, priorities do not support preemption, so a high-priority job can still be blocked by a long-running, low-priority job that started before the high-priority job was scheduled.

MapReduce in Hadoop comes with a choice of schedulers. The default is the original FIFO queue-based scheduler, and there are also two multiuser schedulers:

  • The Fair Scheduler
  • The Capacity Scheduler

The Fair Scheduler :

The Fair Scheduler aims to give every user a fair share of the cluster capacity over time.

If a single job is running, it gets all of the cluster. As more jobs are submitted, free task slots are given to the jobs in such a way as to give each user a fair share of the cluster.

A short job belonging to one user will complete in a reasonable time even while another user’s long job is running, and the long job will still make progress. Jobs are placed in pools, and by default, each user gets their own pool. It is also possible to define custom pools with guaranteed minimum capacities defined in terms of the number of map and reduce slots, and to set weightings for each pool.
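As a rough sketch (the pool name is hypothetical, and the mapred.fairscheduler.pool job property assumes the classic MapReduce Fair Scheduler), a job can be directed at a particular pool through its configuration; the pool’s minimum capacity and weight are declared by the administrator in the scheduler’s allocation file, not in the job:

    import org.apache.hadoop.mapred.JobConf;

    public class FairSchedulerPoolExample {
      public static void main(String[] args) {
        JobConf conf = new JobConf();

        // Place this job in a named pool instead of the default per-user
        // pool. "production" is a placeholder; its guaranteed map/reduce
        // slots and weight live in the Fair Scheduler's allocation file.
        conf.set("mapred.fairscheduler.pool", "production");

        System.out.println("pool = " + conf.get("mapred.fairscheduler.pool"));
      }
    }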

The Fair Scheduler supports preemption, so if a pool has not received its fair share for a certain period of time, then the scheduler will kill tasks in pools running over capacity in order to give the slots to the pool running under capacity.

The Capacity Scheduler :

The Capacity Scheduler takes a slightly different approach to multiuser scheduling. A cluster is made up of a number of queues (like the Fair Scheduler’s pools), which may be hierarchical (so a queue may be the child of another queue), and each queue has an allocated capacity.

This is like the Fair Scheduler, except that within each queue, jobs are scheduled using FIFO scheduling (with priorities). The Capacity Scheduler allows users or organizations to simulate a separate MapReduce cluster with FIFO scheduling for each user or organization.
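As a brief sketch (the queue name is a placeholder; setQueueName() and the mapred.job.queue.name property belong to the old JobConf API), a job selects the queue it is submitted to in its configuration:

    import org.apache.hadoop.mapred.JobConf;

    public class CapacityQueueExample {
      public static void main(String[] args) {
        JobConf conf = new JobConf();

        // Submit to a named queue; within that queue the job is then
        // scheduled FIFO (with priorities). "etl" is a placeholder for a
        // queue the administrator has defined, along with its capacity,
        // in the Capacity Scheduler configuration.
        conf.setQueueName("etl");
        // Equivalent property form: conf.set("mapred.job.queue.name", "etl");

        System.out.println("queue = " + conf.getQueueName());
      }
    }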

The Fair Scheduler, by contrast, enforces fair sharing within each pool, so running jobs share the pool’s resources.
