MATLAB® Distributed Computing Engine 3 System Administrator’s Guide
How to Contact The MathWorks

www.mathworks.com                    Web
comp.soft-sys.matlab                 Newsgroup
www.mathworks.com/contact_TS.html    Technical Support

suggest@mathworks.com                Product enhancement suggestions
bugs@mathworks.com                   Bug reports
doc@mathworks.com                    Documentation error reports
service@mathworks.com                Order status, license renewals, passcodes
info@mathworks.com                   Sales, pricing, and general information

508-647-7000 (Phone)
508-647-7001 (Fax)

The MathWorks, Inc.
Revision History

November 2005     Online only    New for Version 2.0 (Release 14SP3+)
December 2005     Online only    Revised for Version 2.0 (Release 14SP3+)
March 2006        Online only    Revised for Version 2.0.1 (Release 2006a)
September 2006    Online only    Revised for Version 3.0 (Release 2006b)
March 2007        Online only    Revised for Version 3.1 (Release 2007a)
September 2007    Online only    Revised for Version 3.
Contents

1 Introduction

What Are the Distributed Computing Products? 1-2
    Overview 1-2
    Determining Product Installation and Versions 1-3
Toolbox and Engine Components
    Job Managers, Workers, and Clients
    Third-Party Schedulers
    Defining the Script Defaults 2-10
    Overriding the Script Defaults 2-11
Accessing Service Record Files 2-13
    Locating Log Files 2-13
    Locating Checkpoint Directories 2-14
Troubleshooting
Control Scripts — Alphabetical List 4 Glossary Index vii
1 Introduction

This chapter provides an introduction to the concepts and terms of Distributed Computing Toolbox and MATLAB® Distributed Computing Engine.

What Are the Distributed Computing Products? (p. 1-2)
    Overview of Distributed Computing Toolbox and MATLAB Distributed Computing Engine, and their capabilities
Toolbox and Engine Components (p. 1-4)
    Descriptions of the parts and configurations of a distributed computing setup
Using Distributed Computing Toolbox (p.
What Are the Distributed Computing Products?

In this section...
“Overview” on page 1-2
“Determining Product Installation and Versions” on page 1-3

Overview

Distributed Computing Toolbox and MATLAB Distributed Computing Engine enable you to coordinate and execute independent MATLAB operations simultaneously on a cluster of computers, speeding up execution of large MATLAB jobs. A job is some large operation that you need to perform in your MATLAB session.
[Figure: Basic Distributed Computing Configuration. Components shown: MATLAB Client (Distributed Computing Toolbox), Scheduler or Job Manager, and three MATLAB Workers (MATLAB Distributed Computing Engine).]

Determining Product Installation and Versions

To determine if Distributed Computing Toolbox is installed on your system, type this command at the MATLAB prompt:

    ver

When you enter this command, MATL
Toolbox and Engine Components

In this section...
“Job Managers, Workers, and Clients” on page 1-4
“Third-Party Schedulers” on page 1-6
“Components on Mixed Platforms or Heterogeneous Clusters” on page 1-7
“MATLAB Distributed Computing Engine Service” on page 1-7

Job Managers, Workers, and Clients

The optional job manager can run on any machine on the network.
[Figure: Interactions of Distributed Computing Sessions. Labels shown: Client, Job, All Results, Scheduler or Job Manager, Task, Results, Worker.]

A large network might include several job managers as well as several client sessions. Any client session can create, run, and access jobs on any job manager, but a worker session is registered with and dedicated to only one job manager at a time.
Third-Party Schedulers

As an alternative to using the MathWorks job manager, you can use a third-party scheduler. This could be Windows CCS, Platform Computing LSF, mpiexec, or a generic scheduler.
If you have a large cluster, you probably already have a scheduler. Consult your MathWorks representative if you have questions about cluster size and the job manager.

• Who administers your cluster? The person administering your cluster might have a preference for how jobs are scheduled.

Components on Mixed Platforms or Heterogeneous Clusters

Distributed Computing Toolbox and MATLAB Distributed Computing Engine are supported on Windows, UNIX, and Macintosh platforms.
Using Distributed Computing Toolbox

A typical Distributed Computing Toolbox client session includes the following steps:

1 Find a Job Manager (or scheduler) — Your network may have one or more job managers available (but usually only one scheduler). The function you use to find a job manager or scheduler creates an object in your current MATLAB session to represent the job manager or scheduler that will run your job.

2 Create a Job — You create a job to hold a collection of tasks.
2 Network Administration

This chapter provides information useful for network administration of Distributed Computing Toolbox and MATLAB Distributed Computing Engine.

Preparing for Distributed Computing (p. 2-2)
    Examines network requirements and limitations for running Distributed Computing Toolbox and MATLAB Distributed Computing Engine
Installing and Configuring (p. 2-5)
    Where to find installation and configuration instructions
Shutting Down a Job Manager Configuration (p.
Preparing for Distributed Computing

In this section...
“Before You Start” on page 2-2
“Planning Your Network Layout” on page 2-2
“Network Requirements” on page 2-3
“Fully Qualified Domain Names” on page 2-3
“Security Considerations” on page 2-4

This section discusses the requirements and configurations for your network to support distributed computing.
Session        Product                                 Processes
Client         Distributed Computing Toolbox           MATLAB with toolbox
Worker         MATLAB Distributed Computing Engine     worker; mdce service (if using a job manager)
Job manager    MATLAB Distributed Computing Engine     mdce service; job manager

The MATLAB Distributed Computing Engine (mdce) service or daemon is included in the engine software.
Security Considerations

The distributed computing products do not provide any security measures. Therefore, you should be aware of the following security considerations:

• MATLAB workers run as whatever user the administrator starts the node’s mdce service under. By default, the mdce service starts as root on UNIX and as LocalSystem on Windows. Because MATLAB provides system calls, users can submit jobs that execute shell commands.
Installing and Configuring

To find the most up-to-date instructions for installing and configuring the current or past versions of the distributed computing products, visit the MathWorks Web site at http://www.mathworks.
Shutting Down a Job Manager Configuration

In this section...
“UNIX and Macintosh” on page 2-6
“Windows” on page 2-8

If you are done using the job manager and its workers, you might want to shut down the engine processes so that they are not consuming network resources. You do not need to be at the computer running the processes that you are shutting down. You can run these commands from any machine with network access to the processes.
If you have more than one worker session running, you can stop each of them individually by host and name.

    stopworker -name worker1 -remotehost <worker hostname>
    stopworker -name worker2 -remotehost <worker hostname>

For a list of all options to the script, type

    stopworker -help

Stopping and Uninstalling the MDCE Daemon

Normally, you configure the mdce daemon to start at system boot time and continue running until the machine shuts down.
Windows

Stopping the Job Manager and Workers

1 To shut down the job manager, enter the commands

    cd matlabroot\toolbox\distcomp\bin

(Enter the following command on a single line.)

    stopjobmanager -remotehost <job manager hostname> -name <job manager name> -v

If you have more than one job manager running, stop each of them individually by host and name.
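When several job managers are running, the per-manager stop command can be generated in a loop. The following is a minimal sketch in POSIX shell, as it might be run from a UNIX or Macintosh machine with network access to the job managers. It reuses only the -remotehost, -name, and -v flags shown above. The function name print_stop_jobmanagers and the name:host pair format are hypothetical, and the function prints the commands instead of running them so that they can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: print one stopjobmanager command per "name:host" pair.
# It prints rather than executes, so nothing is stopped by running it.
print_stop_jobmanagers() {
    for pair in "$@"; do
        name=${pair%%:*}   # text before the first colon
        host=${pair#*:}    # text after the first colon
        echo "stopjobmanager -remotehost $host -name $name -v"
    done
}

# Example with made-up job manager names and hosts:
print_stop_jobmanagers "MyJobManager:JMHost" "OtherJobManager:JMHost2"
```

Once the printed commands look right, they could be pasted into a shell in the directory that holds the control scripts.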
    cd matlabroot\toolbox\distcomp\bin
    mdce stop

If you plan to uninstall MATLAB Distributed Computing Engine from a machine, you might want to uninstall the mdce service also, as you will not need it any longer. You do not need to stop the service before uninstalling it. To uninstall the mdce service, enter the following commands at a DOS command prompt.
Customizing Engine Services

In this section...
“Defining the Script Defaults” on page 2-10
“Overriding the Script Defaults” on page 2-11

The scripts of MATLAB Distributed Computing Engine run using several default parameters. You can customize the scripts, as described in this section.

Defining the Script Defaults

The scripts for the engine services require values for several parameters. These parameters set the process name, the user name, log file location, ports, etc.
Setting the User

By default, the job manager and worker services run as the user who starts them. You can run the services as a different user with the following settings in the mdce_def file.

Parameter    Description
MDCEUSER     Set this parameter to run the mdce services as a user different from the user who starts the service. On a UNIX system, set the value before starting the service; on a Windows system, set it before installing the service.
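One way to set MDCEUSER without touching the installed defaults is to edit a private copy of the file. The sketch below is a rough illustration in POSIX shell. It assumes, without confirmation from this guide, that the UNIX defaults file stores the parameter as a shell-style line of the form MDCEUSER="...". The function name set_mdce_user, the user name mdceuser1, and the OTHER_SETTING line are all hypothetical, and the demonstration runs on a stand-in file, not on a real mdce_def.sh.

```shell
#!/bin/sh
# Sketch: set MDCEUSER in a private copy of a defaults file.
# Assumption: the file holds a shell-style line of the form MDCEUSER="...".
set_mdce_user() {
    src=$1; dst=$2; user=$3
    # Rewrite only the MDCEUSER assignment; every other line is copied as-is.
    sed "s/^MDCEUSER=.*/MDCEUSER=\"$user\"/" "$src" > "$dst"
}

# Demonstration on a stand-in file (OTHER_SETTING is a made-up parameter):
tmpdir=$(mktemp -d)
printf 'MDCEUSER=""\nOTHER_SETTING="unchanged"\n' > "$tmpdir/mdce_def.sh"
set_mdce_user "$tmpdir/mdce_def.sh" "$tmpdir/my_mdce_def.sh" "mdceuser1"
cat "$tmpdir/my_mdce_def.sh"
```

A user name containing a slash or an ampersand would need escaping before being passed to sed in this way. A modified copy of this kind is what the -mdcedef option of the mdce script is meant to receive.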
• matlabroot\toolbox\distcomp\bin\mdce_def.bat (Windows)
• matlabroot/toolbox/distcomp/bin/mdce_def.sh (UNIX or Macintosh)

Before installing and starting the mdce service, you can edit this file to set the default parameters with values you require.

Alternatively, you can make a copy of this file, modify the copy, and specify that this copy be used for the default parameters.

On UNIX or Macintosh,

    mdce start -mdcedef my_mdce_def.sh

On Windows,

    mdce install -mdcedef my_mdce_def.
Accessing Service Record Files

In this section...
“Locating Log Files” on page 2-13
“Locating Checkpoint Directories” on page 2-14

The services of MATLAB Distributed Computing Engine generate various record files in the normal course of their operations. The mdce service, job manager, and worker sessions all generate such files. The types of information stored by the services are described in this section.
Locating Checkpoint Directories

Checkpoint directories contain information related to persistence data, which the engine services use to create continuity from one instance of a session to another. For example, if you stop and restart a job manager, the new session will continue the old session, using all the same data.

A primary feature offered by the checkpoint directories is in crash recovery.
Platform    File Location
Windows     On Windows systems, the default location of the checkpoint directories is <TEMP>\MDCE\Checkpoint, where <TEMP> is the value of the system TEMP variable. For example, if TEMP is set to C:\TEMP, then the checkpoint directories are placed in C:\TEMP\MDCE\Checkpoint. You can set alternative locations for the checkpoint directories by modifying the CHECKPOINTBASE setting in the mdce_def.bat file before starting the mdce service.
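The default Windows location is simply the TEMP value with \MDCE\Checkpoint appended. As a small illustration, the sketch below builds that path in POSIX shell from a TEMP value passed in as text. The function name default_checkpoint_dir is hypothetical, and the script does not read any real system variable or touch the mdce service.

```shell
#!/bin/sh
# Hypothetical helper: build the default Windows checkpoint location
# from a TEMP value, following the rule stated in the table above.
default_checkpoint_dir() {
    # %s inserts the TEMP value unchanged; \\ in the format prints one backslash.
    printf '%s\\MDCE\\Checkpoint\n' "$1"
}

default_checkpoint_dir 'C:\TEMP'   # prints C:\TEMP\MDCE\Checkpoint
```

The single quotes around the argument keep the shell from treating the backslashes as escape characters.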
Troubleshooting

In this section...
“License Errors” on page 2-16
“Verifying Multicast Communications” on page 2-18
“Memory Errors on UNIX” on page 2-19
“Running MDCE Processes from a Windows Network Installation” on page 2-19
“Required Ports” on page 2-20

This section offers advice on solving problems you might encounter with the MATLAB Distributed Computing Engine.
• If you receive this error when starting a worker with the Distributed Computing Engine - You may be calling the startworker command from an installation that does not have access to a worker license. For example, starting a worker from a client installation of Distributed Computing Toolbox causes the following error.

    The mdce service on the host hostname returned the following error:
    Problem starting the MATLAB worker.
Have your MATLAB administrator verify that the license manager is running and validate network services. For more information, see The MathWorks Support page at http://www.mathworks.
Inside MATLAB, the class would be used as follows.

    m = com.mathworks.toolbox.distcomp.test.MulticastTester('239.1.1.1', 9999);
    m.startSendingThread;
    m.startListeningThread;

    0 : host1name : 0
    1 : host2name : 0

From a shell prompt, you would type (assuming that Java is on your path)

    java -cp distcomp.jar com.mathworks.toolbox.distcomp.test.
Required Ports

Using a Job Manager

BASE_PORT. The ports required by the job manager and all workers are specified and described in the mdce_def file. See the following file in the MATLAB installation used for each cluster process:

    matlabroot/toolbox/distcomp/bin/mdce_def.sh (UNIX)
    matlabroot\toolbox\distcomp\bin\mdce_def.bat (Windows)

Parallel Jobs.
3 On the Edit menu, click New, and then add the following registry entry:

    Value Name: MaxUserPort
    Value Type: DWORD
    Value data: 65534
    Valid Range: 5000-65534 (decimal)
    Default: 0x1388 (5000 decimal)
    Description: This parameter controls the maximum port number that is used when a program requests any available user port from the system. Typically, ephemeral (short-lived) ports are allocated between the values of 1024 and 5000 inclusive.

4 Quit Registry Editor.

5 Reboot your machine.
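The registry entry mixes hexadecimal and decimal notation, so it can help to check a proposed value before entering it. The following POSIX shell sketch converts the listed default and tests a few values against the valid range quoted above. The function name valid_max_user_port is hypothetical, and the script only does arithmetic; it does not read or write the registry.

```shell
#!/bin/sh
# Hypothetical helper: check a proposed MaxUserPort value against the
# valid range quoted above (5000-65534 decimal).
valid_max_user_port() {
    [ "$1" -ge 5000 ] && [ "$1" -le 65534 ]
}

# The default is listed in hexadecimal; shell arithmetic converts it.
echo "default: $((0x1388))"    # prints default: 5000

for value in 4999 5000 65534 65535; do
    if valid_max_user_port "$value"; then
        echo "$value is within range"
    else
        echo "$value is out of range"
    fi
done
```

Both ends of the range are accepted, which matches the value data of 65534 recommended in the entry.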
3 Control Scripts — By Category

MDCE Control (p. 3-2)
    Control mdce service
Job Manager Control (p. 3-2)
    Control job manager
Worker Control (p.
MDCE Control

mdce               Install, start, stop, or uninstall mdce service
nodestatus         Status of MDCE processes running on node

Job Manager Control

startjobmanager    Start job manager process
stopjobmanager     Stop job manager process

Worker Control

startworker        Start MATLAB worker session
stopworker         Stop MATLAB worker session
4 Control Scripts — Alphabetical List
mdce

Purpose
    Install, start, stop, or uninstall mdce service

Syntax
    mdce install
    mdce uninstall
    mdce start
    mdce stop
    mdce console
    mdce restart
    mdce ...
    mdce

Description
    The mdce service ensures that all other processes are running and that it is possible to communicate with them. Once the mdce service is running, you can use the nodestatus command to obtain information about the mdce service and all the processes it maintains.
mdce ... -mdcedef uses the specified alternative mdce defaults file instead of the one found in matlabroot/toolbox/distcomp/bin.

mdce status reports the status of the mdce service, indicating whether it is running and with what PID. Use nodestatus to obtain more detailed information about the mdce service. The mdce status command is available only on UNIX and Macintosh.
nodestatus

Purpose
    Status of MDCE processes running on node

Syntax
    nodestatus
    nodestatus -flags

Description
    nodestatus displays the status of the mdce service and the processes which it maintains. The mdce service must already be running on the specified computer.

    nodestatus -flags accepts the following input flags. Multiple flags can be used together on the same command.
Examples
    Display basic information about the mdce processes on the local host.

        nodestatus

    Display detailed information about the status of the mdce processes on host node27.
startjobmanager

Purpose
    Start job manager process

Syntax
    startjobmanager
    startjobmanager -flags

Description
    startjobmanager starts a job manager process and the associated job manager lookup process under the mdce service, which maintains them after that. The job manager handles the storage of jobs and the distribution of tasks contained in jobs to MATLAB workers that are registered with it. The mdce service must already be running on the specified computer.
Flag          Operation
-multicast    Overrides the use of unicast to contact the job manager lookup process. It is recommended that you not use -multicast unless you are certain that multicast works on your network. This overrides the setting of JOB_MANAGER_HOST in the mdce_def file on the remote host, which would have the job manager use unicast.

Examples
Start the job manager MyJobManager on the host JMHost.
startworker

Purpose
    Start MATLAB worker session

Syntax
    startworker
    startworker -flags

Description
    startworker starts a MATLAB worker process under the mdce service, which maintains it after that. The worker registers with the specified job manager, from which it will get tasks for evaluation. The mdce service must already be running on the specified computer.

    startworker -flags accepts the following input flags. Multiple flags can be used together on the same command, except where noted.
Flag               Operation
-jobmanagerhost    Specifies the host on which the job manager is running by using -jobmanagerhost. The worker will then use unicast to contact the job manager lookup process on that host in order to register with the job manager. This overrides the setting of JOB_MANAGER_HOST in the mdce_def file on the worker computer, which would also have the worker use unicast. Cannot be used together with -multicast.
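Because -jobmanagerhost and -multicast are mutually exclusive, a wrapper can catch the conflict before the command is issued. The sketch below, in POSIX shell, is one possible pre-check and is not part of the product. The function name check_startworker_flags is hypothetical, the host JMHost is the illustrative name used in this guide's examples, and the function prints the resulting command instead of running it.

```shell
#!/bin/sh
# Hypothetical pre-check: reject a startworker flag list that combines
# -jobmanagerhost with -multicast, which the table above disallows.
check_startworker_flags() {
    has_host=0; has_multicast=0
    for arg in "$@"; do
        case $arg in
            -jobmanagerhost) has_host=1 ;;
            -multicast)      has_multicast=1 ;;
        esac
    done
    if [ "$has_host" -eq 1 ] && [ "$has_multicast" -eq 1 ]; then
        echo "error: -jobmanagerhost cannot be combined with -multicast" >&2
        return 1
    fi
    echo "startworker $*"   # print the command instead of running it
}

check_startworker_flags -jobmanagerhost JMHost -v
```

The check looks only at the two flags named in the table and passes every other argument through unchanged.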
Flag         Operation
-baseport    Specifies the base port that the mdce service on the remote host is using. You only need to specify this if the value of BASE_PORT in the local mdce_def file does not match the base port being used by the mdce service on the remote host.
-v           Verbose mode displays the progress of the command execution.

Examples
    Start a worker on the local host, using the default worker name, registering with the job manager MyJobManager on the host JMHost.
See Also
    mdce, nodestatus, startjobmanager, stopjobmanager, stopworker
stopjobmanager

Purpose
    Stop job manager process

Syntax
    stopjobmanager
    stopjobmanager -flags

Description
    stopjobmanager stops a job manager that is running under the mdce service.

    stopjobmanager -flags accepts the following input flags. Multiple flags can be used together on the same command.

Flag     Operation
-name    Specifies the name of the job manager to stop. The default is the value of the DEFAULT_JOB_MANAGER_NAME parameter in the mdce_def file.
Examples
    Stop the job manager MyJobManager on the local host.

        stopjobmanager -name MyJobManager

    Stop the job manager MyJobManager on the host JMHost.
stopworker

Purpose
    Stop MATLAB worker session

Syntax
    stopworker
    stopworker -flags

Description
    stopworker stops a MATLAB worker process that is running under the mdce service.

    stopworker -flags accepts the following input flags. Multiple flags can be used together on the same command.

Flag     Operation
-name    Specifies the name of the MATLAB worker to stop. The default is the value of the DEFAULT_WORKER_NAME parameter in the mdce_def file.
Examples
    Stop the worker with the default name on the local host.

        stopworker

    Stop the worker with the default name, running on the computer WorkerHost.

        stopworker -remotehost WorkerHost

    Stop the workers named worker1 and worker2, running on the computer WorkerHost.
Glossary

CHECKPOINTBASE
    The name of the parameter in the mdce_def file that defines the location of the job manager and worker checkpoint directories.

checkpoint directory
    Location where job manager checkpoint information and worker checkpoint information is stored.

client
    The MATLAB session that defines and submits the job. This is the MATLAB session in which the programmer usually develops and prototypes applications. Also known as the MATLAB client.
distributed computing
    Computing with distributed applications, running the application on several nodes simultaneously.

distributed computing demos
    Demonstration programs that use Distributed Computing Toolbox, as opposed to sequential demos.

DNS
    Domain Name System. A system that translates Internet domain names into IP addresses.
job
    The complete large-scale operation to perform in MATLAB, composed of a set of tasks.

job manager
    The MathWorks process that queues jobs and assigns tasks to workers. A third-party process that performs this function is called a scheduler. The general term "scheduler" can also refer to a job manager.

job manager checkpoint information
    Snapshot of information necessary for the job manager to recover from a system crash or reboot.
mdce
    The service that has to run on all machines before they can run a job manager or worker. This is the engine foundation process, making sure that the job manager and worker processes that it controls are always running. Note that the program and service name is all lowercase letters.

mdce_def file
    The file that defines all the defaults for the mdce processes by allowing you to set preferences or definitions in the form of parameter values.
scheduler
    The process, either third-party or the MathWorks job manager, that queues jobs and assigns tasks to workers.

task
    One segment of a job to be evaluated by a worker.

variant array
    An array which resides in the workspaces of all labs, but whose content differs on these labs.

worker
    The MATLAB process that performs the task computations. Also known as the MATLAB worker or worker process.

worker checkpoint information
    Files required by the worker during the execution of tasks.
Index

A
administration
    network 2-1

C
checkpoint directory
    definition Glossary-1
    locating 2-14
CHECKPOINTBASE
    definition Glossary-1
clean state
    starting services 2-12
client
    definition Glossary-1
    process 1-4
client computer
    definition Glossary-1
cluster
    definition Glossary-1
coarse-grained application
    definition Glossary-1
computer
    definition Glossary-1
configuring
    MDCE 2-5
control scripts
    customizing 2-10
    defaults 2-10
    mdce 4-2
    nodestatus 4-4
    startjobmanager 4-6
    startworker 4-9
    stopjobmanager 4-13
    database
        definition Glossary-3
    definition Glossary-3
    logs 2-13
    lookup process
        definition Glossary-3
    multiple on one machine 2-10
    process 1-4
    stopping
        on UNIX or Macintosh 2-6
        on Windows 2-8
    versus third-party scheduler 1-6

L
lab
    definition Glossary-3
log files
    locating 2-13
LOGDIR
    definition Glossary-3

M
MathWorks job manager.
    license errors 2-16
    memory errors 2-19
    verifying multicast 2-18
    Windows network installation 2-19

U
user
    setting 2-11

W
worker
    definition Glossary-5
    process 1-4
worker checkpoint information
    definition Glossary-5
workers
    logs 2-13
    stopping
        on UNIX or Macintosh 2-6
        on Windows 2-8