3 Implementation
The main components in a TensorFlow system are the client, which uses the Session interface to communicate with the master, and one or more worker processes, with each worker process responsible for arbitrating access to one or more computational devices (such as CPU cores or GPU cards) and for executing graph nodes on those devices as instructed by the master. We have both local and distributed implementations of the TensorFlow interface. The local implementation is used when the client, the master, and the worker all run on a single machine in the context of a single operating system process (possibly with multiple devices, if, for example, the machine has many GPU cards installed). The distributed implementation shares most of the code with the local implementation, but extends it with support for an environment where the client, the master, and the workers can all be in different processes on different machines. In our distributed environment, these different tasks are containers in jobs managed by a cluster scheduling system [51]. These two different modes are illustrated in Figure 3. Most of the rest of this section discusses issues that are common to both implementations, while Section 3.3 discusses some issues that are particular to the distributed implementation.
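As a concrete illustration (a minimal sketch using the later open-source TF 1.x Python API, not code from the paper), the same client program can drive either implementation simply by changing the Session target; the gRPC address below is hypothetical:

```python
import tensorflow as tf

# Build a trivial graph in the client.
c = tf.constant(3.0) * tf.constant(4.0)

# Local implementation: client, master, and worker live in one process.
with tf.Session() as sess:
    print(sess.run(c))  # 12.0

# Distributed implementation: the master runs in a separate server process;
# "grpc://worker0.example.com:2222" is a hypothetical address.
with tf.Session("grpc://worker0.example.com:2222") as sess:
    print(sess.run(c))
```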
Devices
Devices are the computational heart of TensorFlow. Each worker is responsible for one or more devices, and each device has a device type and a name. Device names are composed of pieces that identify the device's type, the device's index within the worker, and, in our distributed setting, an identification of the job and task of the worker (or localhost for the case where the devices are local to the process). Example device names are "/job:localhost/device:cpu:0" or "/job:worker/task:17/device:gpu:3". We have implementations of our Device interface for CPUs and GPUs, and new device implementations for other device types can be provided via a registration mechanism. Each device object is responsible for managing allocation and deallocation of device memory, and for arranging for the execution of any kernels that are requested by higher levels in the TensorFlow implementation.
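The sketch below shows how a client can pin operations to named devices using the open-source TF 1.x `tf.device` scope; the job, task, and device indices are illustrative, and the exact device-string syntax accepted may differ slightly across releases:

```python
import tensorflow as tf

# Pin an op to the first CPU of the local process.
with tf.device("/cpu:0"):
    a = tf.constant([[1.0, 2.0]])

# In the distributed setting, the device name also identifies the worker's
# job and task, e.g. task 17 of job "worker" (illustrative values).
with tf.device("/job:worker/task:17/gpu:3"):
    b = tf.matmul(a, a, transpose_b=True)
```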
Tensors
A tensor in our implementation is a typed, multi-dimensional array. We support a variety of tensor element types, including signed and unsigned integers ranging in size from 8 bits to 64 bits, IEEE float and double types, a complex number type, and a string type (an arbitrary byte array). Backing store of the appropriate size is managed by an allocator that is specific to the device on which the tensor resides. Tensor backing store buffers are reference counted and are deallocated when no references remain.
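For illustration, a few of the supported element types expressed with TF 1.x dtypes (a sketch, not an exhaustive list):

```python
import tensorflow as tf

t_i8  = tf.constant([1, -2], dtype=tf.int8)        # 8-bit signed integer
t_u8  = tf.constant([3, 4], dtype=tf.uint8)        # 8-bit unsigned integer
t_f64 = tf.constant([1.0], dtype=tf.float64)       # IEEE double
t_c64 = tf.constant([1 + 2j], dtype=tf.complex64)  # complex number type
t_str = tf.constant([b"arbitrary bytes"])          # string (arbitrary byte array)
```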
3.1 Single-Device Execution
Let's first consider the simplest execution scenario: a single worker process with a single device. The nodes of the graph are executed in an order that respects the dependencies between nodes. In particular, we keep track of a count per node of the number of dependencies of that node that have not yet been executed. Once this count drops to zero, the node is eligible for execution and is added to a ready queue. The ready queue is processed in some unspecified order, delegating execution of the kernel for a node to the device object. When a node has finished executing, the counts of all nodes that depend on the completed node are decremented.
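A minimal sketch of this dependency-counting scheme is shown below; the node, graph, and device objects are hypothetical stand-ins, not the actual TensorFlow classes:

```python
from collections import deque

def run_single_device(graph, device):
    """graph.nodes: all nodes; node.inputs: producer nodes this node depends on;
    node.outputs: downstream nodes that consume this node's results."""
    # Count, for each node, how many of its dependencies have not yet executed.
    pending = {node: len(node.inputs) for node in graph.nodes}
    ready = deque(node for node, count in pending.items() if count == 0)

    while ready:
        node = ready.popleft()        # the order among ready nodes is unspecified
        device.run_kernel(node)       # delegate the node's kernel to the device
        for succ in node.outputs:     # node finished: decrement its dependents
            pending[succ] -= 1
            if pending[succ] == 0:    # all dependencies done: succ becomes ready
                ready.append(succ)
```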
3.2 Multi-Device Execution
Once a system has multiple devices, there are two main complications: deciding which device to place the computation for each node in the graph, and then managing the required communication of data across device boundaries implied by these placement decisions. This subsection discusses these two issues.
3.2.1 Node Placement
Given a computation graph, one of the main responsibilities of the TensorFlow implementation is to map the computation onto the set of available devices. A simplified version of this algorithm is presented here. See Section 4.3 for extensions supported by this algorithm.
One input to the placement algorithm is a cost model, which contains estimates of the sizes (in bytes) of the input and output tensors for each graph node, along with estimates of the computation time required for each node when presented with its input tensors. This cost model is either statically estimated based on heuristics associated with different operation types, or is measured based on an actual set of placement decisions for earlier executions of the graph.
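A cost model of this kind might be represented per node roughly as follows (a sketch; the field names are hypothetical and not taken from the TensorFlow code base):

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class NodeCost:
    output_bytes: int               # estimated size of the node's output tensors
    compute_time: Dict[str, float]  # estimated run time in seconds, per device type

# Example: a matmul node, estimated statically or from earlier measured runs.
matmul_cost = NodeCost(output_bytes=4 * 1024 * 1024,
                       compute_time={"cpu": 2.0e-3, "gpu": 1.5e-4})
```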
The placement algorithm first runs a simulated execution of the graph. The simulation is described below and ends up picking a device for each node in the graph using greedy heuristics. The node-to-device placement generated by this simulation is also used as the placement for the real execution.
The placement algorithm starts with the sources of the computation graph, and simulates the activity on each device in the system as it progresses. For each node that is reached in this traversal, the set of feasible devices is considered (a device may not be feasible if the device does not provide a kernel that implements the particular operation). For nodes with multiple feasible devices, the placement algorithm uses a greedy heuristic that examines the effects on the completion time of the node of placing the node on each possible device. This heuristic takes into account the estimated or measured execution time of the operation on that kind of device from the cost model, and also includes the costs of any communication that would be introduced in order to transmit inputs to this node from other devices to the considered device. The device where the node's operation would finish the soonest is selected as the device for that operation, and the placement process then continues onwards to make placement decisions for other nodes in the graph, including downstream nodes that are now ready for their own simulated execution. Section 4.3 describes some extensions that allow users to provide hints and partial constraints to guide the placement algorithm. The placement algorithm is an area of ongoing development within the system.
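The sketch below captures the flavor of this greedy simulation under simplifying assumptions (nodes visited in a dependency-respecting order, one output tensor per node, a single transfer-cost function); all names and structures are illustrative rather than the actual implementation:

```python
def place(nodes, devices, feasible, compute_time, output_bytes, transfer_time):
    """nodes: list of (name, input_names) in dependency order.
    feasible[name]: devices that provide a kernel for the node's operation.
    compute_time[name][dev]: estimated run time of the node on device dev.
    output_bytes[name]: estimated size of the node's output tensor.
    transfer_time(nbytes): estimated cost of moving nbytes across devices."""
    placement, finish = {}, {}
    busy_until = {dev: 0.0 for dev in devices}   # simulated per-device clock

    for name, inputs in nodes:
        best_dev, best_done = None, float("inf")
        for dev in feasible[name]:
            # Inputs placed on other devices must be transmitted to dev.
            comm = sum(transfer_time(output_bytes[i])
                       for i in inputs if placement[i] != dev)
            start = max([busy_until[dev]] + [finish[i] for i in inputs])
            done = start + comm + compute_time[name][dev]
            if done < best_done:                 # pick the earliest finish time
                best_dev, best_done = dev, done
        placement[name] = best_dev
        finish[name] = best_done
        busy_until[best_dev] = best_done
    return placement
```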
3.2.2 Cross-Device Communication
Once the node placement has been computed, the graph is partitioned into a set of subgraphs, one per device. Any cross-device edge from x to y is removed and replaced by an edge from x to a new Send node in x’s subgraph and an edge from a corresponding Receive node to y in y’s subgraph. See Figure 4 for an example of this graph transformation.
Figure 4: Before & after insertion of Send/Receive nodes
At runtime, the implementations of the Send and Receive nodes coordinate to transfer data across devices. This allows us to isolate all communication inside Send and Receive implementations, which simplifies the rest of the runtime.
When we insert Send and Receive nodes, we canonicalize all users of a particular tensor on a particular device to use a single Receive node, rather than one Receive node per downstream user on a particular device. This ensures that the data for the needed tensor is only transmitted once between a source device → destination device pair, and that memory for the tensor on the destination device is only allocated once, rather than multiple times (e.g., see nodes b and c in Figure 4).
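A sketch of this rewrite (with hypothetical node and edge representations, not the actual graph partitioner) might look like the following; note how the `recv_for` table canonicalizes all consumers of a tensor on a destination device onto one Receive node:

```python
def partition(edges, placement):
    """edges: (src_node, dst_node) pairs; placement: node -> device name.
    Returns a per-device edge list with Send/Receive nodes spliced in."""
    subgraphs = {dev: [] for dev in set(placement.values())}
    recv_for = {}                       # (src_node, dst_device) -> Receive node
    for src, dst in edges:
        s_dev, d_dev = placement[src], placement[dst]
        if s_dev == d_dev:              # same device: keep the edge unchanged
            subgraphs[s_dev].append((src, dst))
            continue
        key = (src, d_dev)
        if key not in recv_for:         # first consumer on d_dev: add one pair
            recv_for[key] = "Recv(%s@%s)" % (src, d_dev)
            subgraphs[s_dev].append((src, "Send(%s->%s)" % (src, d_dev)))
        subgraphs[d_dev].append((recv_for[key], dst))  # reuse the Receive node
    return subgraphs
```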
By handling communication in this manner, we also allow the scheduling of individual nodes of the graph on different devices to be decentralized into the workers: the Send and Receive nodes impart the necessary synchronization between different workers and devices, and the master only needs to issue a single Run request per graph execution to each worker that has any nodes for the graph, rather than being involved in the scheduling of every node or every cross-device communication. This makes the system much more scalable and allows much finer-granularity node executions than if the scheduling were forced to be done by the master.
3.3 Distributed Execution
Distributed execution of a graph is very similar to multi-device execution. After device placement, a subgraph is created per device. Send/Receive node pairs that communicate across worker processes use remote communication mechanisms such as TCP or RDMA to move data across machine boundaries.
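For illustration, the later open-source TF 1.x distributed runtime lets each worker process be started as a server roughly as follows (the default transport is gRPC over TCP; the host:port values are hypothetical):

```python
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# Each worker process runs one server; Send/Receive pairs whose endpoints sit
# on different workers then move tensors over the network transport.
server = tf.train.Server(cluster, job_name="worker", task_index=0)
server.join()
```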
Fault Tolerance
Failures in a distributed execution can be detected in a variety of places. The main ones we rely on are (a) an error in a communication between a Send and Receive node pair, and (b) periodic health-checks from the master process to every worker process.
When a failure is detected, the entire graph execution is aborted and restarted from scratch. Recall however that Variable nodes refer to tensors that persist across executions of the graph. We support consistent checkpointing and recovery of this state on a restart. In particular, each Variable node is connected to a Save node. These Save nodes are executed periodically, say once every N iterations, or once every N seconds. When they execute, the contents of the variables are written to persistent storage, e.g., a distributed file system. Similarly, each Variable is connected to a Restore node that is only enabled in the first iteration after a restart. See Section 4.2 for details on how some nodes can only be enabled on some executions of the graph.
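The later open-source API wraps this Save/Restore machinery in `tf.train.Saver`; a minimal sketch of the periodic-save and restore-on-restart pattern (the checkpoint path and the interval N are illustrative):

```python
import tensorflow as tf

w = tf.Variable(tf.zeros([10, 10]), name="w")   # persistent state
train_op = w.assign_add(tf.ones([10, 10]))      # stand-in training step
saver = tf.train.Saver()     # adds save/restore ops for all variables
N = 100                      # checkpoint every N iterations

with tf.Session() as sess:
    ckpt = tf.train.latest_checkpoint("/tmp/model")
    if ckpt:                                     # restarting after a failure
        saver.restore(sess, ckpt)                # restore Variable contents
    else:
        sess.run(tf.global_variables_initializer())
    for step in range(1000):
        sess.run(train_op)
        if step % N == 0:                        # periodic checkpoint
            saver.save(sess, "/tmp/model/ckpt", global_step=step)
```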