3 Implementation


The main components in a TensorFlow system are the client, which uses the Session interface to communicate with the master, and one or more worker processes, with each worker process responsible for arbitrating access to one or more computational devices (such as CPU cores or GPU cards) and for executing graph nodes on those devices as instructed by the master. We have both local and distributed implementations of the TensorFlow interface. The local implementation is used when the client, the master, and the worker all run on a single machine in the context of a single operating system process (possibly with multiple devices, if, for example, the machine has many GPU cards installed). The distributed implementation shares most of the code with the local implementation, but extends it with support for an environment where the client, the master, and the workers can all be in different processes on different machines. In our distributed environment, these different tasks are containers in jobs managed by a cluster scheduling system [51]. These two different modes are illustrated in Figure 3. Most of the rest of this section discusses issues that are common to both implementations, while Section 3.3 discusses some issues that are particular to the distributed implementation.


Devices


Devices are the computational heart of TensorFlow. Each worker is responsible for one or more devices, and each device has a device type and a name. Device names are composed of pieces that identify the device's type, the device's index within the worker, and, in our distributed setting, an identification of the job and task of the worker (or localhost for the case where the devices are local to the process). Example device names are "/job:localhost/device:cpu:0" or "/job:worker/task:17/device:gpu:3". We have implementations of our Device interface for CPUs and GPUs, and new device implementations for other device types can be provided via a registration mechanism. Each device object is responsible for managing allocation and deallocation of device memory, and for arranging for the execution of any kernels that are requested by higher levels in the TensorFlow implementation.
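The device-name structure above lends itself to simple parsing. The sketch below is illustrative only (not TensorFlow's own parser); the helper parse_device_name and the regular expression are assumptions based solely on the example names in the text.

```python
import re

# Hypothetical pattern covering the two example names shown above.
_DEVICE_RE = re.compile(
    r"^/job:(?P<job>[^/]+)"
    r"(?:/task:(?P<task>\d+))?"
    r"/device:(?P<type>[A-Za-z]+):(?P<index>\d+)$"
)

def parse_device_name(name):
    """Split a device name such as '/job:worker/task:17/device:gpu:3'
    into its job, task, device-type, and index components."""
    m = _DEVICE_RE.match(name)
    if m is None:
        raise ValueError("unrecognized device name: %r" % name)
    parts = m.groupdict()
    parts["task"] = int(parts["task"]) if parts["task"] is not None else None
    parts["index"] = int(parts["index"])
    return parts

print(parse_device_name("/job:localhost/device:cpu:0"))
print(parse_device_name("/job:worker/task:17/device:gpu:3"))
```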


Tensors


A tensor in our implementation is a typed, multi-dimensional array. We support a variety of tensor element types, including signed and unsigned integers ranging in size from 8 bits to 64 bits, IEEE float and double types, a complex number type, and a string type (an arbitrary byte array). Backing store of the appropriate size is managed by an allocator that is specific to the device on which the tensor resides. Tensor backing store buffers are reference counted and are deallocated when no references remain.
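To make the reference-counting scheme concrete, here is a minimal sketch; the Allocator and TensorBuffer classes are toy stand-ins, not TensorFlow's actual types. A buffer obtains its storage from a device-specific allocator and returns it once the last reference is dropped.

```python
class Allocator:
    """Toy device allocator; a real one would manage device memory."""
    def allocate(self, nbytes):
        return bytearray(nbytes)

    def deallocate(self, buf):
        pass  # nothing to do for host memory in this sketch

class TensorBuffer:
    """Reference-counted backing store for a tensor on one device."""
    def __init__(self, allocator, nbytes):
        self._allocator = allocator
        self._data = allocator.allocate(nbytes)
        self._refcount = 1

    def ref(self):
        self._refcount += 1
        return self

    def unref(self):
        self._refcount -= 1
        if self._refcount == 0:
            self._allocator.deallocate(self._data)
            self._data = None  # freed once no references remain
```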


3.1 Single-Device Execution


Let's first consider the simplest execution scenario: a single worker process with a single device. The nodes of the graph are executed in an order that respects the dependencies between nodes. In particular, we keep track of a count per node of the number of dependencies of that node that have not yet been executed. Once this count drops to zero, the node is eligible for execution and is added to a ready queue. The ready queue is processed in some unspecified order, delegating execution of the kernel for a node to the device object. When a node has finished executing, the counts of all nodes that depend on the completed node are decremented.
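A minimal sketch of this execution loop, assuming hypothetical node identifiers and a device object with a run_kernel method (not TensorFlow's internal types):

```python
from collections import deque

def execute_graph(nodes, deps, device):
    """nodes: iterable of node ids; deps: dict node -> list of input nodes."""
    dependents = {n: [] for n in nodes}
    pending = {}
    for n in nodes:
        pending[n] = len(deps.get(n, []))       # unexecuted-dependency count
        for d in deps.get(n, []):
            dependents[d].append(n)

    ready = deque(n for n in nodes if pending[n] == 0)
    while ready:
        node = ready.popleft()                  # order within the queue is unspecified
        device.run_kernel(node)                 # delegate kernel execution to the device
        for dep in dependents[node]:            # completing a node decrements its dependents
            pending[dep] -= 1
            if pending[dep] == 0:
                ready.append(dep)
```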


3.2 Multi-Device Execution


Once a system has multiple devices, there are two main complications: deciding which device to place the computation for each node in the graph, and then managing the required communication of data across device boundaries implied by these placement decisions. This subsection discusses these two issues.


3.2.1 Node Placement


Given a computation graph, one of the main responsibilities of the TensorFlow implementation is to map the computation onto the set of available devices. A simplified version of this algorithm is presented here. See Section 4.3 for extensions supported by this algorithm.

One input to the placement algorithm is a cost model, which contains estimates of the sizes (in bytes) of the input and output tensors for each graph node, along with estimates of the computation time required for each node when presented with its input tensors. This cost model is either statically estimated based on heuristics associated with different operation types, or is measured based on an actual set of placement decisions for earlier executions of the graph.

The placement algorithm first runs a simulated execution of the graph. The simulation is described below and ends up picking a device for each node in the graph using greedy heuristics. The node-to-device placement generated by this simulation is also used as the placement for the real execution.

The placement algorithm starts with the sources of the computation graph, and simulates the activity on each device in the system as it progresses. For each node that is reached in this traversal, the set of feasible devices is considered (a device may not be feasible if the device does not provide a kernel that implements the particular operation). For nodes with multiple feasible devices, the placement algorithm uses a greedy heuristic that examines the effects on the completion time of the node of placing the node on each possible device. This heuristic takes into account the estimated or measured execution time of the operation on that kind of device from the cost model, and also includes the costs of any communication that would be introduced in order to transmit inputs to this node from other devices to the considered device. The device where the node's operation would finish the soonest is selected as the device for that operation, and the placement process then continues onwards to make placement decisions for other nodes in the graph, including downstream nodes that are now ready for their own simulated execution. Section 4.3 describes some extensions that allow users to provide hints and partial constraints to guide the placement algorithm. The placement algorithm is an area of ongoing development within the system.
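The following is a sketch of the greedy simulation under stated assumptions: nodes are visited in topological order, every node has at least one feasible device, and compute_time and transfer_time stand in for the cost model's estimates. It is not the actual TensorFlow placement code.

```python
def place_nodes(nodes, inputs, feasible_devices, compute_time, transfer_time):
    """Assign each node to the feasible device on which it would finish soonest.

    nodes: node ids in topological order
    inputs: dict node -> list of input nodes
    feasible_devices: dict node -> devices providing a kernel for the node
    compute_time(node, device): estimated run time from the cost model
    transfer_time(src_dev, dst_dev, src_node): cost of moving the input tensor
    """
    placement = {}
    finish_time = {}     # simulated completion time of each placed node
    device_free = {}     # time at which each device finishes its last node

    for node in nodes:
        best = None
        for dev in feasible_devices[node]:
            # an input is available once it is produced and transferred to dev
            ready = 0.0
            for inp in inputs.get(node, []):
                arrival = finish_time[inp] + transfer_time(placement[inp], dev, inp)
                ready = max(ready, arrival)
            start = max(ready, device_free.get(dev, 0.0))
            finish = start + compute_time(node, dev)
            if best is None or finish < best[0]:
                best = (finish, dev)             # greedy: earliest completion wins
        finish_time[node], placement[node] = best
        device_free[best[1]] = best[0]
    return placement
```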


3.2.2 Cross-Device Communication


Once the node placement has been computed, the graph is partitioned into a set of subgraphs, one per device. Any cross-device edge from x to y is removed and replaced by an edge from x to a new Send node in x’s subgraph and an edge from a corresponding Receive node to y in y’s subgraph. See Figure 4 for an example of this graph transformation.


Figure 4: Before & after insertion of Send/Receive nodes


At runtime, the implementations of the Send and Receive nodes coordinate to transfer data across devices. This allows us to isolate all communication inside Send and Receive implementations, which simplifies the rest of the runtime.

When we insert Send and Receive nodes, we canonicalize all users of a particular tensor on a particular device to use a single Receive node, rather than one Receive node per downstream user on a particular device. This ensures that the data for the needed tensor is only transmitted once between a source device → destination device pair, and that memory for the tensor on the destination device is only allocated once, rather than multiple times (e.g., see nodes b and c in Figure 4).
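A sketch of this partitioning and canonicalization step, using an assumed edge-list graph representation rather than TensorFlow's internal one:

```python
def partition_graph(edges, placement):
    """edges: list of (src, dst) node pairs; placement: dict node -> device.

    Returns per-device edge lists with Send/Receive nodes inserted."""
    subgraphs = {dev: [] for dev in set(placement.values())}
    recv_cache = {}  # (src_node, dst_device) -> name of the shared Receive node

    for src, dst in edges:
        src_dev, dst_dev = placement[src], placement[dst]
        if src_dev == dst_dev:
            subgraphs[src_dev].append((src, dst))
            continue
        key = (src, dst_dev)
        if key not in recv_cache:
            send = "send_%s_to_%s" % (src, dst_dev)
            recv = "recv_%s_on_%s" % (src, dst_dev)
            recv_cache[key] = recv
            subgraphs[src_dev].append((src, send))   # producer feeds a Send node
            # the Send/Receive pair is the only place the tensor crosses devices
        # every consumer on dst_dev shares the single canonical Receive node
        subgraphs[dst_dev].append((recv_cache[key], dst))
    return subgraphs
```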

By handling communication in this manner, we also allow the scheduling of individual nodes of the graph on different devices to be decentralized into the workers: the Send and Receive nodes impart the necessary synchronization between different workers and devices, and the master only needs to issue a single Run request per graph execution to each worker that has any nodes for the graph, rather than being involved in the scheduling of every node or every cross-device communication. This makes the system much more scalable and allows much finer-granularity node executions than if the scheduling were forced to be done by the master.


3.3 Distributed Execution


Distributed execution of a graph is very similar to multi-device execution. After device placement, a subgraph is created per device. Send/Receive node pairs that communicate across worker processes use remote communication mechanisms such as TCP or RDMA to move data across machine boundaries.


Fault Tolerance


Failures in a distributed execution can be detected in a variety of places. The main ones we rely on are (a) an error in a communication between a Send and Receive node pair, and (b) periodic health-checks from the master process to every worker process.

When a failure is detected, the entire graph execution is aborted and restarted from scratch. Recall however that Variable nodes refer to tensors that persist across executions of the graph. We support consistent checkpointing and recovery of this state on a restart. In particular, each Variable node is connected to a Save node. These Save nodes are executed periodically, say once every N iterations, or once every N seconds. When they execute, the contents of the variables are written to persistent storage, e.g., a distributed file system. Similarly, each Variable is connected to a Restore node that is only enabled in the first iteration after a restart. See Section 4.2 for details on how some nodes can only be enabled on some executions of the graph.
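From the user's side, this pattern is usually driven through the public Saver API. The sketch below assumes the TensorFlow 1.x-era API; the checkpoint directory /tmp/model, the variables, and the every-100-steps schedule are illustrative choices, and the Saver is, roughly speaking, what attaches save and restore ops of the kind described above to the graph.

```python
import tensorflow as tf  # assumes the TensorFlow 1.x-era graph/session API

w = tf.Variable(tf.zeros([10]), name="w")
step = tf.Variable(0, name="step", trainable=False)
increment = tf.assign_add(step, 1)

saver = tf.train.Saver()
with tf.Session() as sess:
    ckpt = tf.train.latest_checkpoint("/tmp/model")
    if ckpt:                                   # restart: restore persisted state
        saver.restore(sess, ckpt)
    else:
        sess.run(tf.global_variables_initializer())
    for i in range(1000):
        sess.run(increment)                    # stand-in for one training step
        if i % 100 == 0:                       # periodic Save, every N iterations
            saver.save(sess, "/tmp/model/ckpt", global_step=step)
```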


2 Programming Model and Basic Concepts


A TensorFlow computation is described by a directed graph, which is composed of a set of nodes. The graph represents a dataflow computation, with extensions for allowing some kinds of nodes to maintain and update persistent state and for branching and looping control structures within the graph in a manner similar to Naiad [36]. Clients typically construct a computational graph using one of the supported frontend languages (C++ or Python). An example fragment to construct and then execute a TensorFlow graph using the Python front end is shown in Figure 1, and the resulting computation graph in Figure 2. 
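Figure 1 itself is not reproduced in this excerpt; the fragment below is a small self-contained sketch in the same spirit, assuming the TensorFlow 1.x-era Python API (the particular tensors W, b, x, and cost are illustrative, not the figure's exact code).

```python
import numpy as np
import tensorflow as tf  # assumes the 1.x-era graph/session API

# Build the graph: a single ReLU layer followed by a scalar cost.
b = tf.Variable(tf.zeros([100]))                       # 100-d bias vector
W = tf.Variable(tf.random_uniform([784, 100], -1, 1))  # 784x100 weight matrix
x = tf.placeholder(tf.float32, shape=[None, 784], name="x")
relu = tf.nn.relu(tf.matmul(x, W) + b)                 # ReLU(xW + b)
cost = tf.reduce_sum(relu)                             # stand-in for a real loss

# Execute the graph repeatedly in a session, feeding inputs and fetching cost.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(10):
        batch = np.random.rand(8, 784).astype(np.float32)
        result = sess.run(cost, feed_dict={x: batch})
        print(step, result)
```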


In a TensorFlow graph, each node has zero or more in-puts and zero or more outputs, and represents the instantiation of an operation. Values that flow along normal edges in the graph (from outputs to inputs) are tensors, arbitrary dimensionality arrays where the underlying element type is specified or inferred at graph-construction time. Special edges, called control dependencies, can also exist in the graph: no data flows along such edges, but they indicate that the source node for the control dependence must finish executing before the destination node for the control dependence starts executing. Since our model includes mutable state, control dependencies can be used directly by clients to enforce happens before relationships. Our implementation also sometimes inserts control dependencies to enforce orderings between otherwise independent operations as a way of, for example, controlling the peak memory usage. 
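A small sketch of a client-specified control dependency, again assuming the 1.x-era Python API: the assignment must finish before the dependent value is computed, even though no data flows between them.

```python
import tensorflow as tf  # assumes the TensorFlow 1.x-era graph/session API

counter = tf.Variable(0, name="counter")
increment = tf.assign_add(counter, 1)

x = tf.constant(3.0)
with tf.control_dependencies([increment]):   # control edge: no data flows along it
    y = tf.identity(x * 2.0)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(y))        # computing y forces the increment to run first
    print(sess.run(counter))  # -> 1
```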



Operations and Kernels


An operation has a name and represents an abstract computation (e.g., “matrix multiply”, or “add”). An operation can have attributes, and all attributes must be provided or inferred at graph-construction time in order to instantiate a node to perform the operation. One common use of attributes is to make operations polymorphic over different tensor element types (e.g., add of two tensors of type float versus add of two tensors of type int32). A kernel is a particular implementation of an operation that can be run on a particular type of device (e.g., CPU or GPU). A TensorFlow binary defines the sets of operations and kernels available via a registration mechanism, and this set can be extended by linking in additional operation and/or kernel definitions/registrations. Table 1 shows some of the kinds of operations built into the core TensorFlow library. 
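The registration idea can be illustrated with a generic registry keyed by (operation, device type). This is an illustrative sketch only, not TensorFlow's actual registration API, which is defined in C++.

```python
KERNELS = {}

def register_kernel(op_name, device_type):
    """Register a kernel implementation for (operation, device type)."""
    def decorator(fn):
        KERNELS[(op_name, device_type)] = fn
        return fn
    return decorator

@register_kernel("Add", "CPU")
def add_cpu(a, b):
    return a + b

@register_kernel("Add", "GPU")
def add_gpu(a, b):
    # a real GPU kernel would launch device code; this is a placeholder
    return a + b

def lookup_kernel(op_name, device_type):
    return KERNELS[(op_name, device_type)]

print(lookup_kernel("Add", "CPU")(2, 3))  # -> 5
```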



Sessions


Client programs interact with the TensorFlow system by creating a Session. To create a computation graph, the Session interface supports an Extend method to augment the current graph managed by the session with additional nodes and edges (the initial graph when a session is created is empty). The other primary operation supported by the session interface is Run, which takes a set of output names that need to be computed, as well as an optional set of tensors to be fed into the graph in place of certain outputs of nodes. Using the arguments to Run, the TensorFlow implementation can compute the transitive closure of all nodes that must be executed in order to compute the outputs that were requested, and can then arrange to execute the appropriate nodes in an order that respects their dependencies (as described in more detail in Section 3.1). Most of our uses of TensorFlow set up a Session with a graph once, and then execute the full graph or a few distinct subgraphs thousands or millions of times via Run calls.
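The pruning implied by a Run call can be sketched as a reverse reachability computation from the fetched nodes, stopping at fed nodes; the graph representation here is an assumption for illustration, not TensorFlow's internal data structure.

```python
def nodes_to_execute(inputs, fetches, feeds=()):
    """inputs: dict node -> list of input nodes; fetches: requested outputs."""
    feeds = set(feeds)
    needed, stack = set(), list(fetches)
    while stack:
        node = stack.pop()
        if node in needed or node in feeds:
            continue                          # fed nodes need not be recomputed
        needed.add(node)
        stack.extend(inputs.get(node, []))    # walk input edges backwards
    return needed

graph = {"relu": ["matmul", "b"], "matmul": ["W", "x"], "cost": ["relu"]}
print(nodes_to_execute(graph, fetches=["cost"], feeds=["x"]))
```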



Variables


In most computations a graph is executed multiple times. Most tensors do not survive past a single execution of the graph. However, a Variable is a special kind of operation that returns a handle to a persistent mutable tensor that survives across executions of a graph. Handles to these persistent mutable tensors can be passed to a handful of special operations, such as Assign and AssignAdd (equivalent to +=) that mutate the referenced tensor. For machine learning applications of TensorFlow, the parameters of the model are typically stored in tensors held in variables, and are updated as part of the Run of the training graph for the model. 
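A minimal sketch of a Variable mutated across Run calls with AssignAdd, assuming the 1.x-era Python API; in a real training graph the update would instead come from an optimizer applied to gradients.

```python
import tensorflow as tf  # assumes the TensorFlow 1.x-era graph/session API

params = tf.Variable(tf.zeros([3]), name="params")
update = tf.assign_add(params, tf.ones([3]))   # params += 1, element-wise

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(3):
        sess.run(update)                       # state survives across Run calls
    print(sess.run(params))                    # -> [3. 3. 3.]
```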

Abstract


TensorFlow [1] is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. The system is flexible and can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models, and it has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields, including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery. This paper describes the TensorFlow interface and an implementation of that interface that we have built at Google. The TensorFlow API and a reference implementation were released as an open-source package under the Apache 2.0 license in November, 2015 and are available at www.tensorflow.org.


TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed with TensorFlow can be used as-is on a wide range of systems, from smartphones and tablets up to thousands of computational devices such as GPU cards. The system is flexible: beyond training and inference for deep neural network models, it can be used to apply and study machine learning systems in fields such as speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery. This paper describes the TensorFlow interface and the implementation of that interface built at Google. The TensorFlow API and a reference implementation were released in November 2015 as an open-source package under the Apache 2.0 license and are available at www.tensorflow.org.


1 Introduction


The Google Brain project started in 2011 to explore the use of very-large-scale deep neural networks, both for research and for use in Google's products. As part of the early work in this project, we built DistBelief, our first-generation scalable distributed training and inference system [14], and this system has served us well. We and others at Google have performed a wide variety of research using DistBelief including work on unsupervised learning [31], language representation [35, 52], models for image classification and object detection [16, 48], video classification [27], speech recognition [56, 21, 20], sequence prediction [47], move selection for Go [34], pedestrian detection [2], reinforcement learning [38], and other areas [17, 5]. In addition, often in close collaboration with the Google Brain team, more than 50 teams at Google and other Alphabet companies have deployed deep neural networks using DistBelief in a wide variety of products, including Google Search [11], our advertising products, our speech recognition systems [50, 6, 46], Google Photos [43], Google Maps and StreetView [19], Google Translate [18], YouTube, and many others.


The Google Brain project began in 2011 with research on very-large-scale deep neural networks, for both research and use in Google's products. Early in the project we built DistBelief, our first-generation distributed training and inference system, and it worked well. DistBelief was subsequently used in areas such as unsupervised learning, language representation, models for image classification and object detection, video classification, speech recognition, sequence prediction, move selection for Go, pedestrian detection, and reinforcement learning. In addition, in collaboration with the Google Brain team, more than 50 teams at Google and other Alphabet companies began applying it in products and services such as Google Search, our advertising products, our speech recognition systems, Google Photos, Google Maps and StreetView, Google Translate, YouTube, and many others.


Based on our experience with DistBelief and a more complete understanding of the desirable system properties and requirements for training and using neural networks, we have built TensorFlow, our second-generation system for the implementation and deployment of large-scale machine learning models. TensorFlow takes computations described using a dataflow-like model and maps them onto a wide variety of different hardware platforms, ranging from running inference on mobile device platforms such as Android and iOS to modest-sized training and inference systems using single machines containing one or many GPU cards to large-scale training systems running on hundreds of specialized machines with thousands of GPUs. Having a single system that can span such a broad range of platforms significantly simplifies the real-world use of machine learning systems, as we have found that having separate systems for large-scale training and small-scale deployment leads to significant maintenance burdens and leaky abstractions. TensorFlow computations are expressed as stateful dataflow graphs (described in more detail in Section 2), and we have focused on making the system both flexible enough for quickly experimenting with new models for research purposes and sufficiently high performance and robust for production training and deployment of machine learning models. For scaling neural network training to larger deployments, TensorFlow allows clients to easily express various kinds of parallelism through replication and parallel execution of a core model dataflow graph, with many different computational devices all collaborating to update a set of shared parameters or other state. Modest changes in the description of the computation allow a wide variety of different approaches to parallelism to be achieved and tried with low effort [14, 29, 42]. Some TensorFlow uses allow some flexibility in terms of the consistency of parameter updates, and we can easily express and take advantage of these relaxed synchronization requirements in some of our larger deployments. Compared to DistBelief, TensorFlow's programming model is more flexible, its performance is significantly better, and it supports training and using a broader range of models on a wider variety of heterogeneous hardware platforms.


Based on our experience with DistBelief and on the system properties and requirements for training neural networks at scale, we built TensorFlow, our second-generation system for implementing large-scale machine learning models. TensorFlow takes computations described as dataflow-like models and maps them onto a wide variety of hardware platforms, from mobile device platforms such as Android and iOS, to single machines with one or many GPU cards, to specialized large-scale training systems with thousands of GPUs. Because maintaining separate systems for large-scale training and small-scale deployment incurs significant maintenance cost, a single system that spans these platforms makes machine learning systems considerably simpler to use. TensorFlow computations are expressed as dataflow graphs, and the system is designed to be flexible enough for quick research experimentation while remaining fast and robust enough for production training and deployment of machine learning models. To scale neural network training further, TensorFlow lets clients easily express parallelism through replication and parallel execution of a core model dataflow graph, with many computational devices collaborating to update shared parameters and other state. Modest changes to the description of the computation allow many different approaches to parallelism to be tried with little effort. Some TensorFlow uses allow flexibility in the consistency of parameter updates, which helps with synchronization in our larger deployments. Compared with DistBelief, TensorFlow's programming model is more flexible, its performance is significantly better, and it supports training a broader range of models on a wider variety of hardware platforms.


Dozens of our internal clients of DistBelief have already switched to TensorFlow. These clients rely on TensorFlow for research and production, with tasks as diverse as running inference for computer vision models on mobile phones to large-scale training of deep neural networks with hundreds of billions of parameters on hundreds of billions of example records using many hundreds of machines [11, 47, 48, 18, 53, 41]. Although these applications have concentrated on machine learning and deep neural networks in particular, we expect that TensorFlow's abstractions will be useful in a variety of other domains, including other kinds of machine learning algorithms, and possibly other kinds of numerical computations. We have open-sourced the TensorFlow API and a reference implementation under the Apache 2.0 license in November, 2015, available at www.tensorflow.org.


Dozens of DistBelief's internal clients have already switched to TensorFlow. These clients rely on TensorFlow for research and production, with tasks ranging from running computer vision inference on mobile phones to large-scale training of deep neural networks with hundreds of billions of parameters on hundreds of machines. Although these applications concentrate on machine learning and deep neural networks in particular, we expect TensorFlow to be useful in a variety of other areas, such as other machine learning algorithms and numerical computation. The TensorFlow API has been open-sourced, and a reference implementation is available at www.tensorflow.org under the Apache 2.0 license.


The rest of this paper describes TensorFlow in more detail. Section 2 describes the programming model and basic concepts of the TensorFlow interface, and Section 3 describes both our single machine and distributed implementations. Section 4 describes several extensions to the basic programming model, and Section 5 describes several optimizations to the basic implementations. Section 6 describes some of our experiences in using TensorFlow, Section 7 describes several programming idioms we have found helpful when using TensorFlow, and Section 9 describes several auxiliary tools we have built around the core TensorFlow system. Sections 10 and 11 discuss future and related work, respectively, and Section 12 offers concluding thoughts.


The remainder of this paper looks at TensorFlow in more detail. Section 2 describes the programming model and the basic concepts of the TensorFlow interface, and Section 3 covers the single-machine and distributed implementations. Section 4 extends the basic programming model, and Section 5 describes optimizations of the basic implementation. Section 6 describes our experience using TensorFlow, Section 7 looks at programming idioms used with TensorFlow, and Section 9 covers auxiliary tools built around the core TensorFlow system. Sections 10 and 11 discuss future and related work, and Section 12 wraps everything up.


While the final goal of a statistical machine translation system is to create a model of the target sentence E given the source sentence F, $P(E \mid F)$, in this chapter we will take a step back, and attempt to create a language model of only the target sentence $P(E)$. Basically, this model allows us to do two things that are of practical use.


Assess naturalness: Given a sentence E, this can tell us, does this look like an actual, natural sentence in the target language? If we can learn a model to tell us this, we can use it to assess the fluency of sentences generated by an automated system to improve its results. It could also be used to evaluate sentences generated by a human for purposes of grammar checking or error correction.


Generate text: Language models can also be used to randomly generate text by sampling a sentence $E'$ from the target distribution: $E' \sim P(E)$. Randomly generating samples from a language model can be interesting in itself (we can see what the model "thinks" is a natural-looking sentence), but it will be more practically useful in the context of the neural translation models described in the following chapters.
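A sketch of such sampling: draw one word at a time from the conditional next-word distribution until the end-of-sentence symbol "</s>" is produced. The function cond_prob and the toy distribution are hypothetical stand-ins for a trained language model.

```python
import random

def sample_sentence(cond_prob, max_len=50):
    """Ancestral sampling: repeatedly sample e_t ~ P(e_t | e_1^{t-1})."""
    sentence = []
    for _ in range(max_len):
        dist = cond_prob(tuple(sentence))            # {word: probability}
        words, probs = zip(*dist.items())
        word = random.choices(words, weights=probs, k=1)[0]
        if word == "</s>":
            break
        sentence.append(word)
    return sentence

# Toy distribution for illustration only.
def toy_cond_prob(prefix):
    table = {
        (): {"she": 0.6, "he": 0.4},
        ("she",): {"went": 1.0},
        ("she", "went"): {"home": 1.0},
        ("she", "went", "home"): {"</s>": 1.0},
        ("he",): {"went": 1.0},
        ("he", "went"): {"home": 1.0},
        ("he", "went", "home"): {"</s>": 1.0},
    }
    return table[prefix]

print(sample_sentence(toy_cond_prob))
```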


In the following sections, we'll cover a few methods used to calculate this probability P(E).



3.1 Word-by-word Computation of Probabilities


As mentioned above, we are interested in calculating the probability of a sentence $E = e_1^T$. Formally, this can be expressed as


$P(E) = P(|E| = T, e_1^T)$   (3)


the joint probability that the length of the sentence is $T$ ($|E| = T$), that the identity of the first word in the sentence is $e_1$, the identity of the second word in the sentence is $e_2$, up until the last word in the sentence being $e_T$. Unfortunately, directly creating a model of this probability distribution is not straightforward, as the length of the sequence $T$ is not determined in advance, and there are a large number of possible combinations of words.


As a way to make things easier, it is common to re-write the probability of the full sentence as the product of single-word probabilities. This takes advantage of the fact that a joint probability (for example $P(e_1, e_2, e_3)$) can be calculated by multiplying together conditional probabilities for each of its elements. In the example, this means that $P(e_1, e_2, e_3) = P(e_1) P(e_2 \mid e_1) P(e_3 \mid e_1, e_2)$.

Figure 2 shows an example of this incremental calculation of probabilities for the sentence "she went home". Here, in addition to the actual words in the sentence, we have introduced an implicit sentence end ("</s>") symbol, which we will indicate when we have terminated the sentence. Stepping through the equation in order, this means we first calculate the probability of "she" coming at the beginning of the sentence, then the probability of "went" coming next in a sentence starting with "she", the probability of "home" coming after the sentence prefix "she went", and then finally the sentence end symbol "</s>" after "she went home". More generally, we can express this as the following equation:


$P(E) = \prod_{t=1}^{T+1} P(e_t \mid e_1^{t-1})$   (4)


where $e_{T+1}$ = "</s>". So coming back to the sentence end symbol "</s>", the reason why we introduce this symbol is because it allows us to know when the sentence ends. In other words, by examining the position of the "</s>" symbol, we can determine the $|E| = T$ term in our original LM joint probability in Equation 3. In this example, when we have "</s>" as the 4th word in the sentence, we know we're done and our final sentence length is 3.

Once we have the formulation in Equation 4, the problem of language modeling now becomes a problem of calculating the next word given the previous words $P(e_t \mid e_1^{t-1})$. This is much more manageable than calculating the probability for the whole sentence, as we now have a fixed set of items that we are looking to calculate probabilities for. The next couple of sections will show a few ways to do so.
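A sketch of this word-by-word computation, with a hypothetical cond_prob(prefix, word) standing in for the model $P(e_t \mid e_1^{t-1})$ and toy probability values chosen only to make the arithmetic easy to follow:

```python
def sentence_probability(cond_prob, words):
    """Product of conditional next-word probabilities, including "</s>"."""
    prob = 1.0
    prefix = []
    for word in list(words) + ["</s>"]:
        prob *= cond_prob(tuple(prefix), word)   # P(word | prefix)
        prefix.append(word)
    return prob

# Toy conditional probabilities for "she went home </s>", for illustration only.
toy = {
    ((), "she"): 0.04,
    (("she",), "went"): 0.10,
    (("she", "went"), "home"): 0.20,
    (("she", "went", "home"), "</s>"): 0.50,
}
print(sentence_probability(lambda p, w: toy[(p, w)], ["she", "went", "home"]))
# -> 0.04 * 0.10 * 0.20 * 0.50 = 0.0004
```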


3.2 Count-based n-gram Language Models


The first way to calculate probabilities is simple: prepare a set of training data from which we can count word strings, count up the number of times we have seen a particular string of words, and divide it by the number of times we have seen the context. This simple method can be expressed by the equation below, with an example shown in Figure 3.
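In symbols, the estimate described here is the maximum-likelihood estimate $P_{\mathrm{ML}}(e_t \mid e_1^{t-1}) = c(e_1^t) / c(e_1^{t-1})$, where $c(\cdot)$ counts occurrences in the training data. The following is a sketch of this count-based estimate for the full-history case (an n-gram model would simply truncate the context to the previous $n-1$ words); the toy corpus is illustrative.

```python
from collections import defaultdict

def train_counts(corpus):
    """corpus: iterable of tokenized sentences (lists of words)."""
    counts = defaultdict(int)
    for sentence in corpus:
        words = list(sentence) + ["</s>"]
        for t in range(len(words) + 1):
            counts[tuple(words[:t])] += 1   # how many sentences start with this prefix
    return counts

def ml_prob(counts, context, word):
    """P(word | context) = c(context + word) / c(context)."""
    context = tuple(context)
    return counts[context + (word,)] / counts[context] if counts[context] else 0.0

corpus = [["she", "went", "home"], ["she", "went", "to", "school"]]
counts = train_counts(corpus)
print(ml_prob(counts, ["she"], "went"))          # -> 1.0
print(ml_prob(counts, ["she", "went"], "home"))  # -> 0.5
```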

2 Statistical MT Preliminaries

First, before talking about any specific models, this chapter describes the overall framework of statistical machine translation (SMT) [16] more formally.

First, we define our task of machine translation as translating a source sentence $F = f_1, \ldots, f_J = f_1^{|F|}$ into a target sentence $E = e_1, \ldots, e_I = e_1^{|E|}$. Thus, any type of translation system can be defined as a function

$\hat{E} = \mathrm{mt}(F)$   (1)


which returns a translation hypothesis $\hat{E}$ given a source sentence $F$ as input.

Statistical machine translation systems are systems that perform translation by creating a probabilistic model for the probability of $E$ given $F$, $P(E \mid F; \theta)$, and finding the target sentence that maximizes this probability:


$\hat{E} = \mathrm{argmax}_E \, P(E \mid F; \theta)$   (2)


where $\theta$ are the parameters of the model specifying the probability distribution. The parameters $\theta$ are learned from data consisting of aligned sentences in the source and target languages, which are called parallel corpora in technical terminology. Within this framework, there are three major problems that we need to handle appropriately in order to create a good translation system:


Modeling: First, we need to decide what our model $P(E \mid F; \theta)$ will look like. What parameters will it have, and how will the parameters specify a probability distribution?


Learning: Next, we need a method to learn appropriate values for the parameters $\theta$ from training data.


Search: Finally, we need to solve the problem of finding the most probable sentence (solving the "argmax"). This process of searching for the best hypothesis is often called decoding.


The remainder of the material here will focus on solving these problems.



Neural Machine Translation and Sequence-to-sequence Models:

A Tutorial


Graham Neubig

Language Technologies Institute, Carnegie Mellon University


1 Introduction


This tutorial introduces a new and powerful set of techniques variously called "neural machine translation" or "neural sequence-to-sequence models". These techniques have been used in a number of tasks regarding the handling of human language, and can be a powerful tool in the toolbox of anyone who wants to model sequential data of some sort. The tutorial assumes that the reader knows the basics of math and programming, but does not assume any particular experience with neural networks or natural language processing. It attempts to explain the intuition behind the various methods covered, then delves into them with enough mathematical detail to understand them concretely, and culminates with a suggestion for an implementation exercise, where readers can test that they understood the content in practice.


This introduction presents a new and powerful set of techniques called neural machine translation and neural sequence-to-sequence models. These techniques are used in a variety of tasks that involve handling human language, and can be a useful tool for anyone who needs to model sequential data. The tutorial assumes the reader knows basic math and programming, but assumes no particular experience with neural networks or natural language processing. It looks at the intuition behind the various methods, then at enough of the mathematics to understand them concretely, and finishes with an exercise in which readers can check in practice whether they have understood the content.


1.1 Background


Before getting into the details, it might be worth describing each of the terms that appear in the title "Neural Machine Translation and Sequence-to-sequence Models". Machine translation is the technology used to translate between human languages. Think of the universal translation device showing up in sci-fi movies to allow you to communicate effortlessly with those that speak a different language, or any of the plethora of online translation web sites that you can use to assimilate content that is not in your native language. This ability to remove language barriers, needless to say, has the potential to be very useful, and thus machine translation technology has been researched from shortly after the advent of digital computing. We call the language input to the machine translation system the source language, and call the output language the target language. Thus, machine translation can be described as the task of converting a sequence of words in the source, and converting into a sequence of words in the target. The goal of the machine translation practitioner is to come up with an effective model that allows us to perform this conversion accurately over a broad variety of languages and content.

The second part of the title, sequence-to-sequence models, refers to the broader class of models that include all models that map one sequence to another. This, of course, includes machine translation, but it also covers a broad spectrum of other methods used to handle other tasks as shown in Figure 1. In fact, if we think of a computer program as something that takes in a sequence of input bits, then outputs a sequence of output bits, we could say that every single program is a sequence-to-sequence model expressing some behavior (although of course in many cases this is not the most natural or intuitive way to express things).


Before going into detail, let's define the terms that appear in the title "Neural Machine Translation and Sequence-to-sequence Models: A Tutorial". Machine translation is the technology used to translate between languages. Think of the translators that appear in science-fiction movies letting people who speak different languages converse, or the online translation sites used to understand content not written in your native language. This ability to break down language barriers naturally has great potential, and so the technology has been researched since shortly after the beginning of digital computing.

The language fed into a machine translation system is called the source language, and the language it outputs is called the target language. Machine translation is therefore the process of converting a word sequence in the source language into a word sequence in the target language. The goal of a machine translation system is to build a model that performs this conversion effectively across a wide variety of content and languages.

The second part, sequence-to-sequence models, refers broadly to any model that maps one sequence to another. This of course includes machine translation, but it also covers the wide range of other tasks shown in Figure 1. In fact, if we think of a computer program as something that takes in a sequence of input bits and outputs a sequence of output bits, then every program can be described as a sequence-to-sequence model (although in many cases this is not the most natural or intuitive way to describe it).


Figure 1: Examples of sequence-to-sequence modeling tasks


The motivation for using machine translation as a representative of this larger class of sequence-to-sequence models is many-fold:


1. Machine translation is a widely-recognized and useful instance of sequence-to-sequence models, and allows us to use many intuitive examples demonstrating the difficulties encountered when trying to tackle these problems.


2. Machine translation is often one of the main driving tasks behind the development of new models, and thus these models tend to be tailored to MT first, then applied to other tasks.


3. However, there are also cases where MT has learned from other tasks as well, and introducing these tasks helps explain the techniques used in MT as well.


There are several reasons for using machine translation as the representative of this broad class of sequence-to-sequence models:


1. Machine translation is a widely known and useful instance of sequence-to-sequence models, and it provides intuitive examples of the difficulties encountered when tackling these problems.

2. Machine translation is often one of the main driving tasks behind the development of new models, so models tend to be tailored to MT first and then applied to other areas.

3. However, there are also cases where MT has learned from other tasks, and these tasks also help explain the techniques used in MT.



1.2 Organization


This tutorial first starts out with a general mathematical definition of statistical techniques for machine translation in Section 2. The rest of this tutorial will sequentially describe techniques of increasing complexity, leading up to attentional models, which represent the current state of the art in the field.

First, Sections 3-6 focus on language models, which calculate the probability of a target sequence of interest. These models are not capable of performing translation or sequence transduction, but will provide useful preliminaries to understand sequence-to-sequence models.

 

  • Section 3 describes n-gram language models, simple models that calculate the probability of words based on their counts in a set of data. It also describes how we evaluate how well these models are doing using measures such as perplexity.


  • Section 4 describes log-linear language models, models that instead calculate the probability of the next word based on features of the context. It describes how we can learn the parameters of the models through stochastic gradient descent: calculating derivatives and gradually updating the parameters to increase the likelihood of the observed data.


  • Section 5 introduces the concept of neural networks, which allow us to combine together multiple pieces of information more easily than log-linear models, resulting in increased modeling accuracy. It gives an example of feed-forward neural language models, which calculate the probability of the next word based on a few previous words using neural networks.


  • Section 6 introduces recurrent neural networks, a variety of neural networks that have mechanisms to allow them to remember information over multiple time steps. These lead to recurrent neural network language models, which allow for the handling of long-term dependencies that are useful when modeling language or other sequential data.


Finally, Sections 7 and 8 describe actual sequence-to-sequence models capable of performing machine translation or other tasks.


  • Section 7 describes encoder-decoder models, which use a recurrent neural network to encode the target sequence into a vector of numbers, and another network to decode this vector of numbers into an output sentence. It also describes search algorithms to generate output sequences based on this model.


  • Section 8 describes attention, a method that allows the model to focus on different parts of the input sentence while generating translations. This allows for a more efficient and intuitive method of representing sentences, and is often more effective than its simpler encoder-decoder counterpart.


Section 2 begins with the general mathematical definitions of the statistical techniques behind machine translation. After that, we work through increasingly complex techniques one by one, ending with the current state-of-the-art models in the field.

First, Sections 3-6 cover language models, which compute the probability of a target-language sequence. These models cannot perform translation or sequence transduction, but they provide useful groundwork for understanding sequence-to-sequence models.


  • Section 3 covers n-gram language models, which compute word probabilities from counts in a dataset. It also covers how to evaluate how well such models are doing, using measures such as perplexity.


  • Section 4 covers log-linear language models, which instead compute the probability of the next word from features of the context. It also looks at learning the model parameters via stochastic gradient descent (computing derivatives and repeatedly updating the parameters to increase the likelihood of the observed data).


  • Section 5 covers neural networks, which make it easier to combine multiple pieces of information than log-linear models do, thereby increasing modeling accuracy. It includes an example of a feed-forward neural language model that computes the probability of the next word from the previous words using a neural network.


  • Section 6 covers recurrent neural networks, neural networks with a mechanism for remembering information over many time steps. These lead to recurrent neural network language models, which can handle the long-term dependencies that are useful when modeling language or other sequential data.


Finally, Sections 7 and 8 cover sequence-to-sequence models that actually perform machine translation and other tasks.


  • Section 7 covers encoder-decoder models, which use a recurrent neural network to encode the target sentence into a vector of numbers and another network to decode that vector back into an output sentence. It also covers search algorithms for generating output sentences with this model.


  • Section 8 covers attention, a method that lets the model look at different parts of the input sentence while generating a translation. This gives a more efficient and intuitive way of representing sentences, and is often more effective than the simpler encoder-decoder counterpart.




'Paper > Neural Machine Translation' 카테고리의 다른 글

[번역]n-gram Language Models  (0) 2018.11.01
[번역]Statistical MT Preliminaries  (0) 2018.11.01

+ Recent posts