## Bracketing Encodings for 2-Planar Dependency Parsing
Michalina Strzyz
David Vilares
Carlos G´ omez-Rodr´ Ĺguez
Universidade da CoruË na, CITIC
FASTPARSE Lab, LyS Research Group
Departamento de Ciencias de la Computaci´ on y Tecnolog´ Äąas de la Informaci´ on Campus de ElviË na, s/n, 15071 A CoruË na, Spain
{ michalina.strzyz,david.vilares,carlos.gomez } @udc.es
## Abstract
We present a bracketing-based encoding that can be used to represent any 2-planar dependency tree over a sentence of length n as a sequence of n labels, hence providing almost total coverage of crossing arcs in sequence labeling parsing. First, we show that existing bracketing encodings for parsing as labeling can only handle a very mild extension of projective trees. Second, we overcome this limitation by taking into account the well-known property of 2-planarity, which is present in the vast majority of dependency syntactic structures in treebanks, i.e., the arcs of a dependency tree can be split into two planes such that arcs in a given plane do not cross. We take advantage of this property to design a method that balances the brackets and that encodes the arcs belonging to each of those planes, allowing for almost unrestricted non-projectivity ( âź 99 . 9% coverage) in sequence labeling parsing. The experiments show that our linearizations improve over the accuracy of the original bracketing encoding in highly non-projective treebanks (on average by 0 . 4 LAS), while achieving a similar speed. Also, they are especially suitable when PoS tags are not used as input parameters to the models.
## 1 Introduction
In the last few years, approaches that cast syntactic parsing as the task of finding a sequence have gained traction for both dependency and constituency parsing. In sequence-to-sequence (seq2seq) parsing (Vinyals et al., 2015; Li et al., 2018), parse trees are represented as arbitrary-length sequences, where the attention mechanism can be seen as an abstraction of the stack and the buffer in transition-based systems that decides what words are relevant to make a decision at a given time step. In sequence labeling parsing (G´ omez-Rodr´ Ĺguez and Vilares, 2018; Strzyz et al., 2019b), the tree for a sentence of length n is represented as a sequence of n labels, one per word, so the parsing process is word-synchronous (Kitaev and Klein, 2019) and can be addressed by frameworks traditionally used for other natural language processing tasks, such as part-of-speech tagging or named-entity recognition. Current sequence labeling parsers combine competitive accuracy with high computational efficiency, while providing extra simplicity using off-the-shelf sequence labeling software without the need for ad-hoc parsing algorithms.
In the realm of dependency parsing, pioneering work dates back to Spoustov´ a and Spousta (2010), who used a relative PoS-tag based encoding to represent trees as label sequences, but the resulting accuracy was not practical even for the standards of the time, probably due to the inability of pre-deeplearning architectures to successfully learn the representation. Using more modern architectures with the ability to contextualize words based on the sentence, and various tree encodings, Strzyz et al. (2019b) were the first to show that competitive accuracy could be reached.
Subsequently, this accuracy has been improved further by techniques like the use of multi-task learning to parse dependencies and constituents together (Strzyz et al., 2019a) and of contextualized embeddings (Vilares et al., 2020).
This work is licensed under a Creative Commons Attribution 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by/4.0/ .
<details>
<summary>Image 1 Details</summary>

### Visual Description
## Dependency Parse Diagram
### Overview
The image depicts a dependency parse diagram for a sentence. It shows the relationships between words in the sentence, with arrows indicating the direction of dependency. The diagram includes word tokens labeled w0 through w6, along with their corresponding part-of-speech tags or grammatical roles.
### Components/Axes
* **Nodes:** The nodes represent words in the sentence, labeled as w0, w1, w2, w3, w4, w5, and w6.
* **Labels:** Each word node has a label below it indicating its role or part-of-speech:
* w0: root
* w1: Ă
* w2: ////>
* w3: />
* w4: />
* w5: >
* w6: >
* **Edges:** The edges are represented by curved arrows, indicating the dependency relationships between the words. The arrow points from the head (governor) to the dependent.
### Detailed Analysis
The dependency relationships are as follows:
* w0 (root) depends on w1.
* w1 depends on w2.
* w1 depends on w3.
* w1 depends on w4.
* w1 depends on w5.
* w1 depends on w6.
* w2 depends on w3.
* w2 depends on w4.
* w2 depends on w5.
### Key Observations
* The word "w1" appears to be the central node, with multiple words depending on it.
* The "root" node (w0) depends on "w1", indicating that "w1" is the main verb or head of the sentence.
* The diagram shows a hierarchical structure, with dependencies flowing from the root to the other words.
### Interpretation
The dependency parse diagram visually represents the grammatical structure of a sentence. It shows how words are related to each other and how they depend on each other to form a coherent meaning. The diagram highlights the importance of "w1" as the central element of the sentence, with other words modifying or complementing its meaning. The specific labels (Ă, ////>, /> , >) likely represent specific grammatical categories or features, which would require additional context to fully understand.
</details>
(a) Projective encoding restricted to a single plane. Infeasible to reconstruct a non-projective sentence.
<details>
<summary>Image 2 Details</summary>

### Visual Description
## Dependency Parse Diagram
### Overview
The image depicts a dependency parse diagram, illustrating the relationships between words in a sentence. The diagram shows nodes representing words, with arcs indicating dependencies. Some arcs are solid black, while others are dotted red, possibly indicating different types of dependencies or relationships.
### Components/Axes
* **Nodes:** The nodes are labeled `w0` through `w6`.
* `w0`: Labeled "root" below.
* `w1`: Labeled "â " below.
* `w2`: Labeled "/// >" below.
* `w3`: Labeled "/*>" below.
* `w4`: Labeled ">*" below, and is colored red.
* `w5`: Labeled ">" below.
* `w6`: Labeled "â " below.
* **Arcs:** The arcs connect the nodes, indicating dependencies. Some arcs are solid black, while others are dotted red.
* A solid black arc connects `w0` to `w1`.
* A solid black arc connects `w0` to `w5`.
* A solid black arc connects `w1` to `w2`.
* A solid black arc connects `w1` to `w3`.
* A dotted red arc connects `w1` to `w4`.
* A dotted red arc connects `w3` to `w4`.
### Detailed Analysis or ### Content Details
* **Node Labels:**
* `w0`: "root"
* `w1`: "â "
* `w2`: "/// >"
* `w3`: "/*>"
* `w4`: ">*" (red)
* `w5`: ">"
* `w6`: "â "
* **Arc Connections:**
* `w0` -> `w1` (solid black)
* `w0` -> `w5` (solid black)
* `w1` -> `w2` (solid black)
* `w1` -> `w3` (solid black)
* `w1` -> `w4` (dotted red)
* `w3` -> `w4` (dotted red)
### Key Observations
* The node `w4` and the arcs connecting to it are colored red, suggesting a special significance or type of dependency.
* The labels below the nodes appear to be symbols or placeholders, possibly representing grammatical elements or word types.
### Interpretation
The diagram represents a dependency parse of a sentence, showing the relationships between words. The red node and arcs may indicate a specific type of dependency or an error/correction in the parse. The symbols below the nodes likely represent grammatical information. The "root" node indicates the starting point of the parse. The "â " symbols might represent empty categories or null elements. The other symbols (/// >, /*>, >) likely represent specific grammatical features or categories. The diagram illustrates the hierarchical structure of the sentence and the dependencies between its components.
</details>
- (b) Non-projective 2-planar encoding with second-plane-averse greedy plane assignment. The arc w 3 â w 6 is not assigned a plane because it would cross arcs belonging to both planes, which is forbidden by the 2-planar constraint.
- (c) Non-projective 2-planar encoding with plane assignment based on restriction propagation on the crossings graph.
<details>
<summary>Image 3 Details</summary>

### Visual Description
## Dependency Parse Diagram: Example Sentence Structure
### Overview
The image depicts a dependency parse diagram illustrating the grammatical structure of a sentence. The diagram shows words labeled w0 through w6, with arcs indicating dependencies between them. Some dependencies are represented with solid black lines, while others are represented with dotted red lines. The words are annotated with symbols and characters, some of which are in red.
### Components/Axes
* **Nodes:** The nodes represent words in the sentence, labeled as w0, w1, w2, w3, w4, w5, and w6.
* **Dependencies (Black Arcs):** Solid black arcs indicate primary dependencies between words.
* **Dependencies (Red Arcs):** Dotted red arcs indicate secondary or different types of dependencies.
* **Word Annotations:** Each word node has an annotation below it. These annotations include "root", "â ", and various symbols like "/", "*", ">". Some of these annotations are in red.
### Detailed Analysis or Content Details
* **w0:** Labeled as "root".
* **w1:** Labeled as "â ".
* **w2:** Labeled as "/*//>". The "/*" part is in red.
* **w3:** Labeled as "/*>". The "/*" part is in red.
* **w4:** Labeled as "/>*". The "/>*" part is in red.
* **w5:** Labeled as ">*". The ">*" part is in red.
* **w6:** Labeled as ">".
**Dependencies (Black Arcs):**
* A black arc connects w0 to w1.
* A black arc connects w1 to w2.
* A black arc connects w1 to w3.
* A black arc connects w1 to w6.
* A black arc connects w3 to w5.
**Dependencies (Red Arcs):**
* A dotted red arc connects w1 to w4.
* A dotted red arc connects w2 to w5.
### Key Observations
* The "root" node (w0) is the starting point of the dependency parse.
* The node w1 has multiple outgoing dependencies, suggesting it is a central element in the sentence structure.
* The red arcs indicate dependencies that are distinct from the primary dependencies represented by the black arcs.
* The red annotations below the nodes w2, w3, w4, and w5 suggest these words have special significance or are associated with specific grammatical categories.
### Interpretation
The diagram illustrates the dependency relationships between words in a sentence. The solid black arcs represent the core grammatical structure, while the dotted red arcs indicate additional or alternative dependencies. The annotations below each word provide further information about its role or category within the sentence. The red annotations highlight specific elements that may be of particular interest or importance in the analysis of the sentence structure. The diagram is a visual representation of how words relate to each other grammatically, providing insights into the sentence's meaning and construction.
</details>
Figure 1: Bracketing-based encodings with their plane assignment strategies for a non-projective sentence. The red, dotted lines refer to the arcs represented in the second plane, denoted by * in the encoding label.
While parsing as sequence labeling does not need specific parsing algorithms or data structures, as in graph-based or transition-based parsing, the responsibility of providing suitable parsing representations with reasonable coverage and learnability falls instead on the encoding used to represent trees as sequences of labels. Strzyz et al. (2019b) used four different encodings that obtained substantially different parsing accuracies in the experiments, with two encodings achieving competitive accuracy: the relative PoS tag (rel-PoS) encoding of Spoustov´ a and Spousta (2010) and a new encoding based on balanced brackets, inspired by Yli-Jyr¨ a and G´ omez-Rodr´ Ĺguez (2017). While the encoding of Spoustov´ a and Spousta (2010) achieved a good accuracy, and it has full coverage of non-projective dependency trees, it requires PoS tags to encode the dependency arcs. This can be seen as a weakness, not just because computing and feeding PoS tags increases the latency, but also because the traditional assumption that PoS tagging is needed for parsing is being increasingly called into question (de Lhoneux et al., 2017; Smith et al., 2018; Kitaev and Klein, 2018; Anderson and G´ omez-Rodr´ Ĺguez, 2020). Low-frequency PoS tags can cause sparsity in the encoding, and low-quality PoS tags could be a potential source of errors in low-resource languages.
For this reason, Lacroix (2019) proposed two alternative encodings with the same relative indexing philosophy, but without using PoS tags. However, these encodings require a composition of two sequence labeling processes instead of one.
On the other hand, the bracketing encoding inspired in (Yli-Jyr¨ a and G´ omez-Rodr´ Ĺguez, 2017) represents the trees independently of PoS tags or any other previous tagging step, but it has the limitation of being restricted to a very mild extension of projective trees.
Contribution. In this paper, we extend the idea of the bracketing-based encoding to non-projective parsing by defining a variant that can encode all 2-planar dependency trees (Yli-Jyr¨ a, 2003). 2planar dependency trees have been shown to cover the vast majority of non-projective trees in attested sentences (G´ omez-Rodr´ Ĺguez, 2016) and have been used in transition-based parsing
(G´ omez-Rodr´ Ĺguez and Nivre, 2013; Fern´ andez-Gonz´ alez and G´ omez-Rodr´ Ĺguez, 2018). We show that our encoding provides better parsing accuracy than the original bracketing-based encoding on highly nonprojective UD treebanks; and than the rel-PoS encoding when assuming PoS tags are not fed as input parameters to the models. The source code is available at https://github.com/mstrise/dep2label .
## 2 Preliminaries
Given a sentence w 1 . . . w n , we associate the words with nodes 0 , 1 , . . . , n , where 0 is a dummy root node. Then, a dependency graph is an edge-labeled graph ( V, E ) with V = { 0 , 1 , . . . , n } and E a set of edges of the form ( h, d, l ) where h â V is the head, d â V \{ 0 } is the dependent, and l is the dependency label. The goal of a dependency parser is to find a dependency graph that is a tree (i.e. without cycles, and with no dependent having more than one head) rooted at node 0 .
## 2.1 Bracketing encoding
Dependency arcs are encoded through a sequence of bracket elements from a set B = { < , / , / , > } . A balanced pair of brackets ( < , / ) in the labels of the words w i and w j represents a left arc from word w j to w i -1 . A balanced pair of brackets ( / , > ) in the labels of the words w i and w j represents a right arc from word w i -1 to w j . one incoming arc and several outgoing arcs, resulting on labels composed of several such brackets, following the regular expression (<)?((\)*|(/)*)(>)? . As shown in Figure 1a, the token w 2 is assigned a label ///> that can be interpreted as: the previous token w 1 has three outgoing arcs to the right and one of them matches the left incoming arc of w 2 ( /> ) meaning that w 1 is the head of w 2 . The remaining two dependents will be given by the matching > in the labels of the following words.
Since each opening bracket is always matched to the closest same-direction closing bracket, this encoding is unable to handle crossing arcs in the same direction. An attempt of encoding such crossing arcs will result in decoding into non-crossing arcs.However, the encoding can handle crossing arcs in opposite directions, as long as left and right brackets are balanced independently (e.g. by using separate stacks for each kind of bracket). The paper by Strzyz et al. (2019b) erroneously describes the encoding as only supporting projective trees. In fact, the implementation in that paper is supporting this mild extension of projectivity where crossing arcs in opposite directions are allowed.
## 2.2 2-Planarity
A dependency graph ( V, E ) is said to be k -planar, for k ⼠1 , if there is a partition of the edges into sets E 1 , . . . , E k , called planes, in such a way that edges that are in the same plane do not cross. For k = 1 , this corresponds to the concept of a noncrossing dependency graph (Kuhlmann and Jonsson, 2015) or planar linear arrangement (Chao and Sha, 1992) (not to be confused with a planar graph). Under the assumption of trees rooted at the dummy root node 0 , 1 -planar trees are equivalent to the well-known projective trees. For k ⼠2 , this means that the dependency graph (together with the linear order of the words) is a k -page book embedding of a graph (see (Pitler et al., 2013)). Intuitively, a k -planar graph is one where each arc can be assigned one out of k colors in such a way that arcs with the same color do not cross (see Figure 1).
2 -planarity has been shown to be particularly relevant for parsing, as the overwhelming majority of syntactic structures in syntactic treebanks has been shown to be 2-planar (G´ omez-Rodr´ Ĺguez and Nivre, 2013; G´ omez-Rodr´ Ĺguez, 2016) and efficient transition-based parsers have been proposed for this set of structures (G´ omez-Rodr´ Ĺguez and Nivre, 2010; Fern´ andez-Gonz´ alez and G´ omez-Rodr´ Ĺguez, 2018).
## 3 2-Planar bracketing encodings
In order to support the extended non-projective coverage provided by 2-planarity in the bracketing system, we balance a different set of brackets for each plane.
We introduce a set of 'star' bracket elements denoting arcs belonging to the second plane, B â = { < â , / â , / â , > â } . A token w i can be assigned elements from both B and B â . Brackets only match when they are on the same plane, i.e., ( < , / ) , ( / , > ) are matching pairs of brackets that encode arcs in the first plane, and ( < â , / â ) , ( / â , > â ) are matching pairs of brackets that encode arcs in the second plane. The decoding process is implemented by operating on separate stacks for the first-plane brackets and the second-plane brackets.
## 3.1 Plane assignment strategies
According to the definition in Section 2.2, a tree is 2-planar if its edges can be partitioned into two planes, E 1 and E 2 , such that edges in the same plane do not cross. However, often this partition is not unique (for example, in the case of trees that are also 1 -planar, any partition satisfies the condition). Thus, for the encoding in Section 3 to provide a single sequence of labels for each gold tree during training, we need to fix a plane assignment strategy , i.e., a canonical way of assigning each arc to a plane to obtain such a partition. While the number of possible partitions is exponential in the size of the tree, desirable partitions should be easily learnable, i.e., follow predictable patterns. Given that the amount of crossing dependencies in treebanks is scarce (Ferrer-i-Cancho et al., 2018), it makes sense to look for partitions that do not make use of an extra plane when not needed, so that the parsing of sentences or fragments without crossing arcs does not become more difficult or need more output labels than in the basic bracketing encoding (as they will only use one plane and thus one set of brackets). Following this general principle, we define the following plane assignment strategies:
Second-Plane-Averse Greedy Plane Assignment Arcs in the gold tree are traversed in left-to-right order of their right endpoint, with shortest arcs first when they share a right endpoint (this is the order in which arcs will be decoded using a stack, see Section 4). For each arc a , we assign the first plane if possible (i.e., if no arc crossing a has already been assigned the first plane). Otherwise, we assign the second plane if possible, or no plane if the arc a crosses arcs assigned to both planes. The process is formally described with pseudocode in Algorithm 1.
<details>
<summary>Image 4 Details</summary>

### Visual Description
## Algorithm: 2p-greedy
### Overview
The image presents the pseudocode for an algorithm named "2p-greedy". The algorithm takes a set of arcs T and an input length n as input and returns two sets (planes) of arcs, P1 and P2. The algorithm iterates through possible arc combinations and attempts to assign each arc to either P1 or P2 based on whether it crosses existing arcs in those sets.
### Components/Axes
The algorithm consists of the following components:
* **Algorithm Name:** 2p-greedy
* **Input:**
* A set of arcs T
* Input length n
* **Result:** Two sets (planes) of arcs P1, P2
* **Initialization:**
* P1 is initialized as an empty set (P1 â Ă)
* P2 is initialized as an empty set (P2 â Ă)
* **Outer Loop:** Iterates from xr = 1 to n
* **Inner Loop:** Iterates from xl = xr - 1 downto 0
* **Conditional Check:** Checks if there exists an arc 'a' in T such that a = (xl, xr, l) or a = (xr, xl, l)
* **Assignment:**
* If the condition is true, nextArc is assigned to 'a' (nextArc â a)
* C is defined as the set of arcs in P1 or P2 that cross 'a' (C â {b â (P1 ⪠P2) | b crosses a})
* If C ⊠P1 = Ă, then nextArc is added to P1 (P1 â P1 ⪠{nextArc})
* Else, if C ⊠P2 = Ă, then nextArc is added to P2 (P2 â P2 ⪠{nextArc})
* Otherwise, do nothing (failed to assign nextArc to a plane)
* **Return:** Returns the sets P1 and P2 (return P1, P2)
### Detailed Analysis or Content Details
The algorithm attempts to greedily assign arcs to two sets (planes), P1 and P2. The assignment is based on whether adding an arc to a set would cause it to cross any existing arcs in that set. The algorithm iterates through all possible arc combinations defined by the input length 'n'.
The core logic lies within the nested loops and the conditional checks. The algorithm first checks if a valid arc 'a' exists in the input set T. If it does, it then checks if adding 'a' to either P1 or P2 would cause it to cross any existing arcs in those sets. If adding 'a' to P1 doesn't cause any crossings, it's added to P1. Otherwise, if adding 'a' to P2 doesn't cause any crossings, it's added to P2. If adding 'a' to either set would cause crossings, it's not assigned to either set.
### Key Observations
* The algorithm is "greedy" because it makes decisions based on the immediate impact of adding an arc to a set, without considering the long-term consequences.
* The algorithm aims to minimize crossings within each set of arcs.
* The algorithm may fail to assign some arcs to either P1 or P2 if they cross existing arcs in both sets.
### Interpretation
The 2p-greedy algorithm is designed to partition a set of arcs into two sets (planes) such that the number of crossings within each set is minimized. This type of algorithm could be used in various applications, such as graph drawing or conflict resolution. The algorithm's performance depends on the specific characteristics of the input set of arcs. In some cases, it may produce a good partitioning, while in other cases, it may fail to assign many arcs or produce a partitioning with a high number of crossings. The algorithm's greedy nature means that it may not always find the optimal solution.
</details>
## Second-Plane-Averse Plane Assignment based on Restriction Propagation on the Crossings Graph
While the greedy approach is very simple, it has the disadvantage that it may make suboptimal decisions leading to reduced coverage: assigning an arc to a given plane may seem like a good local decision, but depending on how arcs cross each other in the whole tree, it may lead to a subsequent situation where an arc cannot be assigned a plane even if the tree is actually 2-planar.
An example of this can be seen in Figure 1b: the greedy strategy will assign the arcs w 1 â w 3 and w 1 â w 5 to the first plane, which in a local context is the simplest thing to do. However, the fact that w 1 â w 3 crosses w 2 â w 4 (which is thus assigned to the second plane) and w 3 â w 6 crosses both w 1 â w 5 (first plane) and w 2 â w 4 (second plane) then means that it is impossible to assign a plane to the arc w 3 â w 6 . This could have been prevented by assigning arc w 1 â w 5 to the second plane, but a greedy algorithm has no way to anticipate this. To deal with this problem, we propagate restrictions by traversing the crossings graph, i.e., a graph where its nodes represent the edges in the gold tree and two nodes are linked if the corresponding edges cross (G´ omez-Rodr´ Äąguez and Nivre, 2013). Whenever we assign a given arc to plane 1, then we forbid plane 1 for its neighbors in the crossings graph (i.e. the arcs that cross it), we forbid plane 2 for the neighbors of its neighbors, plane 1 for the neighbors of those, and so on. For arcs assigned to plane 2, we proceed symmetrically.
Thus, the traversal order of arcs is the same as in the previous strategy, but for each new arc a , we look at the restrictions and assign it to the first plane if allowed, otherwise to the second plane if allowed, and finally to no plane if neither are allowed. In this case, the latter will only happen for non-2-planar trees: it is easy to show that situations where both planes are forbidden for the same arc can only happen if the crossings graph has a cycle of odd length, which is equivalent to the tree not being 2-planar (see (G´ omez-Rodr´ Ĺguez and Nivre, 2013)). Thus, this strategy guarantees full coverage of 2-planar structures. The pseudocode of the strategy can be seen in Algorithm 2, where P 1 and P 2 represent the arcs forbidden from planes 1 and 2, respectively.
<details>
<summary>Image 5 Details</summary>

### Visual Description
## Algorithm: 2p-prop
### Overview
The image presents the pseudocode for an algorithm named "2p-prop". It details the input, result, and the steps involved in the algorithm, including a function called "Propagate". The algorithm appears to be related to assigning arcs to two sets (planes) denoted as P1 and P2.
### Components/Axes
* **Algorithm Name:** 2p-prop
* **Input:** A set of arcs T, and input length n
* **Result:** Two sets (planes) of arcs P1, P2
* **Function:** Propagate (Edge sets T, P1, P2, Edge e, Plane i)
* PĚi â PĚi U {e};
* // e forbidden from plane i
* for (e' â T | e' crosses e) do
* if e' â P3-i then
* (PĚ1, PĚ2) â Propagate (T, PĚ1, PĚ2, e', 3 - i);
* return PĚ1, PĚ2;
* P1 â Ă, P2 â Ă, PĚ1 â Ă, PĚ2 â Ă;
* for xr â 1 to n do
* for xl â xr - 1 downto 0 do
* if âa â T | a = (xl, xr, l) ⨠a = (xr, xl, l) then
* nextArc â a;
* if nextArc â PĚ1 then
* P1 â P1 U {nextArc};
* Propagate (T, PĚ1, PĚ2, nextArc, 2);
* else if nextArc â PĚ2 then
* P2 â P2 U {nextArc};
* Propagate (T, PĚ1, PĚ2, nextArc, 1);
* else
* do nothing (failed to assign nextArc to a plane);
* return P1, P2;
### Detailed Analysis or ### Content Details
The algorithm initializes two sets, P1 and P2, and their complements PĚ1 and PĚ2, to empty sets. It then iterates through a range defined by the input length 'n' using nested loops. Inside the loops, it checks for the existence of an arc 'a' in the set 'T' that satisfies a specific condition related to 'xl', 'xr', and 'l'. If such an arc exists, it's assigned to 'nextArc'. The algorithm then attempts to add 'nextArc' to either P1 or P2, based on whether it's already present in their complements PĚ1 or PĚ2, respectively. The "Propagate" function is called after adding an arc to either P1 or P2. If 'nextArc' cannot be assigned to either plane, the algorithm does nothing. Finally, the algorithm returns the sets P1 and P2.
### Key Observations
* The algorithm uses a function called "Propagate" recursively.
* The algorithm attempts to assign arcs to two sets (planes) based on certain conditions.
* The algorithm includes nested loops that iterate through a range defined by the input length 'n'.
* The algorithm uses complements of the sets P1 and P2 (PĚ1 and PĚ2) to determine where to assign arcs.
### Interpretation
The algorithm "2p-prop" appears to be designed to partition a set of arcs 'T' into two sets (planes) P1 and P2. The "Propagate" function likely enforces constraints or propagates information based on the edge sets and the assigned plane. The nested loops and the conditions involving 'xl', 'xr', and 'l' suggest that the algorithm considers the spatial relationships or properties of the arcs. The use of complements PĚ1 and PĚ2 indicates a strategy to avoid assigning the same arc to both planes. The algorithm's overall goal is to assign arcs to planes while satisfying certain constraints, potentially related to geometric or topological properties.
</details>
Table 1: Percentage of non-projective sentences and dependencies of the selected UD treebanks, where JapaneseGSD and GalicianCTG are control treebanks.
| Language | %non-projective sentences | %non-projective dependencies | Language | %non-projective sentences | %non-projective dependencies |
|-----------------------|-----------------------------|--------------------------------|----------------|-----------------------------|--------------------------------|
| Ancient Greek Perseus | 63.87 | 10.14 | Korean Kaist | 21.7 | 2.55 |
| Basque BDT | 33.17 | 4.69 | Danish DDT | 21.5 | 1.74 |
| Hungarian Szeged | 27.11 | 1.97 | Gothic PROIEL | 17.57 | 2.53 |
| Portuguese Bosque | 23.31 | 1.85 | Lithuanian HSE | 17.49 | 1.27 |
| Urdu UDTB | 22.57 | 1.32 | Japanese GSD | 0 | 0 |
| Afrikaans AfriBooms | 22.34 | 1.62 | Galician CTG | 0 | 0 |
Switch-averse plane assignment strategies Another possibility is to implement variants of the previous two strategies that are switch-averse , rather than second-plane-averse . These variants work like the previous strategies, except for the difference than when both planes can be assigned to the current arc, we assign the last plane used, instead of always preferring to assign the first plane.
The implementation of the 2-planar transition-based parser by G´ omez-Rodr´ Ĺguez and Nivre (2010) used a switch-averse restriction-propagation strategy. This is a reasonable choice because in their transition-based parser it minimizes the number of transitions used: the algorithm's state holds the 'current' plane being used, and switching to the other plane costs one transition. In our sequence labeling context, where this is no longer true (the model always makes n predictions for a sequence of length n ), we made some initial experiments with switch-averse strategies but we found that they performed consistently (albeit slightly) worse than second-plane-averse strategies, so we discarded them for our experiments.
## 4 Bracketing decoding
When a sentence is represented with the bracketing encoding in a single plane, a valid left arc is associated with a pair of matching brackets < and / while a right arc is associated with a pair of / and > . For each sentence we create two initially empty stacks, Ď L and Ď R , in order to keep the elements separate with respect to the arc direction. Thus, the output labels generated by the system are read from left to right, decomposed into their brackets, and then brackets corresponding to left arcs are processed in Ď L and those that encode right arcs are processed in Ď R . In order to handle a second plane with brackets represented as ( < â , / â ) and ( / â , > â ), we simply use additional stacks: Ď â L and Ď â R .
More particularly, decoding proceeds by reading a label for each token and pushing each opening bracketing element to the corresponding stack while preserving the token's index. For instance, when reading a new label that contains < , the bracket element is pushed into the Ď L stack and can only be popped once there is a later matching label with a closing bracketing element / that will be used to create a left arc, by recovering the index stored together with the < bracket. Analogously, right arcs are processed in the same way, but in a different stack.
Postprocessing Decoded labels do not ensure creating a well-formed tree. For that reason, we adapt some common heuristics for all encodings in order to postprocess them. In case some of the brackets in any of the stacks are unbalanced, the outermost bracket elements are discarded. Tokens that are not assigned any head are recovered by attaching them to the word that is attached to the dummy root (i.e., the syntactic head of the sentence). Cycles are also solved by removing the leftmost arc in the cycle.
## 5 Experiments
Data We extracted the most non-projective treebanks from UDv2.4 (Nivre and others, 2019) based on the percentage of non-projective sentences, and discarded some of them due to the lack of a pretrained UDpipe model or due to the lack of a development set. The selected treebanks were: Ancient GreekPerseus, BasqueBDT, HungarianSzeged, PortugueseBosque, UrduUDTB, AfrikaansAfriBooms, KoreanKaist, DanishDDT, GothicPROIEL, LithuanianHSE. In addition, two fully projective treebanks (GalicianCTG and
JapaneseGSD) were included as control treebanks. Table 1 shows the selected treebanks with their percentage of non-projective sentences and dependencies. For all of them, we ran UDPipe models (Straka and Strakov´ a, 2017) to obtain predicted segmentation and tokenization. We also computed predicted PoS tags, but they were not used (nor gold PoS tags were) to train any of the models, but just to decode the labels from the rel-PoS encoding (Strzyz et al., 2019b). In addition, we included dummy beginning- and end-of-sentence tokens (BOS,EOS) as in previous work in parsing as labeling.
Model For our experiments we use bidirectional long short-term memory networks (Hochreiter and Schmidhuber, 1997; Schuster and Paliwal, 1997) as implemented in the NCRF++ framework (Yang and Zhang, 2018). 1 Each input word w i is represented as a vector which comes from a concatenation of (i) an external pre-trained word embedding, which is further fine-tuned during training, and (ii) a second word embedding which results from the output of a char-LSTM, which is trained end-to-end together with the rest of the network.
In this context, let LSTM θ ( /vector x ) be a black-box long short-term memory network that processes the sequence of vectors /vector x = [ /vector x 1 , ..., /vector x | /vector x | ] , then the output for /vector x i is a hidden vector /vector h i which represents the word based on its left and right sentence context:
<!-- formula-not-decoded -->
More particularly, we stack 2 BiLSTMs before computing the output layer. For this, we consider a simple hard-sharing multi-task learning architecture, where each /vector h i is sent to three separate layers in order to generate the classifications through regular softmaxes: two labels predicted for each plane (one label per plane) 2 , we generate an empty label â . and another one for the word's dependency relation. Afterwards, label decoding is followed by a postprocessing step with some heuristics to ensure a valid dependency tree (as described in §4).
## 5.1 Analysis and results
Next, we compare the performance of the original bracketing encoding (1p-brackets), 2-planar with greedy plane assignment (2p-greedy) and 2-planar with restriction propagation (2p-prop) with respect to their theoretical arc coverage, as well as their empirical recall and precision. For UAS/LAS, we also report results for models trained on the rel-PoS encoding.
Table 2: Percentage of arcs covered by the proposed encodings on the gold training set from highly non-projective treebanks.
| Language | 1p-brackets | 2p-greedy | 2p-prop | Language | 1p-brackets | 2p-greedy | 2p-prop |
|-----------------------|---------------|-------------|-----------|---------------------|---------------|-------------|-----------|
| Ancient Greek Perseus | 89.53 | 99.27 | 99.33 | Afrikaans AfriBooms | 98.65 | 99.99 | 99.99 |
| Basque BDT | 94.85 | 99.85 | 99.62 | Korean Kaist | 98.42 | 100 | 100 |
| Hungarian Szeged | 97.57 | 99.96 | 99.98 | Danish DDT | 98.1 | 99.97 | 99.96 |
| Portuguese Bosque | 98.1 | 99.95 | 99.88 | Gothic PROIEL | 97.58 | 99.94 | 99.98 |
| Urdu UDTB | 98.68 | 99.95 | 99.94 | Lithuanian HSE | 98.35 | 99.97 | 100 |
Theoretical advantage Table 2 compares the dependency arc coverage by the encodings on the gold training sets. It is easy to conclude that the 2-planar encodings almost fully succeed to reconstruct highly non-projective datasets, while the bracketing encoding suffers more. When comparing both plane assignments for the 2-planar encodings, we see that the coverage of 2p-greedy is already so high (99.9% or more in all but two treebanks) that the extra coverage provided by 2p-prop is not large in absolute terms. In fact, in some treebanks, 2p-prop even has slightly less measured coverage than 2p-greedy, even though (as explained earlier) the former guarantees full coverage of 2-planar trees while the latter does not. This can be explained because there are non-2-planar trees where 2p-greedy happens to cover more arcs. In such trees, the theoretical guarantee provided by 2p-prop does not apply.
1 We omit however the CRF on top of the BiLSTMs.
2 If a given word has no arcs associated to that plane
Table 3: Label size of each encoding based on the training and dev set for each treebank. Each task contains three additional labels: BOS, EOS and â . Hence the Japanese and Galician treebanks have three labels for the second plane, although they are fully projective.
| Language | Encoding | Task 1 (1st plane) | Task 2 (2nd plane) | Task 3 (deprel) | Language | Encoding | Task 1 (1st plane) | Task 2 (2nd plane) | Task 3 (deprel) | Language | Encoding | Task 1 (1st plane) | Task 2 (2nd plane) | Task 3 (deprel) |
|---------------|-------------|----------------------|----------------------|-------------------|------------|-------------|----------------------|----------------------|-------------------|------------|-------------|----------------------|----------------------|-------------------|
| Ancient Greek | rel-PoS | 166 | - | 27 | Urdu | rel-PoS | 190 | - | 27 | Gothic | rel-PoS | 121 | - | 34 |
| Ancient Greek | 1p-brackets | 210 | - | 27 | Urdu | 1p-brackets | 95 | - | 27 | Gothic | 1p-brackets | 114 | - | 34 |
| Ancient Greek | 2p-greedy | 108 | 37 | 27 | Urdu | 2p-greedy | 80 | 22 | 27 | Gothic | 2p-greedy | 78 | 18 | 34 |
| Ancient Greek | 2p-prop | 109 | 39 | 27 | Urdu | 2p-prop | 80 | 22 | 27 | Gothic | 2p-prop | 78 | 19 | 34 |
| Basque | rel-PoS | 132 | - | 32 | Afrikaans | rel-PoS | 110 | - | 28 | Lithuanian | rel-PoS | 89 | - | 38 |
| Basque | 1p-brackets | 134 | - | 32 | Afrikaans | 1p-brackets | 77 | - | 28 | Lithuanian | 1p-brackets | 57 | - | 38 |
| Basque | 2p-greedy | 84 | 25 | 32 | Afrikaans | 2p-greedy | 62 | 15 | 28 | Lithuanian | 2p-greedy | 46 | 11 | 38 |
| Basque | 2p-prop | 83 | 25 | 32 | Afrikaans | 2p-prop | 62 | 15 | 28 | Lithuanian | 2p-prop | 46 | 12 | 38 |
| Hungarian | rel-PoS | 128 | - | 56 | Korean | rel-PoS | 134 | - | 32 | Japanese | rel-PoS | 77 | - | 27 |
| Hungarian | 1p-brackets | 101 | - | 56 | Korean | 1p-brackets | 89 | - | 32 | Japanese | 1p-brackets | 45 | - | 27 |
| Hungarian | 2p-greedy | 71 | 19 | 56 | Korean | 2p-greedy | 73 | 14 | 32 | Japanese | 2p-greedy | 45 | 3 | 27 |
| Hungarian | 2p-prop | 71 | 21 | 56 | Korean | 2p-prop | 73 | 14 | 32 | Japanese | 2p-prop | 45 | 3 | 27 |
| Portuguese | rel-PoS | 192 | - | 43 | Danish | rel-PoS | 150 | - | 38 | Galician | rel-PoS | 132 | - | 26 |
| Portuguese | 1p-brackets | 110 | - | 43 | Danish | 1p-brackets | 128 | - | 38 | Galician | 1p-brackets | 82 | - | 26 |
| Portuguese | 2p-greedy | 88 | 25 | 43 | Danish | 2p-greedy | 97 | 23 | 38 | Galician | 2p-greedy | 82 | 3 | 26 |
| Portuguese | 2p-prop | 88 | 27 | 43 | Danish | 2p-prop | 96 | 25 | 38 | Galician | 2p-prop | 82 | 3 | 26 |
With respect to the number of labels that each encoding generates (which will directly impact the output size of the softmax layers), Table 3 shows the comparison of the output vocabulary sets for each of the tasks in the multi-task learning setup. We can see that, for most languages, bracketing encodings generate a smaller tag set than rel-PoS 3 ; and in general, the 2-planar encodings do not produce increases in tagset size with respect to the 1-planar bracketing encoding. In fact, for the most non-projective languages (like Ancient Greek or Basque), the 2-planar encodings clearly compress the tag set as, in spite of having a larger variety of brackets, they appear distributed among the two planes so that the bracket strings in each label will tend to be shorter.
Table 4: Models' precision and recall of non-projective sentences on the test set.
| Language | 1p-brackets | 1p-brackets | 2p-greedy | 2p-greedy | 2p-prop | 2p-prop |
|-----------------------|---------------|---------------|-------------|-------------|-----------|-----------|
| | P | R | P | R | P | R |
| Ancient Greek Perseus | 85.74 | 54.34 | 86.33 | 63.85 | 87.58 | 66.23 |
| Basque BDT | 69.87 | 45.80 | 70.14 | 52.97 | 72.77 | 52.80 |
| Hungarian Szeged | 37.17 | 66.98 | 35.51 | 71.70 | 37.80 | 74.53 |
| Portuguese Bosque | 52.94 | 24.77 | 55.84 | 39.45 | 61.64 | 41.28 |
| Urdu UDTB | 36.63 | 36.63 | 38.10 | 31.68 | 39.78 | 36.63 |
| Afrikaans AfriBooms | 40.99 | 65.35 | 46.72 | 63.37 | 46.94 | 68.32 |
| Korean Kaist | 59.45 | 49.54 | 62.24 | 47.03 | 62.80 | 47.03 |
| Danish DDT | 45.54 | 48.57 | 46.36 | 48.57 | 45.37 | 46.67 |
| Gothic PROIEL | 50.50 | 26.42 | 58.88 | 32.64 | 56.00 | 36.27 |
| Lithuanian HSE | 34.38 | 91.67 | 27.59 | 66.67 | 33.33 | 83.33 |
| Average | 51.32 | 51.01 | 52.77 | 51.79 | 54.40 | 55.31 |
Results To investigate how the coverage in Table 2 translates into non-projective performance in actual parsing, we report models' precision and recall. In Table 4, the precision and recall on non-projective sentences 4 increase across the treebanks with 2-planar models, suggesting that they are capable of identifying non-projective sentences to a greater extent than the original bracketing model. Table 5 shows that 2p-greedy and 2p-prop models improve the recall and precision of non-projective dependencies in the majority of treebanks. 5 Again, 2-planar encodings outperform the original bracketing baseline, even
In the rel-PoS encoding, each label word represents the head based on a PoS-tag offset. See also (Strzyz et al., 2019b). 4 Precision and recall on non-projective sentences are computed by looking whether a given sentence is identified as nonprojective (i.e. given a non-projective parse), disregarding the correctness of the predicted non-projective dependencies for that sentence.
5 The reported precision and recall for Lithuanian is lower than for the other treebanks. As we show in Appendix A, the Lithuanian treebank contains a small number of sentences and therefore it is hard to draw robust conclusions about its poor performance.
Table 5: Models' precision and recall of non-projective dependencies on the test set.
| Language | 1p-brackets | 1p-brackets | 2p-greedy | 2p-greedy | 2p-prop | 2p-prop |
|-----------------------|---------------|---------------|-------------|-------------|-----------|-----------|
| | P | R | P | R | P | R |
| Ancient Greek Perseus | 20.82 | 10.32 | 32.40 | 18.65 | 31.40 | 19.16 |
| Basque BDT | 18.41 | 11.80 | 28.11 | 19.83 | 31.76 | 20.40 |
| Hungarian Szeged | 1.57 | 4.05 | 3.13 | 9.25 | 4.06 | 10.98 |
| Portuguese Bosque | 10.87 | 5.18 | 14.50 | 9.84 | 20.18 | 11.92 |
| Urdu UDTB | 3.26 | 3.92 | 0.69 | 0.65 | 3.41 | 3.92 |
| Afrikaans AfriBooms | 11.04 | 18.09 | 13.09 | 19.15 | 12.59 | 18.62 |
| Korean Kaist | 28.12 | 21.68 | 32.26 | 21.37 | 31.06 | 21.53 |
| Danish DDT | 7.66 | 11.35 | 12.96 | 19.86 | 9.45 | 13.48 |
| Gothic PROIEL | 11.11 | 5.17 | 19.63 | 11.03 | 17.46 | 11.38 |
| Lithuanian HSE | 0 | 0 | 0 | 0 | 1.64 | 6.25 |
| Average | 11.29 | 9.16 | 15.68 | 12.96 | 16.30 | 13.76 |
though the latter is able to cover non-projectivity to some degree (crossing arcs pointing in opposite directions). Both 2p-greedy and 2p-prop obtain similar scores, showing that their coverage is comparable.
Table 6: UAS and LAS (%) for the respective encodings on the predicted dev and test set of highly non-projective treebanks and control treebanks.
| Language | Encoding | dev | dev | test | test | Encoding | dev | dev | test | test |
|------------|-------------|-------|-------|--------|--------|-------------|-------|-------|--------|--------|
| Language | Encoding | UAS | LAS | UAS | LAS | Encoding | UAS | LAS | UAS | LAS |
| | rel-PoS | 65.29 | 58.27 | 62.91 | 55.07 | rel-PoS | 81.47 | 78.50 | 77.25 | 73.92 |
| | 1p-brackets | 64.70 | 57.21 | 63.36 | 54.80 | 1p-brackets | 84.54 | 81.54 | 82.37 | 79.03 |
| | 2p-greedy | 67.10 | 59.97 | 65.90 | 57.15 | 2p-greedy | 85.01 | 82.01 | 82.33 | 78.91 |
| | 2p-prop | 67.06 | 59.84 | 65.11 | 56.55 | 2p-prop | 84.65 | 81.73 | 82.32 | 79.03 |
| | rel-PoS | 77.48 | 72.91 | 75.28 | 70.19 | rel-PoS | 78.28 | 74.93 | 77.07 | 73.45 |
| | 1p-brackets | 80.13 | 75.37 | 78.37 | 72.95 | 1p-brackets | 80.60 | 76.59 | 78.25 | 73.94 |
| | 2p-greedy | 79.98 | 75.18 | 78.13 | 72.63 | 2p-greedy | 80.68 | 76.80 | 78.49 | 74.07 |
| | 2p-prop | 80.44 | 75.56 | 78.58 | 73.08 | 2p-prop | 81.15 | 77.27 | 78.87 | 74.42 |
| | rel-PoS | 72.58 | 67.13 | 66.19 | 59.32 | rel-PoS | 65.25 | 58.58 | 67.14 | 59.72 |
| | 1p-brackets | 75.09 | 69.13 | 67.80 | 60.50 | 1p-brackets | 65.26 | 57.92 | 66.63 | 59.02 |
| | 2p-greedy | 75.47 | 69.33 | 68.07 | 60.74 | 2p-greedy | 65.26 | 58.05 | 66.84 | 59.26 |
| | 2p-prop | 75.26 | 69.05 | 67.95 | 60.63 | 2p-prop | 65.29 | 57.91 | 66.25 | 58.41 |
| | rel-PoS | 87.10 | 84.28 | 84.74 | 81.02 | rel-PoS | 39.04 | 26.37 | 31.05 | 19.70 |
| | 1p-brackets | 88.88 | 85.78 | 86.67 | 82.44 | 1p-brackets | 40.97 | 25.63 | 34.62 | 19.42 |
| | 2p-greedy | 88.88 | 85.76 | 86.51 | 82.39 | 2p-greedy | 41.34 | 26.46 | 35.08 | 20.45 |
| | 2p-prop | 89.00 | 85.82 | 86.52 | 82.17 | 2p-prop | 44.19 | 29.03 | 34.80 | 21.29 |
| | rel-PoS | 80.98 | 75.09 | 81.18 | 75.26 | rel-PoS | 76.60 | 75.83 | 74.83 | 73.96 |
| | 1p-brackets | 84.22 | 77.23 | 84.28 | 77.19 | 1p-brackets | 78.73 | 77.67 | 77.34 | 76.10 |
| | 2p-greedy | 84.01 | 77.16 | 84.08 | 77.19 | 2p-greedy | 78.81 | 77.78 | 77.47 | 76.24 |
| | 2p-prop | 83.89 | 77.30 | 84.26 | 77.41 | 2p-prop | 78.81 | 77.78 | 77.47 | 76.24 |
| | rel-PoS | 79.00 | 74.58 | 78.93 | 74.65 | rel-PoS | 79.72 | 76.40 | 78.36 | 75.05 |
| | 1p-brackets | 80.77 | 75.54 | 79.52 | 74.86 | 1p-brackets | 80.82 | 77.35 | 80.02 | 76.33 |
| | 2p-greedy | 81.41 | 76.33 | 80.13 | 75.53 | 2p-greedy | 80.90 | 77.36 | 79.91 | 76.32 |
| | 2p-prop | 81.50 | 76.30 | 79.96 | 75.43 | 2p-prop | 80.90 | 77.36 | 79.91 | 76.32 |
Table 6 compares the LAS and UAS performance of the 1- and 2-planar, and also of the rel-PoS encoding. 6 2-planar encodings outperform the existing bracketing encoding in the majority of treebanks. The gains vary between languages but on average 2p-greedy improves UAS by 0 . 4 and 2p-prop by
6 Note that the implementations compared here do not use PoS tags as features. This is sensible for the bracketing encodings, and for the focus of this paper where we are interested in encodings that can be run using raw words as the only input, but differs from the standard setup in the rel-PoS encoding (where using PoS tags comes at no extra significant cost, because they need to be computed for decoding in any case).
0 . 3 , and both improve LAS by 0 . 4 across highly non-projective treebanks. Comparing both assignment strategies for the 2-planar encoding, the theoretical advantage in coverage provided by 2p-prop over 2pgreedy does not translate into accuracy gains in general, as the actual difference in coverage is small when measured in the treebanks (as was seen in Table 2) and the simpler greedy assignment strategy is likely to be easier to learn by the machine learning setup.
Since the syntactic dependencies are represented by a finite set of labels that have been seen in the training and development sets, as in all parsing as sequence labeling approaches, it is expected that at test time our model may encounter unseen labels. In Appendix B we show the label coverage of all encodings on the test set. In general, it seems that the unseen labels do not have significant impact on the overall performance due to their rare occurrence.
Table 7: Comparison of parsing speeds (sent/s) on a single core CPU and GPU. The reported speeds are averaged over 5 runs. Times with a * are a reminder that the rel-PoS encoding additionally requires PoS tagging, whose time is not included in these speeds. As an example, the UDPipe tagging speed on the test set (in sent/s) is: Ancient Greek567 , Basque881 , Hungarian755 , Portuguese210 , Urdu45 , Afrikaans465 , Korean921 , Danish671 , Gothic861 , Lithuanian1418 , Japanese767 and Galician366 .
| Encoding | sent/s | sent/s | | Encoding | sent/s | sent/s | Language | Encoding | sent/s | sent/s |
|-------------|----------|----------|---------------------|-------------|----------|----------|----------------|-------------|----------|----------|
| Encoding | CPU | GPU | | Encoding | CPU | GPU | Language | Encoding | CPU | GPU |
| rel-PoS | 305* | 1012* | Urdu UDTB | rel-PoS | 182* | 625* | Gothic PROIEL | rel-PoS | 430* | 1266* |
| 1p-brackets | 303 | 1011 | Urdu UDTB | 1p-brackets | 186 | 616 | Gothic PROIEL | 1p-brackets | 429 | 1269 |
| 2p-greedy | 288 | 889 | Urdu UDTB | 2p-greedy | 174 | 544 | Gothic PROIEL | 2p-greedy | 414 | 1175 |
| 2p-prop | 289 | 886 | Urdu UDTB | 2p-prop | 175 | 549 | Gothic PROIEL | 2p-prop | 412 | 1186 |
| rel-PoS | 387* | 1461* | Afrikaans AfriBooms | rel-PoS | 228* | 861* | Lithuanian HSE | rel-PoS | 239* | 828* |
| 1p-brackets | 388 | 1454 | Afrikaans AfriBooms | 1p-brackets | 228 | 857 | Lithuanian HSE | 1p-brackets | 235 | 769 |
| 2p-greedy | 378 | 1369 | Afrikaans AfriBooms | 2p-greedy | 220 | 805 | Lithuanian HSE | 2p-greedy | 228 | 740 |
| 2p-prop | 378 | 1369 | Afrikaans AfriBooms | 2p-prop | 221 | 805 | Lithuanian HSE | 2p-prop | 229 | 730 |
| rel-PoS | 219* | 802* | | rel-PoS | 442* | 1718* | Japanese GSD | rel-PoS | 214* | 663* |
| 1p-brackets | 221 | 797 | | 1p-brackets | 447 | 1718 | Japanese GSD | 1p-brackets | 214 | 661 |
| 2p-greedy | 212 | 739 | | 2p-greedy | 434 | 1598 | Japanese GSD | 2p-greedy | 206 | 611 |
| 2p-prop | 213 | 750 | | 2p-prop | 435 | 1544 | Japanese GSD | 2p-prop | 205 | 616 |
| rel-PoS | 242* | 868* | | rel-PoS | 279* | 962* | Galician CTG | rel-PoS | 175* | 752* |
| 1p-brackets | 246 | 872 | | 1p-brackets | 280 | 980 | Galician CTG | 1p-brackets | 177 | 756 |
| 2p-greedy | 236 | 811 | | 2p-greedy | 265 | 841 | Galician CTG | 2p-greedy | 170 | 673 |
| 2p-prop | 237 | 814 | | 2p-prop | 264 | 832 | Galician CTG | 2p-prop | 170 | 673 |
Finally, we measured speeds for each of the encodings on various treebanks, run on a single core CPU 7 and GPU 8 , which we breakdown in Table 7. We can observe that the speed is very similar between 1planar and 2-planar encodings. This is because the bottleneck of the model is in the BiLSTMs, and computing the softmaxes comes at almost no cost despite the differences in the output vocabularies.
## 6 Conclusion
We have shown a new bracketing-based linearization of 2-planar trees compatible with parsing as sequence labeling. Our main goal was to introduce a bracketing encoding with the ability to perform unrestricted non-projective dependency parsing, which remained as an open challenge in sequence labeling parsing under the family of bracketing encodings. Together with the proposed plane assignment strategies and a BiLSTM-based network, our 2-planar bracket representations improve the performance over the existing bracketing-based encoding for parsing as sequence labeling, and also outperform the PoS-based encoding in the absence of PoS-tags as input parameters to the model. Thus, it can be a useful alternative where an encoding that depends on PoS tags is not desirable, e.g. domains with low-frequency or low-quality PoS tags, or to decrease even further the latency of sequence labeling parsers.
Finally, it is worth noting that we have proposed plane assignment strategies that minimize the use of the second plane. However, it is a possible avenue for future work to examine other strategies based on different criteria than the one presented in this paper.
7 For CPU experiments, we used a CPU core Intel Core i7-8700 CPU 3.2 GHz.
8 For GPU experiments, we used an Nvidia TITAN Xp.
## Acknowledgements
This work has received funding from the European Research Council (ERC), which has funded this research under the European Union's Horizon 2020 research and innovation programme (FASTPARSE, grant agreement No 714150), from MINECO (ANSWER-ASAP, TIN2017-85160-C2-1-R), from Xunta de Galicia (ED431C 2020/11), and from Centro de Investigaci´ on de Galicia 'CITIC', funded by Xunta de Galicia and the European Union (European Regional Development Fund- Galicia 2014-2020 Program), by grant ED431G 2019/01. DV is supported by a 2020 Leonardo Grant for Researchers and Cultural Creators from the BBVA Foundation.
## References
- Mark Anderson and Carlos G´ omez-Rodr´ Ĺguez. 2020. On the Frailty of Universal POS Tags for Neural UD Parsers. In Proceedings of the 24th Conference on Computational Natural Language Learning . To appear.
- Liang-Fang Chao and Edwin Hsing-Mean Sha. 1992. Algorithms for min-cut linear arrangements of outerplanar graphs. In [Proceedings] 1992 IEEE International Symposium on Circuits and Systems , volume 4, pages 18511854 vol.4.
- Miryam de Lhoneux, Yan Shao, Ali Basirat, Eliyahu Kiperwasser, Sara Stymne, Yoav Goldberg, and Joakim Nivre. 2017. From raw text to universal dependencies - look, no tags! In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies , pages 207-217, Vancouver, Canada, August. Association for Computational Linguistics.
- Daniel Fern´ andez-Gonz´ alez and Carlos G´ omez-Rodr´ Ĺguez. 2018. A dynamic oracle for linear-time 2-planar dependency parsing. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers) , pages 386-392, New Orleans, Louisiana, June. Association for Computational Linguistics.
- Ramon Ferrer-i-Cancho, Carlos G´ omez-Rodr´ Ĺguez, and Juan Luis Esteban. 2018. Are crossing dependencies really scarce? Physica A: Statistical Mechanics and its Applications , 493:311-329.
- Carlos G´ omez-Rodr´ Ĺguez and Joakim Nivre. 2010. A transition-based parser for 2-planar dependency structures. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics , pages 1492-1501, Uppsala, Sweden, July. Association for Computational Linguistics.
- Carlos G´ omez-Rodr´ Ĺguez and Joakim Nivre. 2013. Divisible transition systems and multiplanar dependency parsing. Computational Linguistics , 39(4):799-845.
- Carlos G´ omez-Rodr´ Ĺguez and David Vilares. 2018. Constituent parsing as sequence labeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , pages 1314-1324, Brussels, Belgium, October-November. Association for Computational Linguistics.
- Carlos G´ omez-Rodr´ Ĺguez. 2016. Restricted non-projectivity: Coverage vs. efficiency. Computational Linguistics , 42(4):809-817, December.
- Sepp Hochreiter and J¨ urgen Schmidhuber. 1997. Long short-term memory. Neural computation , 9(8):1735-1780.
- Nikita Kitaev and Dan Klein. 2018. Constituency parsing with a self-attentive encoder. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 2676-2686, Melbourne, Australia, July. Association for Computational Linguistics.
- Nikita Kitaev and Dan Klein. 2019. Tetra-tagging: Word-synchronous parsing with linear-time inference. CoRR ,
- abs/1904.09745.
- Marco Kuhlmann and Peter Jonsson. 2015. Parsing to noncrossing dependency graphs. Transactions of the Association for Computational Linguistics , 3:559-570.
- Oph´ elie Lacroix. 2019. Dependency parsing as sequence labeling with head-based encoding and multi-task learning. In Proceedings of the Fifth International Conference on Dependency Linguistics (Depling, SyntaxFest 2019) , pages 136-143, Paris, France, August. Association for Computational Linguistics.
- Zuchao Li, Jiaxun Cai, Shexia He, and Hai Zhao. 2018. Seq2seq dependency parsing. In Proceedings of the 27th International Conference on Computational Linguistics , pages 3203-3214, Santa Fe, New Mexico, USA, August. Association for Computational Linguistics.
- Joakim Nivre et al. 2019. Universal dependencies 2.4. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics ( ´ UFAL), Faculty of Mathematics and Physics, Charles University.
- Emily Pitler, Sampath Kannan, and Mitchell Marcus. 2013. Finding optimal 1-endpoint-crossing trees. Transactions of the Association for Computational Linguistics , 1:13-24.
- Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural networks. IEEE transactions on Signal Processing , 45(11):2673-2681.
- Aaron Smith, Miryam de Lhoneux, Sara Stymne, and Joakim Nivre. 2018. An investigation of the interactions between pre-trained word embeddings, character models and POS tags in dependency parsing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , pages 2711-2720, Brussels, Belgium, October-November. Association for Computational Linguistics.
- Drahom´ Ĺra Spoustov´ a and Miroslav Spousta. 2010. Dependency parsing as a sequence labeling task. The Prague Bulletin of Mathematical Linguistics , 94(1):7-14.
- Milan Straka and Jana Strakov´ a. 2017. Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies , pages 88-99, Vancouver, Canada, August. Association for Computational Linguistics.
- Michalina Strzyz, David Vilares, and Carlos G´ omez-Rodr´ Ĺguez. 2019a. Sequence labeling parsing by learning across representations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics , pages 5350-5357, Florence, Italy, July. Association for Computational Linguistics.
- Michalina Strzyz, David Vilares, and Carlos G´ omez-Rodr´ Ĺguez. 2019b. Viable dependency parsing as sequence labeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) , pages 717-723, Minneapolis, Minnesota, June. Association for Computational Linguistics.
- David Vilares, Michalina Strzyz, Anders Søgaard, and Carlos G´ omez-Rodr´ Ĺguez. 2020. Parsing as pretraining. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI-20) , New York, NY, USA.
- Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2015. Grammar as a foreign language. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2 , NIPS'15, pages 2773-2781, Cambridge, MA, USA. MIT Press.
- Jie Yang and Yue Zhang. 2018. Ncrf++: An open-source neural sequence labeling toolkit. In Proceedings of ACL 2018, System Demonstrations , pages 74-79.
- Anssi Yli-Jyr¨ a and Carlos G´ omez-Rodr´ Ĺguez. 2017. Generic axiomatization of families of noncrossing graphs in dependency parsing. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 1745-1755, Vancouver, Canada, July. Association for Computational Linguistics.
- Anssi Mikael Yli-Jyr¨ a. 2003. Multiplanarity - a model for dependency structures in treebanks. In Joakim Nivre and Erhard Hinrichs, editors, TLT 2003. Proceedings of the Second Workshop on Treebanks and Linguistic Theories , volume 9 of Mathematical Modelling in Physics, Engineering and Cognitive Sciences , pages 189200, V¨ axj¨ o, Sweden. V¨ axj¨ o University Press.
## A Treebank sizes
Weprovide some statistics about the chosen treebanks. In Table 8, we report the total number of sentences for each dataset split with their respective non-projectivity percentage.
Table 8: Total number of sentences per each dataset split (% non-projective sentences).
| Language | Train | Dev | Test | Language | Train | Dev | Test |
|-----------------------|------------------|-----------------|-----------------|---------------------|------------------|-----------------|-----------------|
| Ancient Greek Perseus | 11476 ( 62.77% ) | 1137 ( 74.41% ) | 1306 ( 64.40% ) | Afrikaans AfriBooms | 1315 ( 22.21% ) | 194 ( 20.10% ) | 425 ( 23.76% ) |
| Basque BDT | 5396 ( 33.52% ) | 1798 ( 33.48% ) | 1799 ( 31.80% ) | Korean Kaist | 23010 ( 21.92% ) | 2066 ( 22.12% ) | 2287 ( 19.15% ) |
| Hungarian Szeged | 910 ( 25.71% ) | 441 ( 33.56% ) | 449 ( 23.61% ) | Danish DDT | 4383 ( 21.83% ) | 564 ( 21.81% ) | 565 ( 18.58% ) |
| Portuguese Bosque | 8328 ( 23.60% ) | 560 ( 19.46% ) | 477 ( 22.85% ) | Gothic PROIEL | 3387 ( 16.77% ) | 985 ( 19.09% ) | 1029 ( 18.76% ) |
| Urdu UDTB | 4043 ( 23.00% ) | 552 ( 23.01% ) | 535 ( 18.88% ) | Lithuanian HSE | 153 ( 16.34% ) | 55 ( 16.36% ) | 55 ( 21.82% ) |
## B Label coverage
At test time, our model assigns a label for each task by choosing one from a finite set learned during training. As a result, it is expected that the model may not be able to predict some of the labels occurring in the test set. Table 9 reports the number of labels that have not been seen in the training and dev set and the total number of unique labels found in the test set. In addition, we include data about the percentage of occurrences of unseen labels with respect to the the occurrences of all labels in the test set.
Table 9: Label coverage in each task at test time.
| | | Task 1 (1st plane) | Task 1 (1st plane) | Task 1 (1st plane) | Task 2 (2nd plain) | Task 2 (2nd plain) | Task 2 (2nd plain) | Task 3 (deprel) | Task 3 (deprel) | Task 3 (deprel) |
|---------------------|-------------|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|-------------------|-------------------|-------------------|
| Language | Encoding | Unseen | Total | % occ. | Unseen | Total | % occ. | Unseen | Total | % occ. |
| | rel-PoS | 0 ( 0% ) | 75 | 0 | - | - | - | 0 ( 0% ) | 25 | 0 |
| Ancient | 1p-brackets | 4 ( 3.39% ) | 118 | 0 . 02 | - | - | - | 0 ( 0% ) | 25 | 0 |
| Greek Perseus | 2p-greedy | 2 ( 3.08% ) | 65 | 0.01 | 1 ( 5.26% ) | 19 | 0.01 | 0 ( 0% ) | 25 | 0 |
| | 2p-prop | 1 ( 1.49% ) | 67 | 0 | 0 ( 0% ) | 22 | 0 | 0 ( 0% ) | 25 | 0 |
| | rel-PoS | 4 ( 4.3% ) | 93 | 0 . 02 | - | - | - | 0 ( 0% ) | 30 | 0 |
| | 1p-brackets | 2 ( 2.15% ) | 93 | 0 . 01 | - | - | - | 0 ( 0% ) | 30 | 0 |
| Basque BDT | 2p-greedy | 3 ( 4.35% ) | 69 | 0.02 | 2 ( 10.53% ) | 19 | 0.01 | 0 ( 0% ) | 30 | 0 |
| | 2p-prop | 3 ( 4.35% ) | 69 | 0.02 | 2 ( 10.0% ) | 20 | 0.01 | 0 ( 0% ) | 30 | 0 |
| | rel-PoS | 6 ( 7.5% ) | 80 | 0 . 06 | - | - | - | 0 ( 0% ) | 47 | 0 |
| | 1p-brackets | 5 ( 6.1% ) | 82 | 0 . 05 | - | - | - | 0 ( 0% ) | 47 | 0 |
| Hunagrian Szeged | 2p-greedy | 1 ( 1.64% ) | 61 | 0.01 | 1 ( 8.33% ) | 12 | 0.01 | 0 ( 0% ) | 47 | 0 |
| | 2p-prop | 1 ( 1.64% ) | 61 | 0.01 | 1 ( 6.67% ) | 15 | 0.01 | 0 ( 0% ) | 47 | 0 |
| | rel-PoS | 2 ( 2.11% ) | 95 | 0 . 03 | - | - | - | 0 ( 0% ) | 38 | 0 |
| | 1p-brackets | 0 ( 0% ) | 60 | 0 | - | - | - | 0 ( 0% ) | 38 | 0 |
| Portuguese Bosque | 2p-greedy | 0 ( 0% ) | 54 | 0 | 0 ( 0% ) | 14 | 0 | 0 ( 0% ) | 38 | 0 |
| | 2p-prop | 0 ( 0% ) | 54 | 0 | 0 ( 0% ) | 14 | 0 | 0 ( 0% ) | 38 | 0 |
| | rel-PoS | 8 ( 7.69% ) | 104 | 0 . 07 | - | - | - | 0 ( 0% ) | 24 | 0 |
| | 1p-brackets | 4 ( 6.06% ) | 66 | 0 . 03 | - | - | - | 0 ( 0% ) | 24 | 0 |
| Urdu UDTB | 2p-greedy | 3 ( 5.36% ) | 56 | 0.02 | 0 ( 0% ) | 12 | 0 | 0 ( 0% ) | 24 | 0 |
| | 2p-prop | 3 ( 5.36% ) | 56 | 0.02 | 0 ( 0% ) | 14 | 0 | 0 ( 0% ) | 24 | 0 |
| | rel-PoS | 2 ( 2.82% ) | 71 | 0 . 02 | - | - | - | 0 ( 0% ) | 26 | 0 |
| | 1p-brackets | 4 ( 6.06% ) | 66 | 0 . 04 | - | - | - | 0 ( 0% ) | 26 | 0 |
| Afrikaans AfriBooms | 2p-greedy | 1 ( 1.92% ) | 52 | 0.01 | 1 ( 10.0% ) | 10 | 0.01 | 0 ( 0% ) | 26 | 0 |
| | 2p-prop | 1 ( 1.89% ) | 53 | 0.01 | 1 ( 9.09% ) | 11 | 0.01 | 0 ( 0% ) | 26 | 0 |
| | rel-PoS | 1 ( 1.11% ) | 90 | 0 | - | - | - | 1 ( 3.23% ) | 31 | 0 |
| Korean Kaist | 1p-brackets | 1 ( 1.59% ) | 63 | 0 | - | - | - | 1 ( 3.23% ) ) | 31 | 0 |
| | 2p-greedy | 0 ( 0% ) | 56 | 0 | 0 ( 0% ) | 8 | 0 | 1 ( 3.23% | 31 | 0 |
| | 2p-prop | 0 ( 0% ) | 56 | 0 | 0 ( 0% ) | 8 | 0 | 1 ( 3.23% ) | 31 | 0 |
| | rel-PoS | 2 ( 2.25% ) | 89 | 0 . 02 | - | - | - | 0 ( 0% ) | 34 | 0 |
| | 1p-brackets | 0 ( 0% ) | 72 | 0 | - | - | - | 0 ( 0% ) | 34 | 0 |
| Danish DDT | 2p-greedy | 0 ( 0% ) | 63 | 0 | 0 ( 0% ) | 12 | 0 | 0 ( 0% ) | 34 | 0 |
| | 2p-prop | 0 ( 0% ) | 63 | 0 | 0 ( 0% ) | 12 | 0 | 0 ( 0% ) | 34 | 0 |
| | rel-PoS | 3 ( 4.11% ) | 73 | 0 . 03 | - | - | - | 0 ( 0% ) | 31 | 0 |
| Gothic PROIEL | 1p-brackets | 2 ( 2.78% ) | 72 | 0 . 02 | - | - | - | 0 ( 0% ) | 31 | 0 |
| | 2p-greedy | 1 ( 1.79% ) | 56 | 0.01 | 2 ( 13.33% ) ) | 15 | 0.02 | 0 ( 0% ) | 31 | 0 |
| | 2p-prop | 1 ( 1.79% ) | 56 | 0.01 | 2 ( 13.33% | 15 | 0.02 | 0 ( 0% ) | 31 | 0 |
| | rel-PoS | 2 ( 4.17% ) | 48 | 0 . 19 | - | - | - | 1 ( 3.12% ) | 32 | 0 . 09 |
| | 1p-brackets | 7 ( 15.91% ) | 44 | 0 . 66 | - | - | - | 1 ( 3.12% ) | 32 | 0 . 09 |
| Lithuanian HSE | 2p-greedy | 3 ( 8.11% ) | 37 | 0.38 | 1 ( 16.67% ) | 6 | 0.09 | 1 ( 3.12% ) | 32 | 0.09 |
| | 2p-prop | 3 ( 8.11% ) | 37 | 0.38 | 1 ( 16.67% ) | 6 | 0.09 | 1 ( 3.12% | | |
| | | | | | | | | ) | 32 | 0.09 |