Commit ede189dd authored by Christos Christodoulopoulos

v0.2 Released

- Added context from .CHA files (requires a separate .context file; see resource dir)
- Annotation resumes from last edited/saved example
- Colour-coded annotator box: red->automatic annotation, green->previously annotated data, grey->manually annotated
- tregex/tree surgeon integration (saves a separate .mrg.<annotator> file; see treebank dir)
parent 4485ddf3
Branch of the Jubilee project (http://code.google.com/p/propbank/) to deal with new babySRL annotation.
v0.2 Released
- Added context from .CHA files (requires a separate .context file; see resource dir)
- Annotation resumes from last edited/saved example
- Colour-coded annotator box: red->automatic annotation, green->previously annotated data, grey->manually annotated
- tregex/tree surgeon integration (saves a separate .mrg.<annotator> file; see treebank dir)
TREGEX
-----------------------------------------------
Tregex Pattern Syntax and Uses
Using a Tregex pattern, you can find only those trees that match the pattern you're
looking for. The following table shows the symbols that are allowed in the pattern,
and below there is more information about using these patterns.
Table of Symbols and Meanings:
Symbol      Meaning
A << B      A dominates B
A >> B      A is dominated by B
A < B       A immediately dominates B
A > B       A is immediately dominated by B
A $ B       A is a sister of B (and not equal to B)
A .. B      A precedes B
A . B       A immediately precedes B
A ,, B      A follows B
A , B       A immediately follows B
A <<, B     B is a leftmost descendent of A
A <<- B     B is a rightmost descendent of A
A >>, B     A is a leftmost descendent of B
A >>- B     A is a rightmost descendent of B
A <, B      B is the first child of A
A >, B      A is the first child of B
A <- B      B is the last child of A
A >- B      A is the last child of B
A <` B      B is the last child of A
A >` B      A is the last child of B
A <i B      B is the ith child of A (i > 0)
A >i B      A is the ith child of B (i > 0)
A <-i B     B is the ith-to-last child of A (i > 0)
A >-i B     A is the ith-to-last child of B (i > 0)
A <: B      B is the only child of A
A >: B      A is the only child of B
A <<: B     A dominates B via an unbroken chain (length > 0) of unary local trees.
A >>: B     A is dominated by B via an unbroken chain (length > 0) of unary local trees.
A $++ B     A is a left sister of B (same as $.. for context-free trees)
A $-- B     A is a right sister of B (same as $,, for context-free trees)
A $+ B      A is the immediate left sister of B (same as $. for context-free trees)
A $- B      A is the immediate right sister of B (same as $, for context-free trees)
A $.. B     A is a sister of B and precedes B
A $,, B     A is a sister of B and follows B
A $. B      A is a sister of B and immediately precedes B
A $, B      A is a sister of B and immediately follows B
A <+(C) B   A dominates B via an unbroken chain of (zero or more) nodes matching description C
A >+(C) B   A is dominated by B via an unbroken chain of (zero or more) nodes matching description C
A .+(C) B   A precedes B via an unbroken chain of (zero or more) nodes matching description C
A ,+(C) B   A follows B via an unbroken chain of (zero or more) nodes matching description C
A <<# B     B is a head of phrase A
A >># B     A is a head of phrase B
A <# B      B is the immediate head of phrase A
A ># B      A is the immediate head of phrase B
A == B      A and B are the same node
A : B       [this is a pattern-segmenting operator that places no constraints on the relationship between A and B]
Label descriptions can be literal strings, which must match labels exactly, or regular
expressions in regular expression bars: /regex/. Literal string matching proceeds as
String equality. In order to prevent ambiguity with other Tregex symbols, only standard
"identifiers" are allowed as literals, i.e., strings matching [a-zA-Z]([a-zA-Z0-9_])* .
If you want to use other symbols, you can do so by using a regular expression instead of
a literal string. A disjunctive list of literal strings can be given separated by '|'.
The special string '__' (two underscores) can be used to match any node. (WARNING!!
Use of the '__' node description may seriously slow down search.) If a label description
is preceded by '@', the label will match any node whose basicCategory matches the description.
NB: A single '@' thus scopes over a disjunction specified by '|': @NP|VP means things with basic category NP or VP.
Label description regular expressions are matched as find(), as in Perl/tgrep;
you need to specify ^ or $ to constrain matches.
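The find() behaviour described above can be illustrated with plain java.util.regex. This is only a sketch: it does not call Tregex, and the class and method names (LabelFindDemo, labelMatches) are made up for the illustration. It shows why an unanchored description also hits labels that merely contain the text.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; it only illustrates find()-style matching on labels.
class LabelFindDemo {
    // Returns true if the regex is found anywhere in the label (find() semantics).
    static boolean labelMatches(String regex, String label) {
        return Pattern.compile(regex).matcher(label).find();
    }

    public static void main(String[] args) {
        System.out.println(labelMatches("NP", "WHNP-1"));   // true: found inside the label
        System.out.println(labelMatches("^NP$", "WHNP-1")); // false: whole label must be NP
        System.out.println(labelMatches("^NP", "NP-SBJ"));  // true: anchored at the start only
    }
}
```

In Tregex terms, /NP/ behaves like the first call and /^NP$/ like the second.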
In a chain of relations, all relations are relative to the first node in the chain.
For example, (S < VP < NP) means an S over a VP and also over an NP. If instead what
you want is an S above a VP above an NP, you should write S < (VP < NP).
Nodes can be grouped using parentheses '(' and ')' as in S < (NP $++ VP) to match an S
over an NP, where the NP has a VP as a right sister.
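The difference between (S < VP < NP) and S < (VP < NP) can be spelled out on a toy tree. The sketch below is not the Tregex implementation; Node, flatChain and nestedChain are hypothetical names that hand-code the two readings for one small tree.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal tree node, used only to illustrate chain semantics.
class ChainDemo {
    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();

        Node(String label, Node... kids) {
            this.label = label;
            for (Node k : kids) children.add(k);
        }

        // True if some immediate child has the given label (the "<" relation).
        boolean hasChild(String childLabel) {
            for (Node c : children) {
                if (c.label.equals(childLabel)) return true;
            }
            return false;
        }
    }

    // S < VP < NP : both relations are relative to the S node.
    static boolean flatChain(Node s) {
        return s.label.equals("S") && s.hasChild("VP") && s.hasChild("NP");
    }

    // S < (VP < NP) : the NP must be a child of the VP.
    static boolean nestedChain(Node s) {
        if (!s.label.equals("S")) return false;
        for (Node c : s.children) {
            if (c.label.equals("VP") && c.hasChild("NP")) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Node tree = new Node("S", new Node("NP"), new Node("VP", new Node("VBD")));
        System.out.println(flatChain(tree));   // true: S has both an NP child and a VP child
        System.out.println(nestedChain(tree)); // false: the VP has no NP child
    }
}
```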
Boolean relational operators
Relations can be combined using the '&' and '|' operators, negated with the '!' operator,
and made optional with the '?' operator. Thus (NP < NN | < NNS) will match an NP node
dominating either an NN or an NNS. (NP > S & $++ VP) matches an NP that is both under
an S and has a VP as a right sister.
Relations can be grouped using brackets '[' and ']'. So the expression
NP [< NN | < NNS] & > S
matches an NP that (1) dominates either an NN or an NNS, and (2) is under an S. Without
brackets, & takes precedence over |, and equivalent operators are left-associative. Also
note that & is the default combining operator if the operator is omitted in a chain of
relations, so that the two patterns are equivalent:
(S < VP < NP)
(S < VP & < NP)
As another example, (VP < VV | < NP % NP) can be written explicitly as (VP [< VV | [< NP & % NP] ] ).
Relations can be negated with the '!' operator, in which case the expression will match
only if there is no node satisfying the relation. For example (NP !< NNP) matches only
NPs not dominating an NNP. Label descriptions can also be negated with '!': (NP < !NNP|NNS)
matches NPs dominating some node that is not an NNP or an NNS.
Relations can be made optional with the '?' operator. This way the expression will match even
if the optional relation is not satisfied. This is useful when used together with node naming
(see below).
Basic Categories
In order to consider only the "basic category" of a tree label, i.e. to ignore functional tags
or other annotations on the label, prefix that node's description with the @ symbol. For example
(@NP < @/NN.?/). This can only be used for individual nodes; if you want all nodes to use the
basic category, it would be more efficient to use a TreeNormalizer to remove functional tags
before passing the tree to the TregexPattern.
Segmenting patterns
The ":" operator allows you to segment a pattern into two pieces. This can simplify your pattern
writing. For example, the pattern S : NP matches only those S nodes in trees that also have an NP node.
Naming nodes
Nodes can be given names (a.k.a. handles) using '='. A named node will be stored in a map that
maps names to nodes so that if a match is found, the node corresponding to the named node can
be extracted from the map. For example (NP < NNP=name) will match an NP dominating an NNP
and after a match is found, the map can be queried with the name to retrieve the matched node
using TregexMatcher#getNode(Object o) with the (String) argument "name" (not "=name"). Note
that you are not allowed to name a node that is under the scope of a negation operator (the
semantics would be unclear, since you can't store a node that never gets matched to). Trying to
do so will cause a ParseException to be thrown. Named nodes can be put within the scope of an
optional operator.
Named nodes that refer back to previous named nodes need not have a node description -- this is
known as "backreferencing". In this case, the expression will match only when all instances of
the same name get matched to the same tree node. For example, the pattern:
(@NP <, (@NP $+ (/,/ $+ (@NP $+ /,/=comma))) <- =comma)
matches only an NP dominating exactly the sequence NP, NP; the mother NP cannot have any other
daughters. Multiple backreferences are allowed. If the node with no node description does not
refer to a previously named node, there will be no error, the expression simply will not match
anything.
Another way to refer to previously named nodes is with the "link" symbol: '~'. A link is like a
backreference, except that instead of having to be equal to the referred node, the
current node only has to match the label of the referred-to node. A link cannot have a node
description, i.e. the '~' symbol must immediately follow a relation symbol.
Variable Groups
If you write a node description using a regular expression, you can assign its matching groups to
variable names. If more than one node has a group assigned to the same variable name, then matching
will only occur when all such groups capture the same string. This is useful for enforcing
coindexation constraints. The syntax is:
/ <regex-stuff> /#<group-number>%<variable-name>
For example, the pattern (designed for Penn Treebank trees):
@SBAR < /^WH.*-([0-9]+)$/#1%index << (__=empty < (/^-NONE-/ < /^\\*T\\*-([0-9]+)$/#1%index))
matches only when the WH- node under the SBAR is coindexed with the trace node that gets the name empty.
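The coindexation constraint can be mimicked with two ordinary Java regexes whose first groups must capture the same string. This is a sketch of the idea only, not of how Tregex evaluates variable groups; CoindexDemo and coindexed are hypothetical names. The doubled backslashes are Java string-literal escapes for a literal '*'.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo: two label regexes that must capture the same index.
class CoindexDemo {
    private static final Pattern WH = Pattern.compile("^WH.*-([0-9]+)$");
    private static final Pattern TRACE = Pattern.compile("^\\*T\\*-([0-9]+)$");

    // True only if both labels match and group 1 captures the same string in each.
    static boolean coindexed(String whLabel, String traceLabel) {
        Matcher wh = WH.matcher(whLabel);
        Matcher tr = TRACE.matcher(traceLabel);
        return wh.find() && tr.find() && wh.group(1).equals(tr.group(1));
    }

    public static void main(String[] args) {
        System.out.println(coindexed("WHNP-1", "*T*-1")); // true: same index
        System.out.println(coindexed("WHNP-1", "*T*-2")); // false: different index
    }
}
```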
\ No newline at end of file
TSURGEON SYNTAX
-----------------------------------------
Legal operation syntax and semantics (see Examples section for further detail):
delete <name_1> <name_2> ... <name_m>
For each name_i, deletes the node it names and everything below it.
prune <name_1> <name_2> ... <name_m>
For each name_i, prunes out the node it names. Pruning differs from
deletion in that if pruning a node causes its parent to have no
children, then the parent is in turn pruned too.
excise <name1> <name2>
The name1 node should either dominate or be the same as the name2
node. This excises out everything from name1 to name2. All the
children of name2 go into the parent of name1, where name1 was.
relabel <name> <new-label>
Relabels the node to have the new label. There are three possible forms
for the new-label:
relabel nodeX VP - for changing a node label to an alphanumeric
string, relabel nodeX /''/ - for relabeling a node to something that
isn't a valid identifier without quoting, and relabel nodeX
/^VB(.*)$/verb\/$1/ - for regular expression based relabeling. In the
last case, all matches of the regular expression against the node
label are replaced with the replacement String. This has the semantics
of Java/Perl's replaceAll: you may use capturing groups and put them
in replacements with $n. Also, as in the example, you can escape a
slash in the middle of the second and third forms with \/ and \\.
This last version lets you make a new label that is an arbitrary
String function of the original label and additional characters that
you supply.
insert <name> <position>
insert <tree> <position>
inserts the named node, or a manually specified tree (see below for
syntax), into the position specified. Right now the only ways to
specify position are:
$+ <name> the left sister of the named node
$- <name> the right sister of the named node
>i <name> the i_th daughter of the named node.
>-i <name> the i_th daughter, counting from the right, of the named node.
move <name> <position>
moves the named node into the specified position. To be precise, it
deletes (*NOT* prunes) the node from the tree, and re-inserts it
into the specified position. See above for how to specify position.
replace <name1> <name2>
deletes name1 and inserts a copy of name2 in its place.
adjoin <tree> <target-node>
adjoins the specified auxiliary tree (see below for syntax) into the
target node specified. The daughters of the target node will become
the daughters of the foot of the auxiliary tree.
adjoinH <tree> <target-node>
similar to adjoin, but preserves the target node and makes it the root
of <tree>. (It is still accessible by its name. The root of
the auxiliary tree is ignored.)
adjoinF <tree> <target-node>
similar to adjoin, but preserves the target node and makes it the foot
of <tree>. (It is still accessible by its name, and retains
its status as parent of its children. The foot of the auxiliary tree
is ignored.)
coindex <name_1> <name_2> ... <name_m>
Puts a (Penn Treebank style) coindexation suffix of the form "-N" on
each of nodes name_1 through name_m. The value of N will be
automatically generated in reference to the existing coindexations
in the tree, so that there is never an accidental clash of
indices across things that are not meant to be coindexed.
-----------------------------------------
Syntax for trees to be inserted or adjoined:
A tree to be adjoined in can be specified with LISP-like
parenthetical-bracketing tree syntax such as those used for the Penn
Treebank. For example, for the NP "the dog" to be inserted you might
use the syntax
(NP (Det the) (N dog))
That's all that there is for a tree to be inserted. Auxiliary trees
(a la Tree Adjoining Grammar) must also have exactly one frontier node
ending in the character "@", which marks it as the "foot" node for
adjunction. Final instances of the character "@" in terminal node labels
will be removed from the actual label of the tree.
For example, if you wanted to adjoin the adverb "breathlessly" into a
VP, you might specify the following auxiliary tree:
(VP (Adv breathlessly) VP@ )
All other instances of "@" in terminal nodes must be escaped (i.e.,
appear as \@); this escaping will be removed by tsurgeon.
In addition, any node of a tree can be named (the same way as in
tregex), by appending =<name> to the node label. That name can be
referred to by subsequent tsurgeon operations triggered by the same
match. All other instances of "=" in node labels must be escaped
(i.e., appear as \=); this escaping will be removed by tsurgeon. For
example, if you want to insert an NP trace somewhere and coindex it
with a node named "antecedent" you might say
insert (NP (-NONE- *T*=trace)) <node-location>
coindex trace antecedent $
-----------------------------------------
Examples of Tsurgeon operations:
Tree (used in all examples):
(ROOT
(S
(NP (NNP Maria_Eugenia_Ochoa_Garcia))
(VP (VBD was)
(VP (VBN arrested)
(PP (IN in)
(NP (NNP May)))))
(. .)))
Apply delete:
VP < PP=prep
delete prep
Result:
(ROOT
(S
(NP (NNP Maria_Eugenia_Ochoa_Garcia))
(VP (VBD was)
(VP (VBN arrested)))
(. .)))
The PP node directly dominated by a VP is removed, as is
everything under it.
Apply prune:
S < (NP < NNP=noun)
prune noun
Result:
(ROOT
(S
(VP (VBD was)
(VP (VBN arrested)
(PP (IN in)
(NP (NNP May)))))
(. .)))
The NNP node is removed, and since this results in the NP above it
having no terminal children, the NP node is deleted as well.
Note: This is different from delete in which the NP above the NNP
would remain.
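To make the contrast between delete and prune concrete, here is a minimal sketch on a hand-rolled tree. It is not Tsurgeon code: the Node class and the delete, prune, find and sample helpers are assumptions invented for this illustration, and a node left without children is simply printed as its bare label.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal tree, used only to contrast delete with prune.
class PruneDemo {
    static final class Node {
        final String label;
        Node parent;
        final List<Node> children = new ArrayList<>();

        Node(String label, Node... kids) {
            this.label = label;
            for (Node k : kids) {
                k.parent = this;
                children.add(k);
            }
        }

        @Override
        public String toString() {
            if (children.isEmpty()) return label;
            StringBuilder sb = new StringBuilder("(").append(label);
            for (Node c : children) sb.append(' ').append(c);
            return sb.append(')').toString();
        }
    }

    // delete: detach the node (and everything below it) from its parent.
    static void delete(Node n) {
        if (n.parent != null) {
            n.parent.children.remove(n);
            n.parent = null;
        }
    }

    // prune: delete the node, then keep removing ancestors that are left childless.
    static void prune(Node n) {
        Node p = n.parent;
        delete(n);
        if (p != null && p.children.isEmpty()) prune(p);
    }

    // Depth-first search for the first node carrying the given label.
    static Node find(Node n, String label) {
        if (n.label.equals(label)) return n;
        for (Node c : n.children) {
            Node hit = find(c, label);
            if (hit != null) return hit;
        }
        return null;
    }

    static Node sample() {
        return new Node("S",
            new Node("NP", new Node("NNP", new Node("Maria"))),
            new Node("VP", new Node("VBD", new Node("was"))));
    }

    public static void main(String[] args) {
        Node a = sample();
        delete(find(a, "NNP"));
        System.out.println(a); // (S NP (VP (VBD was))): the childless NP stays behind

        Node b = sample();
        prune(find(b, "NNP"));
        System.out.println(b); // (S (VP (VBD was))): the childless NP is removed as well
    }
}
```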
Apply excise:
VP < PP=prep
excise prep prep
Result:
(ROOT
(S
(NP (NNP Maria_Eugenia_Ochoa_Garcia))
(VP (VBD was)
(VP (VBN arrested)
(IN in)
(NP (NNP May))))
(. .)))
The PP node is removed, and all of its children are added in the
place it was previously located. Excise removes all the nodes from
the first named node to the second named node, and the children of
the second node are added as children of the parent of the first node.
Thus, for another example:
VP=verb < PP=prep
excise verb prep
Result:
(ROOT
(S
(NP (NNP Maria_Eugenia_Ochoa_Garcia))
(VP (VBD was)
(IN in)
(NP (NNP May)))
(. .)))
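The two excise examples above can be reproduced on a toy tree. The following is a sketch under stated assumptions, not the Tsurgeon implementation: Node and excise are hypothetical, and the sketch refuses to excise a root node. It splices the children of the second node into the parent of the first node, at the position the first node occupied.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal tree, used only to illustrate excise.
class ExciseDemo {
    static final class Node {
        final String label;
        Node parent;
        final List<Node> children = new ArrayList<>();

        Node(String label, Node... kids) {
            this.label = label;
            for (Node k : kids) {
                k.parent = this;
                children.add(k);
            }
        }

        @Override
        public String toString() {
            if (children.isEmpty()) return label;
            StringBuilder sb = new StringBuilder("(").append(label);
            for (Node c : children) sb.append(' ').append(c);
            return sb.append(')').toString();
        }
    }

    // Removes everything from top down to bottom; bottom's children take top's place.
    static void excise(Node top, Node bottom) {
        Node parent = top.parent;
        if (parent == null) {
            throw new IllegalArgumentException("this sketch cannot excise the root");
        }
        int pos = parent.children.indexOf(top);
        parent.children.remove(pos);
        List<Node> moved = new ArrayList<>(bottom.children);
        for (Node c : moved) c.parent = parent;
        parent.children.addAll(pos, moved);
    }

    public static void main(String[] args) {
        Node pp = new Node("PP",
            new Node("IN", new Node("in")),
            new Node("NP", new Node("NNP", new Node("May"))));
        Node inner = new Node("VP", new Node("VBN", new Node("arrested")), pp);
        Node outer = new Node("VP", new Node("VBD", new Node("was")), inner);

        excise(inner, pp); // corresponds to "excise verb prep"
        System.out.println(outer); // (VP (VBD was) (IN in) (NP (NNP May)))
    }
}
```

Calling excise(pp, pp) on a fresh copy of the tree instead removes only the PP, which corresponds to "excise prep prep".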
Apply relabel:
VP=v < PP=prep
relabel prep verbPrep
Result:
(ROOT
(S
(NP (NNP Maria_Eugenia_Ochoa_Garcia))
(VP (VBD was)
(VP (VBN arrested)
(verbPrep (IN in)
(NP (NNP May)))))
(. .)))
The label for the node called prep (PP) is changed to verbPrep.
The other form of relabel uses regular expressions; consider the following
operation:
/^VB.+/=v
relabel v /^VB(.*)$/ #1
Result:
(ROOT
(S
(NP (NNP Maria_Eugenia_Ochoa_Garcia))
(VP (D was)
(VP (N arrested)
(PP (IN in)
(NP (NNP May)))))
(. .)))
The Tregex pattern matches all nodes that begin "VB" and have at least one
more character. The Tsurgeon operation then matches the node label to the
regular expression "^VB(.*)$" and selects the text matching the first part
that is not completely specified in the pattern. In this case, that is the
part matching the wildcard (.*), which matches all characters after the VB.
The node is then relabeled with that part of the text, causing, for example,
"VBD" to be relabeled "D". The "#1" specifies that the name of the node
should be the first group in the regex.
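Since regex relabeling is described as having the semantics of Java's replaceAll, its effect on a label can be previewed with String.replaceAll directly. This is only a sketch of the label arithmetic; RelabelDemo and relabel are hypothetical names, and no Tsurgeon API is called.

```java
// Hypothetical demo of replaceAll-style relabeling on plain label strings.
class RelabelDemo {
    // Replaces every match of the regex in the label; $n refers to capturing group n.
    static String relabel(String label, String regex, String replacement) {
        return label.replaceAll(regex, replacement);
    }

    public static void main(String[] args) {
        System.out.println(relabel("VBD", "^VB(.*)$", "$1"));      // D
        System.out.println(relabel("VBN", "^VB(.*)$", "verb/$1")); // verb/N
        System.out.println(relabel("NNP", "^VB(.*)$", "$1"));      // NNP (no match, label unchanged)
    }
}
```

The first call mirrors the example above, where "VBD" becomes "D"; the second mirrors the /^VB(.*)$/verb\/$1/ form from the syntax section.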
Apply insert (shown here with inserting a node, but could also be a tree):
S < (NP < (NNP=name !$- DET))
insert (DET Ms.) $+ name
Result:
(ROOT
(S
(NP (DET Ms.)
(NNP Maria_Eugenia_Ochoa_Garcia))
(VP (VBD was)
(VP (VBN arrested)
(PP (IN in)
(NP (NNP May)))))
(. .)))
The pattern matches the NNP node that is directly dominated by an NP
(which is directly dominated by an S) and is not a direct right sister
of a DET. Thus, the (DET Ms.) node is inserted immediately to the left
of that NNP node, as specified by "$+ name". "$+" is the location and
"name" describes what node the location is with respect to.
Note: Tsurgeon will re-search for matches after each run of the script;
thus, cycles may occur, causing the program to not terminate. The key
is to write patterns that match prior to the changes you would like to
make but that do not match afterwards. If the clause "!$- DET" had been
left out in this example, Tsurgeon would have matched the pattern after
every insert operation, causing an infinite number of DETs to be added.
Apply move:
VP=verb < PP=prep
move prep $- verb
Result:
(ROOT
(S
(NP (NNP Maria_Eugenia_Ochoa_Garcia))
(VP (VBD was)
(VP (VBN arrested)))
(PP (IN in)
(NP (NNP May)))
(. .)))
The PP is moved out of the VP that dominates it and added as a direct right
sister of the VP. As for insert, "$-" specifies the location for prep while
"verb" specifies what that location is relative to.
Note: "move" is a macro operation that deletes the given node and then inserts
it. "move" does not use prune, and thus any branches that now lack terminals will
remain rather than being removed.
Apply replace:
S < (NP=name < NNP)
replace name (NP (DET A) (NN woman))
Result:
(ROOT
(S
(NP (DET A)
(NN woman))
(VP (VBD was)
(VP (VBN arrested)
(PP (IN in)
(NP (NNP May)))))
(. .)))
"name" is matched to an NP that is dominated by an S and dominates an NNP, and
a new subtree ("(NP (DET A) (NN woman))") is added in the place where "name" was.
Note: This operation is vulnerable to falling into an infinite loop. See the note
concerning the "insert" operation and how patterns are matched.
Apply adjoin:
S < (NP=name < NNP)
adjoin (NP (DET A) (NN woman) NP@) name
Result:
(ROOT
(S
(NP (DET A)
(NN woman)
(NP (NNP Maria_Eugenia_Ochoa_Garcia)))
(VP (VBD was)
(VP (VBN arrested)
(PP (IN in)
(NP (NNP May)))))
(. .)))
First, the NP is matched to the NP dominating the NNP tag. Then, the specified
tree ("(NP (DET A) (NN woman) NP@)") is placed in that location. The "@" symbol
specifies that the children of the original NP node ("name") are to be placed
as children of a new NP node that is directly to the right of (NN woman). If
the specified tree were "(NP (DET A) (NN woman) VP@)" then the child
(NNP Maria_Eugenia_Ochoa_Garcia) would appear under a VP. Exactly one "@" node
must appear in the specified tree in order to indicate where to place the node
from the original tree.
Apply adjoinH:
S < (NP=name < NNP)
adjoinH ((NP (DET A) (NN woman) NP@)) name
Result:
(ROOT
(S
(NP (NP (DET A)
(NN woman)
(NP (NNP Maria_Eugenia_Ochoa_Garcia))))
(VP (VBD was)
(VP (VBN arrested)
(PP (IN in)
(NP (NNP May)))))
(. .)))
This operation differs from adjoin in that it retains the named node (in this
case, "name"). The named node is made the root of the specified tree, resulting
in two NP nodes dominating the DET in this example whereas only one was present
in the previous example. Note that the specified tree is wrapped in an extra
pair of parentheses in order to show the syntax for retaining the named node.
If the extra parentheses were not there and the specified tree was, for example,
(VP (DET A) (NN woman) NP@), the VP would be ignored in order to retain an NP as
the root. Thus, in this case, "adjoinH (VP (DET A) (NN woman) NP@) name" and
"adjoinH ((DET A) (NN woman) NP@) name" both produce the same tree:
(ROOT
(S
(NP (DET A)
(NN woman)
(NP (NNP Maria_Eugenia_Ochoa_Garcia)))
(VP (VBD was)
(VP (VBN arrested)
(PP (IN in)
(NP (NNP May)))))
(. .)))
Apply adjoinF:
S < (NP=name < NNP)
adjoinF (NP(DET A) (NN woman) @) name
Result:
(ROOT
(S
(NP (DET A)
(NN woman)
(NP (NNP Maria_Eugenia_Ochoa_Garcia)))
(VP (VBD was)
(VP (VBN arrested)
(PP (IN in)
(NP (NNP May)))))
(. .)))
This operation is very similar to adjoin and adjoinH, but this time the original
named node ("name" in this case) is kept as the foot of the adjoined tree and remains
the parent of its children. Thus, no node label needs to be given in front of the "@" and if
one is given, it will be ignored. For instance, "adjoinF (NP(DET A) (NN woman) VP@) name"
would still produce the same tree as above, despite the VP preceding the @.
Apply coindex:
NP=node < NNP=name
coindex node name
Result:
(ROOT
(S
(NP-1 (NNP-1 Maria_Eugenia_Ochoa_Garcia))
(VP (VBD was)
(VP (VBN arrested)
(PP (IN in)
(NP-2 (NNP-2 May)))))
(. .)))
This causes the named nodes to be numbered such that all nodes that are part
of the same match have the same number and all matches have distinct new names.
We had two instances of an NP dominating an NNP in this example, and they were
renamed such that NP-i < NNP-i for each match, with 1 <= i <= number of matches.
\ No newline at end of file
......@@ -41,7 +41,7 @@ public class PBInstance
public String toString()
{
StringBuilder build = new StringBuilder();
build.append(treePath); build.append(" ");
build.append(treeId); build.append(" ");
build.append(predId); build.append(" ");
......
<
......@@ -23,20 +23,19 @@
*/
package jubilee.propbank;
import jdsl.core.ref.NodeTree;
import jubilee.hindi.HDUtil;
import jubilee.toolkit.JBToolkit;
import jubilee.treebank.TBReader;
import jubilee.treebank.TBTree;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Scanner;
import java.util.StringTokenizer;
import jdsl.core.ref.NodeTree;
import jubilee.hindi.HDUtil;
import jubilee.toolkit.JBToolkit;
import jubilee.treebank.TBReader;
import jubilee.treebank.TBTree;
/**
* 'PBReader' reads a Propbank annotation file and stores all information from both treebank
* and annotation into vectors.
......@@ -76,43 +75,82 @@ public class PBReader
private TBTree p_tree;
private String s_treeDir;
private int i_currIdx; // index of the current tree
private HashMap<String,String> m_context;
private int i_lastEditIdx; // index of the last edited tree
// private HashMap<String,String> m_context;
// XXX We assume that context is constant for each task