Optimization Issues in Machine Learning of Coreference Resolution

Véronique Hoste

Complete manuscript

pdf (1.2M)
ps (2.0M)

Title page

pdf (43K)
ps (331K)


  1. Abstract

    pdf (25K)
    ps (424K)
  2. Introduction

    This thesis is about the automatic resolution of coreference using machine learning techniques. Coreference resolution is becoming increasingly popular in natural language processing (NLP) research, and it is a key task in applications such as machine translation, automatic summarization and information extraction, for which text understanding is of crucial importance. When people communicate, they aim for cohesion. Text is therefore "not just a string of sentences. It is not simply a large grammatical unit, something of the same kind as a sentence, but differing from it in size--a sort of supersentence, a semantic unit" (Halliday and Hasan 1976, p. 291). Coreference, in which the interpretation of an element in conversation depends on a previously mentioned element, is one possible technique to achieve this cohesion, a technique to construct that supersentence. Through the use of shorter or alternative linguistic structures which refer to previously mentioned elements in spoken or written text, coherent communication can be achieved. Good text understanding largely depends on the correct resolution of these coreferential relations.
    In this introductory chapter, we provide a definition of coreference and anaphora and discuss existing knowledge-based and corpus-based approaches to the task of automatic coreference resolution. The remainder of the chapter introduces the present study, lists the central research objectives and gives an overview of this thesis.
    pdf (87K)
    ps (460K)
  3. Coreferentially annotated corpora

    In the experiments reported in this thesis, we use two inductive learning methods, viz. memory-based learning and rule induction, to resolve coreferential relations between nominal constituents. Since these corpus-based methods depend on the quality of the corpora they are trained on, this chapter discusses the importance of coreference annotation. Section 2.1 introduces the topic of coreference annotation. In Section 2.2 and Section 2.3, we introduce the corpora we will use for our experiments: the well-known and widely used MUC-6 and MUC-7 corpora for English and the newly developed KNACK-2002 corpus for Dutch. Section 2.2 describes the MUC-6 and MUC-7 annotation markup, the annotated relations and the resulting training and test corpora. Section 2.3 has a similar setup but focuses on the distinctive features of the KNACK-2002 annotation guidelines. Section 2.4 discusses the problem of inter-annotator agreement.
    pdf (116K)
    ps (471K)
  4. Information sources

    In supervised learning of coreference resolution, one is given a training set containing labeled instances. These instances consist of attribute/value pairs which contain possibly disambiguating information for the classifier, whose task it is to accurately predict the class of novel instances. A good set of features is crucial for the success of the resolution system. An ideal feature vector consists of features which are all highly informative and which can lead the classifier to optimal performance. This implies that irrelevant features should be avoided, since the learner can have difficulty in distinguishing them from the relevant features when making predictions. Furthermore, it is important to keep the attribute noise as low as possible, since errors in the feature vector can heavily affect the predictions.
    This chapter deals with the problem of the selection of information sources for the resolution of coreferential relations. The first section (3.1) discusses the preparation of the data sets. We describe the different preprocessing steps that were taken for the construction of the training and test corpora. We briefly mention the problem of the selection of positive and negative instances and the related problem of skewed class distributions (Chapter 7 deals extensively with the problem of highly skewed training data). In Section 3.1.3, we explore the use of three different data sets, viz. one for the pronouns, one for the named entities and a third data set for the other noun phrases, instead of one single data set. Section 3.2 gives an overview of the information sources which have been used in other work on coreference resolution. In this overview, we focus on the shallow information sources which can easily be computed. We continue with a description of the different features which we used for our experiments.
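    As a rough illustration of how labeled instances with attribute/value pairs can be built, the following Python sketch pairs each noun phrase with the noun phrases preceding it and labels each pair as positive or negative. The `Mention` fields, the two features and the pairing of every NP with all preceding NPs are assumptions made for the example, not the feature set or sampling scheme of the thesis.

```python
from dataclasses import dataclass


@dataclass
class Mention:
    """A noun phrase; all fields are illustrative assumptions."""
    text: str
    sentence: int   # index of the sentence containing the NP
    entity: str     # gold coreference chain identifier


def make_instances(mentions):
    """Pair each NP with every preceding NP and label the pair."""
    instances = []
    for j, anaphor in enumerate(mentions):
        for antecedent in mentions[:j]:
            # Two hypothetical shallow features stored as attribute/value pairs.
            features = {
                "sent_dist": anaphor.sentence - antecedent.sentence,
                "string_match": anaphor.text.lower() == antecedent.text.lower(),
            }
            label = "pos" if anaphor.entity == antecedent.entity else "neg"
            instances.append((features, label))
    return instances


mentions = [
    Mention("Mary", 0, "e1"),
    Mention("the thesis", 0, "e2"),
    Mention("she", 1, "e1"),
    Mention("the corpus", 2, "e3"),
]
instances = make_instances(mentions)
n_pos = sum(1 for _, label in instances if label == "pos")
n_neg = len(instances) - n_pos
print(n_pos, n_neg)
```

    Even in this toy document, negative instances outnumber positive ones, which hints at why skewed class distributions arise in this kind of pairwise setup.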
    pdf (174K)
    ps (555K)
  5. Machine learning of coreference resolution

    We will continue in this chapter with a description of the machine learners which operate on the basis of the feature vectors explained in the previous chapter.
    This chapter consists of two main parts. The first three sections introduce the term 'bias' and the two machine learning packages which we will use in our experiments: the memory-based learning package TIMBL, and the rule induction package RIPPER. In the second part, Section 4.4, we describe the general setup of our experiments, discuss the different classifier performance measures and apply the two methods to the MUC-6/-7 and KNACK-2002 validation data sets.
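    To give a feel for memory-based learning, the sketch below stores training instances and classifies a new instance by majority vote among its nearest neighbours under a simple overlap distance. It is a simplified illustration and not TIMBL's implementation: it uses no feature weighting, and the feature names and toy data are hypothetical.

```python
from collections import Counter


def overlap_distance(a, b):
    """Number of feature positions on which two instances disagree."""
    return sum(1 for x, y in zip(a, b) if x != y)


def knn_classify(memory, query, k=1):
    """Majority vote over the k stored instances nearest to the query."""
    ranked = sorted(memory, key=lambda item: overlap_distance(item[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy memory of (features, class) pairs. The hypothetical features are:
# (string_match, same_sentence, anaphor_is_pronoun)
memory = [
    (("yes", "no", "no"), "coref"),
    (("yes", "yes", "no"), "coref"),
    (("no", "no", "yes"), "non-coref"),
    (("no", "yes", "no"), "non-coref"),
    (("no", "no", "no"), "non-coref"),
]
prediction = knn_classify(memory, ("yes", "no", "yes"), k=3)
print(prediction)
```

    A rule induction learner would instead compress the training data into an explicit rule set at training time, which is the eager counterpart of this lazy approach.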
    pdf (167K)
    ps (535K)
  6. Selecting the optimal information sources and algorithm settings

    In the previous chapters, we paved the way for our coreference resolution system. We constructed features which we believe to be helpful in disambiguating between coreferential and non-coreferential relations and we selected two machine learning approaches to experiment with. Furthermore, we ran an initial experiment with our coreference resolution system. In this chapter and the following chapter on genetic algorithms, we discuss some methodological issues involved in running a machine learning (of language) experiment. We will show empirically that the methodology currently used in comparative studies of machine learning of language often leads to results that are methodologically debatable. In this chapter, we consider at length the importance of feature selection and of algorithm parameter optimization, and we apply both optimization passes to our coreference resolution data sets.
    pdf (209K)
    ps (742K)
  7. Genetic algorithms for optimization

    In the previous chapter, we showed that a proper comparative experiment requires extensive optimization and that the performance increase obtained by this optimization is considerable. In the feature selection experiments, we observed the large effect feature selection can have on classifier performance. In the parameter optimization experiments, we observed large deviations which confirm the necessity of parameter optimization. In these previous experiments, we explored feature selection while keeping the parameters constant and we explored parameter optimization while keeping the feature vector unchanged. We did not consider the interaction between feature selection and parameter optimization.
    We now proceed to the next optimization step in a set of experiments performing joint feature selection and parameter optimization. Joint feature selection and parameter optimization is essentially an optimization problem which involves searching the space of all possible feature subsets and parameter settings to identify the combination that is optimal or near-optimal. Due to the combinatorially explosive nature of this type of experiment, a computationally feasible way of optimization has to be found. This chapter investigates the use of a wrapper-based approach to feature selection using a genetic algorithm in conjunction with our two learning methods, TIMBL and RIPPER. In Section 6.1, we give an introduction to genetic algorithms. Section 6.2 discusses the implementation details for running the experiments and gives experimental results on the three data sets. We conclude this chapter with a summary and discussion.
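    The idea of searching feature subsets and parameter settings jointly can be sketched with a small genetic algorithm in Python. Each individual encodes a feature bit mask plus one parameter gene. Everything specific here is an assumption made for illustration: the number of features, the parameter values, the genetic operators, and above all the `fitness` function, which stands in for the cross-validated performance of a learner in a real wrapper setup.

```python
import random

N_FEATURES = 6
K_VALUES = [1, 3, 5, 7]  # hypothetical values of one learner parameter


def random_individual(rng):
    """Feature bit mask followed by an index into K_VALUES."""
    mask = [rng.randint(0, 1) for _ in range(N_FEATURES)]
    return mask + [rng.randrange(len(K_VALUES))]


def fitness(ind):
    """Stand-in for cross-validated learner performance (assumption).
    Rewards features 0-2, penalizes features 3-5, prefers K_VALUES[2]."""
    mask, k_gene = ind[:N_FEATURES], ind[N_FEATURES]
    score = sum(mask[:3]) - sum(mask[3:])
    return score + (1 if k_gene == 2 else 0)


def crossover(a, b, rng):
    """One-point crossover of two parents."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(ind, rng, rate=0.1):
    """Flip feature bits and resample the parameter gene with a small probability."""
    child = list(ind)
    for i in range(N_FEATURES):
        if rng.random() < rate:
            child[i] = 1 - child[i]
    if rng.random() < rate:
        child[N_FEATURES] = rng.randrange(len(K_VALUES))
    return child


def evolve(generations=30, pop_size=20, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # truncation selection
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children  # parents survive unchanged
    return max(population, key=fitness)


best = evolve()
print(best[:N_FEATURES], K_VALUES[best[N_FEATURES]], fitness(best))
```

    Because each fitness evaluation in a real wrapper requires training and testing the learner, the population size and number of generations determine the computational cost of the search.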
    pdf (142K)
    ps (567K)
  8. The problem of imbalanced data sets

    A general goal of classifier learning is to learn a model on the basis of training data which makes as few errors as possible when classifying previously unseen test data. Many factors can affect the success of a classifier: the specific 'bias' of the classifier, the selection and the size of the data set, the choice of algorithm parameters, the selection and representation of information sources and the possible interaction between all these factors. In the previous chapters, we experimentally showed for the eager learner RIPPER and the lazy learner TIMBL that the performance differences due to algorithm parameter optimization, feature selection, and the interaction between the two easily overwhelm the performance differences between the two algorithms in their default representation. We showed how we improved their performance by optimizing their algorithmic settings and by selecting the most informative information sources.
    In this chapter, our focus shifts away from the feature handling level and the algorithmic level to the sample selection level. We investigate whether performance is hindered by the imbalanced class distribution in our data sets and we explore different strategies to cope with this skew. In Section 7.1, we introduce the problem of learning from imbalanced data. In the two following sections, we discuss different strategies for dealing with skewed class distributions. In Section 7.2, we discuss several proposals made in the machine learning literature for dealing with skewed data. In Section 7.3, we narrow our scope to the problem of class imbalances when learning coreference resolution. In the remainder of the chapter, we focus on our experiments for handling the class imbalances in the MUC-6, MUC-7 and KNACK-2002 data sets.
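    One commonly used strategy for skewed data is random down-sampling of the majority class. The Python sketch below illustrates that general idea only; this summary does not state which strategies the chapter evaluates, and the function name, the `ratio` parameter and the toy data are assumptions.

```python
import random


def downsample(instances, majority_label, ratio=1.0, seed=0):
    """Randomly drop majority-class instances until
    n_majority <= ratio * n_minority."""
    rng = random.Random(seed)
    majority = [inst for inst in instances if inst[1] == majority_label]
    minority = [inst for inst in instances if inst[1] != majority_label]
    keep = min(len(majority), int(ratio * len(minority)))
    balanced = minority + rng.sample(majority, keep)
    rng.shuffle(balanced)
    return balanced


# Toy data set: 95 negative and 5 positive (features, label) instances.
data = [((i,), "neg") for i in range(95)] + [((i,), "pos") for i in range(5)]
balanced = downsample(data, majority_label="neg", ratio=2.0)
print(len(balanced))
```

    Down-sampling discards potentially useful negative instances, so whether it helps has to be checked empirically for each learner and data set.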
    pdf (181K)
    ps (718K)
  9. Testing

    In all previous chapters, we reported cross-validation results on the training data. Defining the anaphora resolution process as a classification problem, however, involves the use of a two-step procedure. In a first step, the classifier (in our case TIMBL or RIPPER) decides on the basis of the information learned from the training set whether the combination of a given anaphor and its candidate antecedent in the test set is classified as a coreferential link. Since each NP in the test set is linked with several preceding NPs, this implies that one single anaphor can be linked to more than one antecedent, which for its part can also refer to multiple antecedents, and so on. Therefore, a second step is taken, which involves the selection of one coreferential link per anaphor.
    In the previous chapters, we focused on the first step by trying to reach the optimal result through feature selection, algorithm parameter optimization and different sampling techniques. In this chapter, we move away from the instance level and concentrate on the coreferential chains. This requires a new experimental setup (Section 8.1) with a new evaluation procedure (Section 8.2). In Section 8.3, we report the results of TIMBL and RIPPER on the different data sets. Section 8.4 describes the main observations from a qualitative error analysis on a selection of English and Dutch documents.
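    The two-step procedure can be sketched as follows: given the pairs that a classifier has labeled as coreferential, keep one antecedent per anaphor and then follow the selected links to form chains. Selecting the link with the highest classifier confidence (and the closest antecedent on ties) is an assumption made for this example; the summary only states that one link per anaphor is selected. Mentions are represented by hypothetical integer positions in the text.

```python
def select_antecedents(links):
    """links: (anaphor, antecedent, confidence) triples for pairs classified
    as coreferential. Keep one antecedent per anaphor: the most confident
    link, preferring the closer (larger-position) antecedent on ties."""
    best = {}
    for anaphor, antecedent, conf in links:
        current = best.get(anaphor)
        if current is None or (conf, antecedent) > (current[1], current[0]):
            best[anaphor] = (antecedent, conf)
    return {anaphor: antecedent for anaphor, (antecedent, _) in best.items()}


def build_chains(selected):
    """Follow anaphor -> antecedent links to group mentions into chains.
    Antecedents precede anaphors, so following links always terminates."""
    def root(mention):
        while mention in selected:
            mention = selected[mention]
        return mention

    chains = {}
    for mention in set(selected) | set(selected.values()):
        chains.setdefault(root(mention), set()).add(mention)
    return sorted(sorted(chain) for chain in chains.values())


links = [(3, 0, 0.9), (3, 1, 0.6), (5, 3, 0.7), (5, 2, 0.7), (6, 4, 0.8)]
selected = select_antecedents(links)
chains = build_chains(selected)
print(selected, chains)
```

    Evaluating at the level of chains rather than individual pairs is what motivates the separate evaluation procedure described in this chapter.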
    pdf (169K)
    ps (545K)
  10. Conclusion

    In this thesis, we presented a machine learning approach to the resolution of coreferential relations between nominal constituents in Dutch. It is the first automatic resolution approach proposed for this language. In order to enable a corpus-based strategy, we first annotated a corpus of Dutch news magazine text, KNACK-2002, with coreferential information for pronominal, proper noun and common noun coreferences. A separate learning module was built for each of these NP types. The main motivation for this approach was that the information sources which are important for the resolution of the coreferential relations differ per NP type. This approach was not only applied to Dutch, for which no comparative results are yet available, but also to the well-known English MUC-6 and MUC-7 data sets.
    Coreference and the task of coreference resolution were the main points of interest in Chapters 2 and 3 and in Chapter 8 on testing. In the chapters in between, we focused on the methodological issues which arise when performing a machine learning of coreference resolution experiment, or more broadly, a machine learning of language experiment. In the following two sections, we discuss the main observations from the research questions formulated in Section 1.3.
    pdf (79K)
    ps (451K)
  11. References

    pdf (76K)
    ps (461K)


  1. Manual for annotation of coreferences in Dutch newspaper text

    pdf (163K)
    ps (511K)
  2. Ripper rules for the MUC-6 "Proper nouns" data set

    pdf (28K)
    ps (423K)
  3. Three MUC-7 documents for which a qualitative error analysis has been carried out

    pdf (64K)
    ps (469K)
  4. Three KNACK-2002 documents for which a qualitative error analysis has been carried out

    pdf (30K)
    ps (422K)