CMCRC Honours Projects in Text Mining

Overview

This document contains a list of possible honours projects in financial text mining with the Capital Markets Cooperative Research Centre (CMCRC).

The Capital Markets Cooperative Research Centre

CMCRC is a $100 million facility backed by the Australian Federal Government with a track record of commercial success in developing capital markets technologies.  In addition to finance and information technology researchers from 6 universities, the CMCRC consortium consists of 21 industry partners including securities exchanges and related technology and data providers.

http://www.cmcrc.com

The Projects

The proposed projects are listed here.  All projects will be carried out in conjunction with the CMCRC and could involve CMCRC industrial partners.  Furthermore, all projects may attract scholarships of $5000.  Keep an eye on this page for updates concerning partners and scholarships associated with particular projects.

Supervision

Possible supervisors are listed here.  Each project will have two Macquarie staff as official supervisors.  In addition, each project will have a third supervisor from the CMCRC.

Contact

Feel free to email with any questions:

  • Ben Hachey <bhachey AT cmcrc DOT com>
  • Jean-Yves Delort <jydelort AT cmcrc DOT com>
  • Diego Mollá Aliod <diego DOT molla-aliod AT mq DOT edu DOT au>

 



Aggregation and Presentation for Trading and Surveillance

Explaining price-sensitivity based on classifier decisions
Text classification technology provides a means for predicting whether a document is price-sensitive (i.e., whether it will impact stock prices).  However, it does not provide a means for verification of system predictions.  The goal of this project is to use sentence extraction techniques from the summarisation literature to provide a brief explanation of classifier output so that a human analyst can quickly verify whether a prediction is sound before acting on it.
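As an illustration of the idea only, the sketch below picks the sentences that contribute most to a classifier decision.  It assumes a linear classifier whose per-term weights are available as a dictionary; the names split_sentences, score and explain are hypothetical, and the sentence splitter is deliberately naive.  It is one possible starting point, not the method the project prescribes.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def score(sentence, weights):
    """Sum the (assumed) classifier weights of the terms in a sentence."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return sum(weights.get(token, 0.0) for token in tokens)


def explain(text, weights, k=2):
    """Return the k highest-scoring sentences, kept in document order.

    Identical sentences are treated as one, so slightly more than k
    sentences can be returned if the text repeats itself.
    """
    sentences = split_sentences(text)
    ranked = sorted(sentences, key=lambda s: score(s, weights), reverse=True)
    top = set(ranked[:k])
    return [s for s in sentences if s in top]
```

An analyst would then read only the returned sentences to judge whether the prediction is plausible.  A real system would need a proper sentence splitter and would take its weights from the trained classifier.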

 

Explaining finance relations
In capital markets, it is easy to identify associations between companies by testing whether their share prices tend to move together.  However, this information is not necessarily useful to a trader or surveillance analyst unless they can identify a causal or co-effect relation between the two companies.  This project will explore techniques for aggregating and presenting relation type information from various structured and unstructured sources (e.g., industry classifications, index membership, textual descriptions in news or other documents).
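The co-movement test mentioned above can be sketched as a Pearson correlation of simple returns.  This is a minimal illustration, assuming two equally long lists of closing prices sampled at the same times; the function names are hypothetical, and the choice of simple returns and Pearson correlation is an assumption rather than something the project specifies.

```python
from math import sqrt


def returns(prices):
    """Simple period-on-period returns from a list of prices."""
    return [(later - earlier) / earlier for earlier, later in zip(prices, prices[1:])]


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def co_movement(prices_a, prices_b):
    """Correlation of the returns of two price series."""
    return pearson(returns(prices_a), returns(prices_b))
```

A value near 1 indicates that the two share prices tend to move together, but it says nothing about why, which is the gap the project aims to fill.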

 

Biographical sketches for finance entities
In addition to interpreting relations between entities like companies and people, an analyst needs to be able to find out quickly who the entities are and what they do.  This project will explore the use of techniques from the summarisation literature for aggregating and presenting biographical information about finance entities from various sources (e.g., company announcements, news, forums).

 

Improving Commercial Information Retrieval for E-mail

Conceptual search
Conceptual search aims to improve the user experience by organising search results by topic or concept.  This is commonly done by automatic clustering or by grouping results with respect to a pre-determined taxonomy.  The goal of this project is to add conceptual search to an existing commercial search engine.  The work will be carried out in collaboration with Nuix, who will provide the data set.  This project has a scholarship attached of $5000.

http://www.nuix.com/

 

Explaining e-mail relations
Current search suites provide tools for visualising email networks for an organisation.  However, they do not incorporate a description of the type of relationship that exists between email partners, which would make the network easier to interpret and allow filtering by relation types.  The goal of this project is to develop a system for automatically identifying relation types for email partners. The work will either use the Enron email data set or it will be carried out in collaboration with an industry partner.

 

Tools for Assisted Curation of Financial Databases

Extracting board members from company announcements
Current financial information providers often pay human analysts to read company announcements and extract information such as board membership and top shareholders.  The goal of this project will be to induce an automatic system to perform these tasks.  The work will either use existing annotated data at the CMCRC or it will be carried out in collaboration with an industry partner.

 

Extracting profit/loss information from company announcements
Financial information providers also curate databases of profit/loss information.  The goal of this project will be to induce an automatic system to extract and normalise this information (e.g., forecasted and actual earnings figures).  The work will either use existing annotated data at the CMCRC or it will be carried out in collaboration with an industry partner.

 

General Financial Information Extraction Problems

Inducing name/term identification from meta data
The Reuters News Archive (RNA) is a large corpus of newswire data that is richly annotated at the document level with meta data such as company names.  The goal of this project is to induce a system to automatically identify such terms by mapping entities in the human-authored meta data to actual character strings in the raw text.

 

Matching company names using learnt similarity measures
Company names can be referenced in various ways in text (e.g., BHP, BHP Billiton, BHP Billiton Limited, BHP Ltd).  While named entity recognition technology is capable of identifying these mentions with high accuracy, it does not always result in a direct match to registered company names associated with stock ticker codes.  The goal of this project is to build a system that automatically matches different references to the same company.  An interesting approach is to use machine learning to train a specialised string similarity measure.
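For comparison, a fixed (not learnt) similarity measure can serve as a baseline.  The sketch below is an assumption-laden illustration: it strips a small, hypothetical list of legal suffixes and compares the remaining strings with difflib.SequenceMatcher from the Python standard library.  A learnt measure, as proposed for the project, would replace the similarity function.

```python
import re
from difflib import SequenceMatcher

# Hypothetical suffix list for illustration; a real system would need a fuller one.
LEGAL_SUFFIXES = {"limited", "ltd", "inc", "corp", "corporation", "plc", "pty"}


def normalise(name):
    """Lowercase, drop punctuation and strip trailing legal suffixes."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def similarity(a, b):
    """Character-level similarity in [0, 1] between two normalised names."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def best_match(mention, registered_names):
    """Return (name, score) for the registered name most similar to the mention."""
    scored = ((name, similarity(mention, name)) for name in registered_names)
    return max(scored, key=lambda pair: pair[1])
```

With this baseline, "BHP Billiton" and "BHP Billiton Limited" normalise to the same string, while a short mention such as "BHP Ltd" only partially matches the full registered name.  Cases like that are where a trained measure could help.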

 

Analysis of sentiment classification and market behaviour
Sentiment analysis aims to automatically determine whether a text (e.g., from company announcements, news, forums) is favourably disposed towards its subject.  The goal of this project is to investigate the relationship between sentiment engine output and market behaviour.

 



CMCRC-Macquarie Supervision

Dr Jean-Yves Delort, Research Fellow
Jean-Yves Delort has a PhD in Computer Science from the University of Pierre and Marie Curie (Paris 6). Before joining the CMCRC, Jean-Yves was Senior Lecturer in Computer Science at the University of Montpellier and was a member of the research group on Hypermedia and Human Computer Interaction.  Jean-Yves’ core interests are in information retrieval and human-computer interaction with a focus on automatic summarisation and visualisation.  Jean-Yves will serve as a first or second supervisor on the projects listed here.
http://web.science.mq.edu.au/~jydelort/


Dr Ben Hachey, Research Fellow
Ben Hachey has a PhD in Informatics from the University of Edinburgh, where he was a member of the Language Technology Group for 6 years as a Research Associate and then a student.   Ben’s core interests are in building usable text analytics tools with a focus on information extraction, automatic summarisation and information aggregation, minimally supervised machine learning and evaluation. Ben will serve as a first or second supervisor on the projects listed here.
http://web.science.mq.edu.au/~bhachey/


Macquarie Supervision

Dr Diego Mollá Aliod, Senior Lecturer
Diego Mollá-Aliod has a PhD in Linguistics from the University of Edinburgh.  Diego’s interests are centred on the application of theoretical linguistics to specific real-world problems, in particular to automated text-based question answering.  Diego may serve as a first or second supervisor on the projects listed here.
http://web.science.mq.edu.au/~diego/

 

CMCRC Project Coordination

Dr Maria Milosavljevic, CTO
Maria Milosavljevic has 20 years of experience in language technology and knowledge-based systems.  She was awarded a PhD in Language Technology from the Microsoft Research Institute at Macquarie University.  Prior to joining CMCRC, Maria held research roles at CSIRO, the University of Edinburgh and Macquarie University.  Maria's main area of interest is the use of text analytics in fraud detection and intelligence.  Maria will serve as a third supervisor on the projects listed here.
http://web.science.mq.edu.au/~mariam/