Beta's projects for MSc students 2018–2019 semester 2

General

Automatic Cantonese speech translation

Project info

1–2 students, maximum 3 groups | Research and Development project | Streams: General, Multimedia Computing | Updated 2019-01-04[5]

Description

Though systems such as Siri support Cantonese, Cantonese translation is still not supported.

Automatic translation from spoken English to spoken Cantonese is simply not available. For written Cantonese, a translation system accepted by most people does not seem to have been developed either.

There are variants of Cantonese transcription, such as Jyutping, and there is active research in Cantonese linguistics.

Given that comprehensive Cantonese resources exist, such as the Jyutping input method, the ShefCE Cantonese-English Bilingual Speech Corpus, and machine translation systems such as Moses, there must be some aspects of the language that make automatic translation difficult.

Students working on this project will carry out a comprehensive study of the issues around automatic Cantonese speech translation, and build reference implementations that address these issues, working towards a practical automatic Cantonese-to-English or English-to-Cantonese speech translator.

Keywords: Cantonese, English, translation, Jyutping, ShefCE, Moses, speech processing, cepstrum, formants, MFCC

Requirements

Deliverables

References

Although some resources here are Python libraries, there is no restriction on the languages and tools you use. Indeed, a good data analysis and machine learning project like this one often requires the use of multiple languages.

  1. Online Jyutping input method
  2. ShefCE: A Cantonese-English Bilingual Speech Corpus
  3. Moses: statistical machine translation system
  4. Audacity

Real-time Speaker Recognizer

Project info

1 student, maximum 3 groups | Development project | Streams: General, Information Security, Multimedia Computing | Updated 2017-12-18[1]

Description

Read the general instructions in the General section first.

Build a system that recognizes and labels the speakers in a recorded radio talk show or phone-in show, without the need for prior training.

A more advanced version of the system should be language-independent. It should be able to take live speech, generating output as the input is analyzed, and possibly correcting earlier outputs when necessary.

Note that the student is expected to build their own collection of training and testing data.

Experimentation using systems such as Audacity, PureData, Octave, Mathematica, or Matlab is expected.

The system should be implemented in an operating-system-independent way.

Application: some biometrics systems authenticate the user by speaker recognition. Your study may shed light on the usability of such a system.

Keywords: cepstrum, formants, MFCC, speaker diarisation
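As a possible starting point for experimenting with the cepstrum keyword above, the following is a minimal sketch that computes the real cepstrum of one frame with NumPy and reads a pitch estimate off its largest peak. The function names, the 70 to 400 Hz search range, and the synthetic test frame are assumptions made for illustration; real speech needs framing, voicing detection, and further features such as MFCCs before speakers can be told apart.

```python
import numpy as np


def real_cepstrum(frame):
    """Real cepstrum: inverse FFT of the log magnitude spectrum of a windowed frame."""
    windowed = frame * np.hamming(len(frame))
    log_magnitude = np.log(np.abs(np.fft.rfft(windowed)) + 1e-10)  # offset avoids log(0)
    return np.fft.irfft(log_magnitude, n=len(frame))


def estimate_pitch(frame, sample_rate, fmin=70.0, fmax=400.0):
    """Estimate the fundamental frequency (Hz) from the largest cepstral peak
    whose quefrency corresponds to a pitch between fmin and fmax."""
    cepstrum = real_cepstrum(frame)
    q_low = int(sample_rate / fmax)   # shortest pitch period considered, in samples
    q_high = int(sample_rate / fmin)  # longest pitch period considered, in samples
    peak = q_low + int(np.argmax(cepstrum[q_low:q_high]))
    return sample_rate / peak


# Synthetic "voiced" frame (an assumption, not real speech): ten harmonics of 125 Hz.
sample_rate = 8000
t = np.arange(1024) / sample_rate
frame = sum(np.sin(2 * np.pi * 125 * k * t) / k for k in range(1, 11))
pitch = estimate_pitch(frame, sample_rate)
print(f"estimated pitch: {pitch:.1f} Hz")
```

The same frame-by-frame computation can be prototyped in GNU Octave or Matlab before committing to an implementation language.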

Requirements

Deliverables

References

Although some resources here are Python libraries, there is no restriction on the languages and tools you use. Indeed, a good data analysis and machine learning project like this one often requires the use of multiple languages.

  1. Audacity
  2. PureData
  3. GNU Octave
  4. SciPy
  5. NumPy
  6. matplotlib
  7. Archive of Speaker Diarization project at UC Berkeley

Bird Sound Recognizer

Project info

1 student, maximum 3 groups | Development project | Streams: General, Multimedia Computing | Updated 2017-12-18[1]

Description

Read the general instructions in the General section first.

Listening to bird sounds turns out to be very important for bird watchers, as it helps them not only to locate the birds but also to identify their species.

The project is about building a system that recognizes sounds of birds.

The project involves cleaning the sound, extracting the relevant section, identifying features, and matching against a database or recognising through a neural network or other methods.
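The four steps above can be sketched end to end as follows. This is only an illustration under strong assumptions: the "calls" are synthetic tones rather than recordings, the labels species_a and species_b are hypothetical, and the energy threshold, band-energy features, and nearest-neighbour matching are simple stand-ins for whatever methods the project finally adopts.

```python
import numpy as np


def clean(signal):
    """Remove the DC offset and scale to unit peak amplitude."""
    centred = signal - signal.mean()
    return centred / np.max(np.abs(centred))


def extract_active_section(signal, frame_len=256, threshold_ratio=0.1):
    """Keep the span from the first to the last frame whose energy exceeds
    a fraction of the loudest frame's energy."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).sum(axis=1)
    active = np.flatnonzero(energy > threshold_ratio * energy.max())
    return signal[active[0] * frame_len : (active[-1] + 1) * frame_len]


def band_energy_features(section, n_bands=16):
    """Fraction of spectral energy in each of n_bands equal-width frequency bands."""
    power = np.abs(np.fft.rfft(section * np.hanning(len(section)))) ** 2
    energies = np.array([band.sum() for band in np.array_split(power, n_bands)])
    return energies / energies.sum()


def nearest_label(features, database):
    """Label of the database entry closest to features in Euclidean distance."""
    return min(database, key=lambda label: np.linalg.norm(features - database[label]))


def describe(signal):
    """Run cleaning, section extraction, and feature extraction in sequence."""
    return band_energy_features(extract_active_section(clean(signal)))


def synth_call(freq, noise=0.0, sample_rate=16000, seed=0):
    """Synthetic stand-in for a recording: silence, a pure tone, silence, plus noise."""
    rng = np.random.default_rng(seed)
    t = np.arange(sample_rate // 4) / sample_rate
    pad = np.zeros(sample_rate // 4)
    signal = np.concatenate([pad, np.sin(2 * np.pi * freq * t), pad])
    return signal + noise * rng.standard_normal(len(signal))


# Hypothetical two-species database built from clean synthetic calls.
database = {
    "species_a": describe(synth_call(2200)),
    "species_b": describe(synth_call(3700)),
}
query = synth_call(2250, noise=0.05, seed=1)
result = nearest_label(describe(query), database)
print(result)
```

Real bird calls vary over time, so a whole-section spectrum like this one discards information that frame-level features (for example MFCC sequences) would keep.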

Though it seems that standard pattern recognition techniques can be applied to solve the problem, the variation in sounds among birds of the same species and the difficulty of collecting enough samples make the project difficult in practice.

The problem is so interesting that the AI community has projects such as A.I.Experiments: Bird Sounds, which visualises and clusters similar bird sounds.

Note that the student is expected to build their own collection of training and testing data. The HKBWS Bird Call page is a good starting point. CDs of bird calls are also available on the market.

It is best to start from a small collection of sounds of a few distinct species, then expand it to 20, 50, or 100 species.

The much larger collection of North American bird sounds from The Macaulay Library of the Cornell Lab of Ornithology can be used later to see whether your recognizer can handle a different data set.

Experimentation and visualization using systems such as Audacity, PureData, GNU Octave, Mathematica, or Matlab would be the fun part of the project.

The system should be implemented in an operating-system-independent way.

Requirements

Deliverables

References

Although some resources here are Python libraries, there is no restriction on the languages and tools you use. Indeed, a good data analysis and machine learning project like this one often requires the use of multiple languages.

  1. HKBWS Bird Call page
  2. A.I.Experiments: Bird Sounds
  3. The Macaulay Library of the Cornell Lab of Ornithology
  4. Audacity
  5. PureData
  6. GNU Octave
  7. SciPy
  8. NumPy
  9. matplotlib

Sequential associations

Project info

1 student, maximum 3 groups | Research project | Streams: General, Information Security, Multimedia Computing, Financial Computing | Updated 2017-12-18[1]

Description

Read the general instructions in the General section first.

Events often happen with a cause. A cause may lead to an effect, but not always with certainty. Sometimes it is hard to say which event is the cause and which is the effect. The events may merely be correlated because they are affected by the same cause, or they may just happen together by chance.

When many Hang Seng Index (HSI) constituents rise, the Hang Seng Index tends to go up. An earthquake in one place may be followed by earthquakes in other places along a fault. When the subtropical ridge of high pressure lies further to the west, more typhoons are expected to enter the South China Sea [1]. A rise in some stock prices may cause a rise in commodity prices, and the most talked-about stocks on social media are the ones with the greatest volatility. Some of these changes are immediate (e.g., HSI), and some are delayed (e.g., earthquake). The cause of some delayed events may not be obvious (e.g., typhoon), and some seemingly correlated events may not even have a cause-effect relationship (e.g., commodity, forum) [2].

The project is about studying and applying algorithms on sequences of data to find out how they are associated with each other. The student will choose and acquire the sequences of data they are interested in, then study and apply statistical techniques, sequential data mining algorithms, or temporal classification methods to find out how the events in the sequences are associated with each other. Applications include generating warnings (earthquake case), prediction of future stock index values (HSI case), prediction of the range of the number of typhoons entering the South China Sea (typhoon case), or prediction of stock prices given market information (commodity and forum cases).
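One of the simplest statistical techniques for the immediate-versus-delayed distinction is lagged correlation. The sketch below, using only the Python standard library, searches for the delay at which one sequence best follows another. The function names and the synthetic sequences are assumptions for illustration, and a high correlation at some lag does not by itself establish cause and effect.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def best_lag(leader, follower, max_lag):
    """Return (lag, r) for the lag in 0..max_lag that maximizes the
    correlation between leader[t] and follower[t + lag]."""
    scores = {
        lag: pearson(leader[: len(leader) - lag], follower[lag:])
        for lag in range(max_lag + 1)
    }
    lag = max(scores, key=scores.get)
    return lag, scores[lag]


# Synthetic data (an assumption): the follower repeats the leader three steps
# later, with a little independent noise added.
rng = random.Random(0)
leader = [rng.gauss(0, 1) for _ in range(200)]
follower = [rng.gauss(0, 0.1) for _ in range(3)] + [
    x + rng.gauss(0, 0.1) for x in leader[:-3]
]
lag, r = best_lag(leader, follower, max_lag=10)
print(f"best lag: {lag}, correlation: {r:.3f}")
```

On real sequences the scores at every lag should be inspected, not only the maximum, since trends shared by both sequences can produce high correlations at all lags.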

Requirements

Deliverables

References

  1. Why Tropical Cyclone Recurves? PAN Chi-kin; Hong Kong Observatory 2011-09.
  2. Spurious correlations

Investment recommendation system

Project info

1–2 students, maximum 3 groups | Research project | Streams: General, Financial Computing | Updated 2017-12-18[1]

Description

Read the general instructions in the General section first.

From historical and current or near-current stock or index price data, design an algorithm that makes investment recommendations (e.g., buy, hold, sell, hedge) that would maximize the profit in a simulated environment.
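To make the idea of a simulated environment concrete, here is a minimal sketch of a recommender and a simulator. The moving-average crossover rule, the all-in or all-out trading, the absence of transaction costs, and the price list are all assumptions chosen for brevity; they are not a suggested strategy.

```python
def moving_average(prices, window, day):
    """Mean of the `window` most recent prices up to and including `day`."""
    return sum(prices[day - window + 1 : day + 1]) / window


def recommend(prices, day, short=3, long=6):
    """Recommend 'buy', 'sell', or 'hold' using only prices up to `day`."""
    if day < long - 1:
        return "hold"  # not enough history yet
    short_ma = moving_average(prices, short, day)
    long_ma = moving_average(prices, long, day)
    if short_ma > long_ma:
        return "buy"
    if short_ma < long_ma:
        return "sell"
    return "hold"


def simulate(prices, cash=1000.0):
    """Follow the recommendations day by day, trading at that day's price,
    and return the final portfolio value."""
    shares = 0.0
    for day, price in enumerate(prices):
        action = recommend(prices, day)
        if action == "buy" and cash > 0:
            shares, cash = cash / price, 0.0
        elif action == "sell" and shares > 0:
            cash, shares = shares * price, 0.0
    return cash + shares * prices[-1]


# Made-up price series: flat, then a rise, then a fall back to the start.
prices = [10, 10, 10, 10, 10, 10, 11, 12, 13, 14, 15, 14, 13, 12, 11, 10]
final_value = simulate(prices)
buy_and_hold = 1000.0 / prices[0] * prices[-1]
print(f"strategy: {final_value:.2f}, buy and hold: {buy_and_hold:.2f}")
```

Because `recommend` only ever reads prices up to the current day, the simulation avoids look-ahead, which is the main property any such test harness needs to preserve.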

Some factors the algorithm can take into consideration include the day of the month, the day of the week, the time of day, various financial indicators, and correlations between data from different time series.

Time series of non-numerical data such as news articles, Twitter feeds, or Facebook posts can be analysed to improve the accuracy of the prediction. Indeed, prior studies have reported this to be quite effective.

Note that the student is expected to build their own collection of training and testing data.

When you survey the literature on how well existing systems perform, be very careful about accuracy claims of better than 70%, especially when the system uses only historical numerical data or financial indicators.
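One way to put an accuracy claim in context is to compare it with a trivial baseline on the same test period. The sketch below is an illustration with made-up prices and hypothetical function names: it labels each day by whether the next price is higher, then measures how accurate a predictor is that always outputs the majority direction seen in the training period.

```python
def direction_labels(prices):
    """1 if the next price is higher than the current one, else 0."""
    return [1 if nxt > cur else 0 for cur, nxt in zip(prices, prices[1:])]


def accuracy(predictions, labels):
    """Fraction of predictions that equal the corresponding label."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def majority_baseline(train_labels, test_labels):
    """Accuracy on the test labels of always predicting the most common
    training label (ties go to 'up')."""
    majority = 1 if sum(train_labels) * 2 >= len(train_labels) else 0
    return accuracy([majority] * len(test_labels), test_labels)


# Made-up, mostly rising price series.
prices = [100, 101, 102, 101, 103, 104, 105, 104, 106,
          107, 108, 109, 108, 110, 111, 112, 113]
labels = direction_labels(prices)
train, test = labels[:8], labels[8:]
baseline = majority_baseline(train, test)
print(f"always-majority baseline accuracy: {baseline:.3f}")
```

In a trending period such a baseline can score well above 70%, so a reported accuracy is only informative relative to what a trivial predictor achieves on the same data.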

Students producing good results early in the project can be linked up with real investment recommendation firms, which can provide real-time data for testing and may implement your algorithms for real.

Requirements

Deliverables

References

Although some resources here are Python libraries, there is no restriction on the languages and tools you use. Indeed, a good data analysis and machine learning project like this one often requires the use of multiple languages.

  1. Apache Spark
  2. Apache Spark MLlib
  3. SciPy
  4. NumPy
  5. matplotlib
  6. Weka
  7. scikit-learn