Chinese character and word analysis in daily essays

The 500 most frequently used Chinese characters account for over 75% of character occurrences in newspapers and books. This project analyzes the patterns and usage of characters and words (mainly two-character words) in daily essays.
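As a rough illustration of how such a coverage figure can be measured, the sketch below counts Chinese characters in a text and reports the share of occurrences covered by the n most frequent ones. The function names `is_cjk` and `coverage_of_top_n` and the sample sentence are hypothetical, and the check only covers the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), which is an assumption rather than part of the project.

```python
from collections import Counter


def is_cjk(ch: str) -> bool:
    """Return True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def coverage_of_top_n(text: str, n: int) -> float:
    """Fraction of Chinese character occurrences covered by the n most frequent characters."""
    counts = Counter(ch for ch in text if is_cjk(ch))
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no Chinese characters in the text
    covered = sum(count for _, count in counts.most_common(n))
    return covered / total


# Hypothetical sample text; a real measurement would use a large corpus.
sample = "我們的計劃是分析我們每天閱讀的文章"
print(coverage_of_top_n(sample, 3))
```

On a large newspaper or book corpus, calling `coverage_of_top_n(corpus, 500)` would give the kind of figure quoted above.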

Project Background

Chinese is one of the most widely used languages in the world, spoken by approximately 1.2 billion people. It is used by the majority of people living in Hong Kong, Mainland China and Taiwan. In the Internet age, more and more Chinese online media and social media platforms have emerged, so many articles and online discussions written in Chinese are now available. We believe these materials reflect cultural values and social trends, which makes them valuable to study.

However, unlike English, Chinese is written without spaces between words. This makes it difficult for software to extract individual words from an article and to carry out subsequent analysis. To build software that can process Chinese articles effectively, we need to design dedicated algorithms and scripts backed by a database. We found very few word retrieval tools for Chinese, so we decided to work on one.
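To make the difficulty concrete, here is a minimal sketch of forward maximum matching, a common dictionary-based baseline for Chinese word segmentation. It is not the project's own algorithm; the function name `segment`, the `max_len` parameter and the tiny dictionary are assumptions made for illustration.

```python
def segment(text, dictionary, max_len=4):
    """Greedy forward maximum matching.

    At each position, take the longest dictionary word that starts there;
    fall back to a single character when nothing matches.
    """
    words = []
    i = 0
    while i < len(text):
        longest = min(max_len, len(text) - i)
        for size in range(longest, 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary; a real system needs a large lexicon.
dictionary = {"研究", "研究生", "生命", "起源"}
print(segment("研究生命起源", dictionary))
```

With this dictionary the greedy match yields 研究生 / 命 / 起源 ("graduate student / fate / origin") instead of the intended 研究 / 生命 / 起源 ("study / life / origin"). This kind of ambiguity is why simple matching is not enough without spaces to mark word boundaries.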

We believe that by developing software to analyze the patterns and usage of characters and words in daily use, we will be able to produce a lot of meaningful data for subsequent studies, for example in cultural research.

Project Objective

The ultimate goal of the project is to develop an online Chinese word analysis tool that displays statistical data (e.g. frequency of use, word relationships, domain of origin) for each Chinese word.

To achieve this, we have the following sub-objectives:

  • As the core of our project, we will design a segmentation algorithm that effectively breaks a Chinese essay down into individual words that best preserve its meaning.

  • We will develop a backend natural language processing tool that uses the segmentation algorithm to receive Chinese text from different systems and produce individual words from the text.

  • We will develop a backend analysis tool that works with the database to calculate frequency counts, word relationships, sources of words, etc.

  • We will develop a web spider that automatically fetches Chinese articles and other content from the Internet for subsequent analysis.

  • We will design a database that stores Chinese words, the statistics associated with them, and the web content retrieved by the spider in an organized and efficient way.

  • We will design a front-end webpage that lets users enter website links for analysis, covering Chinese character and word usage as well as the pattern of an essay.
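The analysis and database sub-objectives could fit together roughly as in the sketch below. It takes text that has already been segmented into words, counts word frequencies and adjacent word pairs (one simple stand-in for "word relationship"), and stores both in SQLite. The table names `word_freq` and `word_pair`, the function names, and the choice of SQLite are all assumptions for illustration, not the project's actual design.

```python
import sqlite3
from collections import Counter


def build_statistics(words):
    """Count each word and each pair of adjacent words in a segmented text."""
    freq = Counter(words)
    pairs = Counter(zip(words, words[1:]))
    return freq, pairs


def store(conn, freq, pairs):
    """Write the counts into two hypothetical tables, replacing existing rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS word_freq "
        "(word TEXT PRIMARY KEY, count INTEGER NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS word_pair "
        "(first TEXT, second TEXT, count INTEGER NOT NULL, "
        "PRIMARY KEY (first, second))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO word_freq VALUES (?, ?)", list(freq.items())
    )
    conn.executemany(
        "INSERT OR REPLACE INTO word_pair VALUES (?, ?, ?)",
        [(a, b, c) for (a, b), c in pairs.items()],
    )
    conn.commit()


# Hypothetical segmented input, as produced by a segmentation step.
words = ["我們", "喜歡", "閱讀", "我們", "喜歡", "文章"]
freq, pairs = build_statistics(words)

conn = sqlite3.connect(":memory:")  # in-memory database for the example
store(conn, freq, pairs)
for row in conn.execute("SELECT word, count FROM word_freq ORDER BY count DESC, word"):
    print(row)
```

Keeping the counting step separate from the storage step means the same statistics could be written to a different database without changing the analysis code.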

Project Proposal Interim Report Final Group Report

Project Schedule and Milestones

Milestone                                 Date of Completion   Deliverables
1. Project Initialization                 10 July 2015         -
2. Project Analysis and Design            26 September 2015    -
3. Phase 1 Deliverables                   4 October 2015       Project website; project plan
4. Raw Data Collection                    10 January 2016      Implementation: web spider; raw text preprocessing
5. Algorithm Design and Implementation    10 January 2016      Implementation: word segmentation; word tagging and categorization; information extraction
6. Phase 2 Deliverables                   11 January 2016      Preliminary implementation; interim presentation materials; interim report
7. Database Design and Implementation     28 February 2016     Implementation: structural database; word tree bank
8. Front-end Design and Implementation    31 March 2016        Implementation: web app; word analysis toolbox
9. Phase 3 Deliverables                   17 April 2016        Finalized implementation; final report
10. Final Presentation                    22 April 2016        Final presentation materials

Our Team

Supervisor: Dr. Vincent Lau


fyp15002@cs.hku.hk