Chinese character and word analysis in daily essays

About 500 characters account for over 75% of character occurrences in newspapers and books. This project analyzes the patterns and usage of characters and words (mainly two-character words) in daily essays.

Project Background

Chinese is one of the most widely used languages in the world, with approximately 1.2 billion speakers. It is used by the majority of people living in Hong Kong, Mainland China and Taiwan. In the Internet age, more and more online Chinese media and social media platforms have emerged, so many articles and online discussions written in Chinese are now available. We believe that these materials reflect the cultural values and trends of society and are therefore valuable to study.

However, unlike English, Chinese is written without spaces between words. This characteristic makes it difficult for software to retrieve every single word from an article and conduct subsequent analysis. To develop software that can process Chinese articles effectively, we need to design a special algorithm and scripts, backed by a database. We found that very few word retrieval tools exist for Chinese, so we decided to work on one.
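To illustrate the difficulty, here is a minimal sketch of one common dictionary-based approach, forward maximum matching. It is not necessarily the algorithm this project uses. The function name, the toy dictionary and the maximum word length of 4 are assumptions made for illustration only.

```python
def forward_max_match(text, dictionary, max_word_len=4):
    """Greedy segmentation: at each position, take the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to a single character.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted as a fallback.
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical toy dictionary, for illustration only.
TOY_DICTIONARY = {"我們", "喜歡", "中文", "文章"}

segments = forward_max_match("我們喜歡中文文章", TOY_DICTIONARY)
print(segments)
```

A greedy matcher like this depends entirely on the quality of its dictionary and cannot resolve ambiguous word boundaries, which is one reason a purpose-built segmentation algorithm is needed.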

We believe that by developing software to analyze the patterns and usage of characters and words in daily use, we will be able to produce a lot of meaningful data for subsequent studies, for example in the cultural area.

Project Objective

The ultimate goal of the project is to develop an online Chinese word analysis tool that displays statistical data (e.g. frequency of use, word relationships, domain origin, etc.) for each Chinese word.

To achieve this, we have the following sub-objectives:

As the core of our project, we will design a segmentation algorithm that can effectively break down a Chinese essay into the individual words that most closely represent its meaning.
We will develop a backend natural language processing tool that uses the segmentation algorithm to receive Chinese text from different systems and produce individual words from the text.
We will develop a backend analysis tool that works with the database to calculate frequency counts, word relationships, sources of words, etc.
We will develop a web spider that can automatically fetch Chinese articles and other content from the Internet for subsequent analysis.
We will design a database that stores Chinese words, the statistics associated with them, and the web content retrieved by the spider in an organized and efficient way.
We will design a front-end webpage that allows users to enter website links for analysis, covering Chinese character and word usage as well as the pattern of an essay.
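The frequency statistics mentioned in the sub-objectives can be sketched as follows. This is a minimal illustration, not the project's implementation. It assumes the text has already been segmented into words, the function names and sample data are hypothetical, and only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) is treated as Chinese characters.

```python
from collections import Counter


def count_characters(text):
    """Count Chinese characters (basic CJK Unified Ideographs block only)."""
    return Counter(ch for ch in text if "\u4e00" <= ch <= "\u9fff")


def count_words(words, length=2):
    """Count segmented words of a given length (two characters by default)."""
    return Counter(w for w in words if len(w) == length)


def coverage(counts, top_n):
    """Fraction of all occurrences covered by the top_n most frequent items."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(c for _, c in counts.most_common(top_n))
    return top / total


# Hypothetical segmented sample, for illustration only.
sample_words = ["我們", "喜歡", "中文", "文章", "我們", "的", "中文"]

char_counts = count_characters("".join(sample_words))
word_counts = count_words(sample_words)

print(char_counts.most_common(3))
print(word_counts.most_common(3))
print(coverage(char_counts, 1))
```

Applied to a large corpus, `coverage(char_counts, 500)` would give the share of text covered by the 500 most frequent characters.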

Project Proposal
Interim Report
Final Group Report

Project Schedule and Milestones

Milestone	Date of Completion	Deliverables
1. Project Initialization	10 July 2015	-
2. Project Analysis and Design	26 September 2015	-
3. Phase 1 Deliverables	4 October 2015	Project website; Project plan
4. Raw Data Collection	10 January 2016	Implementation: Web spider; Raw text preprocessing
5. Algorithm Design and Implementation	10 January 2016	Implementation: Word segmentation; Word tagging and categorization; Information extraction
6. Phase 2 Deliverables	11 January 2016	Preliminary implementation; Interim presentation materials; Interim report
7. Database Design and Implementation	28 February 2016	Implementation: Structural database; Word tree bank
8. Front-end Design and Implementation	31 March 2016	Implementation: Web app; Word analysis toolbox
9. Phase 3 Deliverables	17 April 2016	Finalized implementation; Final report
10. Final Presentation	22 April 2016	Final presentation materials

Our Team

Supervisor: Dr. Vincent Lau

fyp15002@cs.hku.hk