Chinese is one of the most widely used languages in the world, spoken by approximately 1.2 billion people. It is the language of the majority of people living in Hong Kong, Mainland China and Taiwan. In the age of the Internet, more and more Chinese online media and social media platforms have arisen, so many articles and online discussions written in Chinese are now available. We believe that these materials reflect the cultural values and trends of society and are therefore valuable to study.
However, unlike English, Chinese is written without spaces between words. This characteristic makes it difficult for software to retrieve each word from an article and conduct subsequent analysis. To develop software that can process Chinese articles effectively, we need to design a special algorithm and scripts backed by a database. Yet we found that very few word retrieval tools exist for Chinese, so we decided to work on one.
We believe that by developing software to analyze the patterns and usage of characters and words in daily use, we will be able to produce a lot of meaningful data for subsequent studies, for example in the cultural area.
The ultimate goal of the project is to develop an online Chinese word analysis tool which can display statistical data (e.g. frequency of use, word relationships, domain origin, etc.) for each Chinese word.
To achieve this, we have the following sub-objectives:
As the core of our project, we will design a segmentation algorithm that can effectively break down a Chinese essay into individual words that most closely represent the intended meaning.
We will develop a backend natural language processing tool that uses the segmentation algorithm to receive Chinese text from different systems and produce individual words from the text.
We will develop a backend analysis tool that works with the database to calculate frequency counts, word relationships, sources of words, etc.
We will develop a web spider which is able to automatically fetch Chinese articles and other content from the Internet for subsequent analysis.
We will design a database which can store Chinese words, the statistics associated with those words and the web content retrieved by the spider in an organized and effective way.
We will design a front-end webpage that allows users to enter website links for analysis, covering Chinese character and word usage as well as the patterns of an essay.
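To make the segmentation and frequency-count objectives more concrete, here is a minimal sketch in Python. It is not the project's actual algorithm, which is still to be designed. It uses forward maximum matching, a simple dictionary-based baseline, and the names `LEXICON`, `segment` and `word_frequencies` as well as the toy word list are assumptions made purely for illustration.

```python
from collections import Counter

# Hypothetical toy lexicon; a real system would load a much larger word list
# from the project database.
LEXICON = {"我们", "喜欢", "研究", "中文", "分词", "研究生"}
MAX_WORD_LEN = max(len(word) for word in LEXICON)


def segment(text, lexicon=LEXICON, max_len=MAX_WORD_LEN):
    """Split text into words with forward maximum matching.

    At each position the longest lexicon word is taken; characters that start
    no known word (including punctuation) are emitted one at a time.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def word_frequencies(texts):
    """Count how often each segmented word occurs across several texts."""
    counts = Counter()
    for text in texts:
        counts.update(segment(text))
    return counts


print(segment("我们喜欢研究中文分词"))
print(word_frequencies(["我们喜欢中文", "我们研究中文"]).most_common(2))
```

A greedy matcher like this cannot resolve ambiguous boundaries, which is one reason a dedicated segmentation algorithm is the core of the project.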
Milestone | Date of Completion | Deliverables |
---|---|---|
1. Project Initialization | 10 July 2015 | - |
2. Project Analysis and Design | 26 September 2015 | - |
3. Phase 1 Deliverables | 4 October 2015 | Project website |
4. Raw Data Collection | 10 January 2016 | Implementation: |
5. Algorithm Design and Implementation | 10 January 2016 | Implementation: |
6. Phase 2 Deliverables | 11 January 2016 | Preliminary Implementation |
7. Database Design and Implementation | 28 February 2016 | Implementation: |
8. Front-end Design and Implementation | 31 March 2016 | Implementation: |
9. Phase 3 Deliverables | 17 April 2016 | Finalized Implementation |
10. Final Presentation | 22 April 2016 | Final presentation materials |