PAKDD 2014 Tutorial

Managing the Quality of Crowdsourced Databases

Reynold Cheng and Yudian Zheng
{ckcheng, ydzheng2}
The University of Hong Kong


The tutorial was held on May 16th, 8:30-10:30, in the East Gate Room (B1), Shangri-La's Far Eastern Plaza Hotel, Tainan, Taiwan, as part of the PAKDD conference.

Crowdsourcing systems, such as Amazon Mechanical Turk (AMT), CrowdFlower, and Facebook, have attracted a lot of interest from academia and industry. On these Internet-based platforms, a human worker performs jobs such as rating a physician, commenting on a product, or translating a sentence. These tasks are often difficult for a computer but easier for a human. Due to the growth of the Internet, a large amount of human-provided (or crowdsourced) information has been collected, which enables interesting applications like product recommendation and spam detection.

In the first part of the tutorial, we introduce the concepts of crowdsourcing through several successful applications in different fields, such as computer vision (CV), optical character recognition (OCR), and natural language processing (NLP). From these applications, we observe that crowdsourced tasks can be viewed along two dimensions: the format of the task and the nature of the task. The format falls into three categories: binary choice question (BCQ), multiple choice question (MCQ), and open question. The nature falls into two categories: question with ground truth and question without ground truth. We then give several examples, each corresponding to a specific category along these two dimensions.
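The two-dimensional classification above can be written down as a small data structure. The following is only a minimal sketch: the class names (`TaskFormat`, `TaskNature`, `CrowdTask`), the helper `all_categories`, and the three sample prompts are hypothetical illustrations chosen here, not examples taken from the tutorial slides.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class TaskFormat(Enum):
    """First dimension: the format of a crowdsourced task."""
    BCQ = "binary choice question"
    MCQ = "multiple choice question"
    OPEN = "open question"


class TaskNature(Enum):
    """Second dimension: whether the task has a ground truth."""
    WITH_GROUND_TRUTH = "question with ground truth"
    WITHOUT_GROUND_TRUTH = "question without ground truth"


@dataclass(frozen=True)
class CrowdTask:
    """A task placed at one point of the two-dimensional classification."""
    prompt: str
    task_format: TaskFormat
    nature: TaskNature


def all_categories():
    """Enumerate every (format, nature) combination: 3 formats x 2 natures."""
    return list(product(TaskFormat, TaskNature))


# Hypothetical sample tasks, for illustration only.
examples = [
    CrowdTask("Do these two records refer to the same person?",
              TaskFormat.BCQ, TaskNature.WITH_GROUND_TRUTH),
    CrowdTask("Which of these four photos is the most appealing?",
              TaskFormat.MCQ, TaskNature.WITHOUT_GROUND_TRUTH),
    CrowdTask("Suggest tags for this web page.",
              TaskFormat.OPEN, TaskNature.WITHOUT_GROUND_TRUTH),
]

if __name__ == "__main__":
    for task in examples:
        print(f"{task.task_format.name:4} | {task.nature.value:29} | {task.prompt}")
```

Keeping the two dimensions as separate enumerations means a task's category is simply the pair of its two values, so the six possible categories come from the cross product rather than from a hand-maintained list.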

In the second part of the tutorial, we summarize the general architecture of crowdsourcing applications, which consists of three components: answer integration, task assignment, and a database storing the crowdsourced data. We demonstrate two example applications (our work published in ICDE'13 and CIKM'13, respectively), each corresponding to a specific category in the two-dimensional classification of crowdsourced tasks. After introducing these two works, we summarize how each implements the components of the proposed general architecture. Finally, we discuss three challenges in crowdsourcing.

The objective of the tutorial is for the audience to understand the concept of crowdsourcing and how to use it on well-known crowdsourcing platforms (e.g., Amazon Mechanical Turk). We explain why a crowdsourced database can be used to store crowdsourced information, and we list several challenges for future work.

Tutorial Outline [Slides]

I Introduction (10 mins)

  • Overview of crowdsourcing and its applications

II Crowdsourcing Systems (10 mins)

  • Introduction to crowdsourcing platforms and databases

III Quality of Crowdsourced Applications (70 mins)

  • A classification of crowdsourcing tasks
  • Human effort optimization for multiple choice questions
  • Human effort optimization for tagging tasks

V Conclusions and Q&A (15 mins)


Reynold Cheng is an Associate Professor in the Department of Computer Science at the University of Hong Kong (HKU). He was an Assistant Professor at HKU in 2008-11. He received his BEng (Computer Engineering) in 1998 and his MPhil (Computer Science and Information Systems) in 2000, both from the Department of Computer Science at the University of Hong Kong. He then obtained his MSc and PhD from the Department of Computer Science at Purdue University in 2003 and 2005, respectively. Dr. Cheng was an Assistant Professor in the Department of Computing at the Hong Kong Polytechnic University during 2005-08. He was a visiting scientist at the Institute of Parallel and Distributed Systems at the University of Stuttgart during the summer of 2006. He was granted an Outstanding Young Researcher Award 2011-12 by HKU and was the recipient of the 2010 Research Output Prize in the Department of Computer Science of HKU. He also received the U21 Fellowship in 2011, and the Performance Reward from the Hong Kong Polytechnic University in 2006 and 2007. He is the Chair of the Department Research Postgraduate Committee and the Vice Chairperson of the ACM (Hong Kong Chapter). He is a member of the IEEE, the ACM, the Special Interest Group on Management of Data (ACM SIGMOD), and the UPE (Upsilon Pi Epsilon Honor Society). He is also a guest editor for a special issue of TKDE. He was a keynote speaker at the First International Workshop on Quality of Context (QuaCon ’09). He received an Outstanding Service Award at the CIKM 2009 conference. He has served as a PC member and reviewer for international conferences and journals including TODS, TKDE, TMC, VLDBJ, IS, DKE, KAIS, VLDB, ICDE, ICDM, DEXA and DASFAA.

Yudian Zheng is a first-year Ph.D. student under the supervision of Dr. Reynold Cheng. He graduated from Nanjing University in July 2013. During his undergraduate study, he joined the LAMDA group and was supervised by Dr. Yang Yu, conducting research on boosting algorithms from March 2012 to March 2013. He then completed a crowdsourcing project under the supervision of Jiannan Wang and Dr. Guoliang Li at Tsinghua University from March 2013 to September 2014. For the PAKDD tutorial, he assisted Dr. Reynold Cheng in preparing the slides.

Sample References

1. X. S. Yang, D. W. Cheung, L. Mo, R. Cheng, and B. Kao. On incentive-based tagging. In Proc. of ICDE, pages 685-696, 2013.
2. S. Lei, X. S. Yang, L. Mo, S. Maniu, and R. Cheng. iTag: Incentive-based tagging. In Proc. of ICDE (demo), 2014.
3. M. J. Franklin, D. Kossmann, T. Kraska, S. Ramesh, and R. Xin. CrowdDB: Answering queries with crowdsourcing. In Proc. of SIGMOD, pages 61-72, 2011.
4. P. G. Ipeirotis. Analyzing the Amazon Mechanical Turk marketplace. ACM XRDS, 17:16-21, 2010.
5. X. Liu, M. Lu, B. C. Ooi, Y. Shen, S. Wu, and M. Zhang. CDAS: A crowdsourcing data analytics system. PVLDB, 5(10):1040-1051, 2012.