Lecture Notes in Computer Science
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board

David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max Planck Institute for Informatics, Saarbruecken, Germany
6787
Joseph A. Konstan Ricardo Conejo José L. Marzo Nuria Oliver (Eds.)
User Modeling, Adaptation, and Personalization
19th International Conference, UMAP 2011
Girona, Spain, July 11-15, 2011
Proceedings
Volume Editors

Joseph A. Konstan
University of Minnesota, Department of Computer Science and Engineering
4-192 Keller Hall, Minneapolis, MN 55455, USA
E-mail: [email protected]

Ricardo Conejo
Universidad de Málaga, E.T.S. Ing. Informática
Bulevar Louis Pasteur, 35, 29071 Málaga, Spain
E-mail: [email protected]

José L. Marzo
University of Girona, EPS Edifici P-IV (D.208)
Campus de Montilivi, 17071 Girona, Spain
E-mail: [email protected]

Nuria Oliver
Telefonica Research R&D, Via Augusta, 177
Barcelona, Catalonia 08021, Spain
E-mail: [email protected]

ISSN 0302-9743 e-ISSN 1611-3349
ISBN 978-3-642-22361-7 e-ISBN 978-3-642-22362-4
DOI 10.1007/978-3-642-22362-4
Springer Heidelberg Dordrecht London New York
Library of Congress Control Number: 2011930852
CR Subject Classification (1998): H.3, I.2, H.4, H.5, C.2, I.4
LNCS Sublibrary: SL 3 – Information Systems and Applications, incl. Internet/Web and HCI
© Springer-Verlag Berlin Heidelberg 2011 This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com)
Preface
The 19th International Conference on User Modeling, Adaptation and Personalization (UMAP 2011) took place in Girona, Spain, during July 11–15, 2011. It was the third annual conference under the UMAP title, which resulted from the 2009 merger of the successful biennial User Modeling (UM) and Adaptive Hypermedia (AH) conference series. Over 700 researchers from 45 countries were involved in creating the technical program, either as authors or as reviewers.

The Research Paper Track of the conference was chaired by Joseph A. Konstan from the University of Minnesota, USA, and Ricardo Conejo from the Universidad de Málaga, Spain. They were assisted by an international Program Committee of 80 leading figures in the AH and UM communities as well as highly promising younger researchers. Each paper in the Research Paper Track was reviewed by three or more reviewers, with disagreements resolved through discussion among the reviewers and the summary opinion reported in a meta-review.

The conference solicited Long Research Papers of up to 12 pages, which represent original reports of substantive new research. In addition, the conference solicited Short Research Papers of up to 6 pages, whose merit was assessed more in terms of originality and importance than maturity and technical validation. The Research Paper Track received 164 submissions, with 122 in the long and 42 in the short paper category. Of these, 27 long and 6 short papers were accepted, resulting in an acceptance rate of 22.13% for long papers, 14.29% for short papers, and 20.12% overall. Many authors of rejected papers were encouraged to revise their work and to resubmit it to conference workshops or to the Poster and Demo Tracks of the conference.

The Industry Paper Track was chaired by Enrique Frias-Martinez, from Telefonica Research in Spain, and Marc Torrens, from Strands Labs in Spain.
This track covered innovative commercial implementations or applications of UMAP technologies, and experience in applying recent research advances in practice. Submissions to this track were reviewed by a separate Industry Paper Committee of 14 leading industry researchers and practitioners. Of the 8 submissions received, 3 were accepted.

The conference also included a Doctoral Consortium, a forum for PhD students to get feedback and advice from a Doctoral Consortium Committee of 17 leading UMAP researchers. The Doctoral Consortium was chaired by Julita Vassileva from the University of Saskatchewan, Canada, and Liliana Ardissono from the Università degli Studi di Torino, Italy. This track received 27 submissions, of which 15 were accepted.

The traditional Poster and Demo Session of the conference was chaired by Silvia Baldiris, Universitat de Girona, Spain; Nicola Henze, University of Hannover, Germany; and Fabian Abel, Delft University of Technology, The Netherlands. As of the time of writing, the late submission deadline is still 2
months away, and hence the number of acceptances is not yet known. We expect that this session will again feature dozens of lively posters and system demonstrations. Summaries of these presentations will be published in online adjunct proceedings.

The UMAP 2011 program also included Workshops and Tutorials that were selected by Chairs Tsvi Kuflik, University of Haifa, Israel, and Liliana Ardissono, Università degli Studi di Torino, Italy. The following tutorials were offered:

– Personalization, Persuasion, and Everything In-between, taught by Shlomo Berkovsky and Jill Freyne
– Designing Adaptive Social Applications, taught by Julita Vassileva and Jie Zhang
– Designing and Evaluating New-Generation User Modeling, taught by Federica Cena and Cristina Gena

And the following workshops were organized:

– SASWeb: Semantic Adaptive Social Web, chaired by Federica Cena, Antonina Dattolo, Ernesto William De Luca, Pasquale Lops, Till Plumbaum and Julita Vassileva
– PALE: Personalization Approaches in Learning Environments, chaired by Alexander Nussbaumer, Diana Pérez-Marín, Effie Law, Jesus G. Boticario, Milos Kravcik, Noboru Matsuda, Olga C. Santos and Susan Bull
– DEMRA: Decision Making and Recommendation Acceptance Issues in Recommender Systems, chaired by Francesco Ricci, Giovanni Semeraro, Marco de Gemmis and Pasquale Lops
– AUM: Augmenting User Models with Real-World Experiences to Enhance Personalization and Adaptation, chaired by Fabian Abel, Vania Dimitrova, Eelco Herder and Geert-Jan Houben
– UMMS: User Models for Motivational Systems: The Affective and the Rational Routes to Persuasion, chaired by Floriana Grasso, Jaap Ham and Judith Masthoff
– TRUM: Trust, Reputation and User Modeling, chaired by Julita Vassileva and Jie Zhang
– ASTC: Adaptive Support for Team Collaboration, chaired by Alexandros Paramythis, Lydia Lau, Stavros Demetriadis, Manolis Tzagarakis and Styliani Kleanthous
– UMADR: User Modeling and Adaptation for Daily Routines: Providing Assistance to People with Special and Specific Needs, chaired by Estefania Martin, Pablo A. Haya and Rosa M. Carro

Finally, the conference also featured two invited talks. The invited speakers were:

– Ricardo Baeza-Yates, Yahoo! Research, on the topic: “User Engagement: A Scientific Challenge”
– Paul Resnick, University of Michigan, on the topic: “Does Personalization Lead to Echo Chambers?”
In addition to all the contributors mentioned, we would also like to thank the Local Arrangements Chair Ramon Fabregat from the University of Girona, Spain, and the Publicity Chair Eelco Herder from the L3S Research Center in Germany. We deeply acknowledge the conscientious work of the Program Committee members and the additional reviewers, who are listed on the next pages. The conference would not have been possible without the work of many “invisible” helpers. We also gratefully acknowledge our sponsors who helped us with funding and organizational expertise: User Modeling Inc., ACM SIGART, SIGCHI and SIGIR, the Chen Family Foundation, Microsoft Research, the U.S. National Science Foundation, Springer, Telefonica de España, and the University of Girona. Finally, we want to acknowledge the use of EasyChair for the management of the review process and the preparation of the proceedings, and the help of its administrator Andrei Voronkov in implementing system enhancements that this conference had commissioned. April 2011
Joseph A. Konstan Ricardo Conejo José L. Marzo Nuria Oliver
Organization
UMAP 2011 was organized by the Institute of Informatics and Applications, Universitat de Girona, Spain, in cooperation with User Modeling Inc., ACM/SIGIR, ACM/SIGCHI and ACM/SIGART. The conference took place during July 11–15, 2011 at the Centre Cultural la Mercè, City Hall of Girona, Spain.
Organizing Committee

General Co-chairs
José L. Marzo, University of Girona, Spain
Nuria Oliver, Telefonica Research, Spain

Program Co-chairs
Joseph A. Konstan, University of Minnesota, USA
Ricardo Conejo, Universidad de Málaga, Spain

Industry Track Co-chairs
Enrique Frias-Martinez, Telefonica Research, Spain
Marc Torrens, Strands Labs, Spain

Workshop and Tutorial Co-chairs
Tsvi Kuflik, University of Haifa, Israel
Liliana Ardissono, Università degli Studi di Torino, Italy

Doctoral Consortium Co-chairs
Julita Vassileva, University of Saskatchewan, Canada
Peter Brusilovsky, University of Pittsburgh, USA

Demo and Poster Co-chairs
Silvia Baldiris, Universitat de Girona, Spain
Nicola Henze, University of Hannover, Germany
Fabian Abel, Delft University of Technology, The Netherlands

Local Arrangements Chair
Ramon Fabregat, University of Girona, Spain

Publicity Chair
Eelco Herder, L3S Research Center, Germany
Research Track Program Committee

Kenro Aihara, National Institute of Informatics, Japan
Sarabjot Anand, University of Warwick, UK
Liliana Ardissono, Università di Torino, Italy
Lora Aroyo, University of Amsterdam, The Netherlands
Helen Ashman, University of South Australia
Ryan Baker, Carnegie Mellon University, USA
Mathias Bauer, Mineway GmbH, Germany
Joseph Beck, Worcester Polytechnic Institute, USA
Shlomo Berkovsky, CSIRO, Australia
Mária Bieliková, Slovak University of Technology, Slovakia
Peter Brusilovsky, University of Pittsburgh, USA
Robin Burke, DePaul University, USA
Sandra Carberry, University of Delaware, USA
Rosa Carro, Universidad Autonoma de Madrid, Spain
David Chin, University of Hawaii, USA
Cristina Conati, University of British Columbia, Canada
Owen Conlan, University of Dublin, Ireland
Albert Corbett, Carnegie Mellon University, USA
Dan Cosley, Cornell University, USA
Alexandra Cristea, University of Warwick, UK
Hugh Davis, University of Southampton, UK
Paul De Bra, Eindhoven University of Technology, The Netherlands
Vania Dimitrova, University of Leeds, UK
Peter Dolog, Aalborg University, Denmark
Ben Du Boulay, University of Sussex, UK
Enrique Frias-Martinez, Telefonica Research, Spain
Eduardo Guzmán, Universidad de Málaga, Spain
Neil Heffernan, Worcester Polytechnic Institute
Nicola Henze, University of Hannover, Germany
Eelco Herder, L3S Research Center, Hannover, Germany
Haym Hirsh, National Science Foundation
Ulrich Hoppe, University Duisburg-Essen, Germany
Geert-Jan Houben, Delft University of Technology, The Netherlands
Dietmar Jannach, Dortmund University of Technology, Germany
Lewis Johnson, Alelo, Inc., USA
Judy Kay, University of Sydney, Australia
Alfred Kobsa, University of California, Irvine, USA
Birgitta König-Ries, University of Karlsruhe, Germany
Tsvi Kuflik, University of Haifa, Israel
Paul Lamere, Sun Laboratories, UK
James Lester, North Carolina State University, USA
Henry Lieberman, MIT, USA
Frank Linton, The MITRE Corporation, USA
Paul Maglio, IBM Research Center, USA
Brent Martin, University of Canterbury, New Zealand
Judith Masthoff, University of Aberdeen, UK
Mark Maybury, MITRE, USA
Gordon McCalla, University of Saskatchewan, Canada
Lorraine McGinty, UCD Dublin, Ireland
Alessandro Micarelli, Roma Tre University, Italy
Eva Millán, Universidad de Málaga, Spain
Bamshad Mobasher, DePaul University, USA
Yoichi Motomura, National Institute of Advanced Industrial Science and Technology, Japan
Michael O’Mahony, UCD Dublin, Ireland
Jose Luis Perez de la Cruz, Málaga University, Spain
Francesco Ricci, Free University of Bozen-Bolzano, Italy
John Riedl, University of Minnesota, USA
Jesús González Boticario, UNED, Spain
Cristobal Romero, University of Cordoba, Spain
Lloyd Rutledge, Open Universiteit Nederland, The Netherlands
Melike Sah, Trinity College, Dublin, Ireland
Olga Santos, UNED, Spain
Daniel Schwabe, Pontifícia Universidade Católica do Rio de Janeiro, Brazil
Frank Shipman, Texas A&M University, USA
Carlo Tasso, University of Udine, Italy
Loren Terveen, University of Minnesota, USA
Julita Vassileva, University of Saskatchewan, Canada
Vincent Wade, Trinity College Dublin, Ireland
Gerhard Weber, HCI Research Group, Germany
Massimo Zancanaro, Bruno Kessler Foundation, Italy
Diego Zapata-Rivera, ETS, USA
Industry Track Program Committee

Mauro Barbieri, Philips Research, The Netherlands
Mathias Bauer, Mineway, Germany
Vanessa Frias-Martinez, Telefonica Research, Madrid, Spain
Werner Geyer, IBM Research, Cambridge, MA, USA
Gustavo Gonzalez-Sanchez, Mediapro R&D, Spain
Maxim Gurevich, Yahoo Research, USA
Ido Guy, IBM Research, Haifa, Israel
Heath Hohwald, Telefonica Research, Madrid, Spain
Jose Iria, IBM Research, Zurich, Switzerland
Ashish Kapoor, Microsoft Research, Redmond, USA
George Magoulas, Birkbeck College, London, UK
Bhaskar Mehta, Google, Zurich, Switzerland
David R. Millen, IBM Research, Cambridge, USA
Qiankun Zhao, Audience Touch, Beijing, China
Doctoral Consortium Committee

Sarabjot Anand, University of Warwick, UK
Shlomo Berkovsky, CSIRO, Australia
Mária Bieliková, Slovak University of Technology, Slovakia
Peter Brusilovsky, University of Pittsburgh, USA
Susan Bull, University of Birmingham, UK
Robin Burke, DePaul University, USA
Federica Cena, Università degli Studi di Torino, Italy
David Chin, University of Hawaii, USA
Alexandra Cristea, University of Warwick, UK
Antonina Dattolo, Università di Udine, Italy
Vania Dimitrova, University of Leeds, UK
Peter Dolog, Aalborg University, Denmark
Benedict Du Boulay, University of Sussex, UK
Cristina Gena, Università di Torino, Italy
Floriana Grasso, University of Liverpool, UK
Jim Greer, University of Saskatchewan, Canada
Eelco Herder, L3S Research Center, Hannover, Germany
Indratmo Indratmo, Grant MacEwan University, Canada
Dietmar Jannach, Dortmund University of Technology, Germany
Judith Masthoff, University of Aberdeen, UK
Cecile Paris, CSIRO, Australia
Liana Razmerita, Copenhagen Business School, Denmark
Katharina Reinecke, University of Zurich, Switzerland
Marcus Specht, Open Universiteit, The Netherlands
Julita Vassileva, University of Saskatchewan, Canada
Stephan Weibelzahl, National College of Ireland
Additional Reviewers

Fabian Abel, Delft University of Technology, The Netherlands
Mohd Anwar, University of Saskatchewan, Canada
Michal Barla, Slovak University of Technology, Slovakia
Claudio Biancalana, University of Rome III, Italy
Kristy Elizabeth Boyer, North Carolina State University, USA
Janez Brank, J. Stefan Institute, Slovenia
Christopher Brooks, University of Saskatchewan, Canada
Stefano Burigat, University of Udine, Italy
Fabio Buttussi, University of Udine, Italy
Alberto Cabas Vidani, University of Udine, Italy
Annalina Caputo, University of Bari, Italy
Giuseppe Carenini, University of British Columbia, Canada
Francesca Carmagnola, University of Leeds, UK
Carlos Castro-Herrera, DePaul University, USA
Fei Chen, Avaya Labs, USA
Karen Church, University College Dublin, Ireland
Ilaria Corda, University of Leeds, UK
Theodore Dalamagas, Ionian University, Greece
Lorand Dali, J. Stefan Institute, Slovenia
Carla Delgado-Battenfeld, TU Dortmund, Germany
Eyal Dim, University of Haifa, Israel
Frederico Durao, Aalborg University, Denmark
Jill Freyne, University College Dublin, Ireland
Fabio Gasparetti, University of Rome III, Italy
Mouzhi Ge, TU Dortmund, Germany
Fatih Gedikli, TU Dortmund, Germany
Jonathan Gemmell, DePaul University, USA
Yaohua Ho, NTU, Taiwan
Daniel Krause, L3S Research Center Hannover, Germany
Joel Lanir, University of Haifa, Israel
Danielle H. Lee, University of Pittsburgh, USA
I-Fan Liu, NTU, Taiwan
Pasquale Lops, University of Bari, Italy
Kevin McCarthy, University College Dublin, Ireland
Cataldo Musto, University of Bari, Italy
Inna Novalija, J. Stefan Institute, Slovenia
Ildiko Pelczer, École Polytechnique de Montréal, Canada
Jonathan Rowe, North Carolina State University, USA
Giuseppe Sansonetti, University of Rome III, Italy
Tom Schimoler, DePaul University, USA
Giovanni Semeraro, University of Bari, Italy
Yanir Seroussi, Monash University, Australia
Marian Simko, Slovak University of Technology, Slovakia
Jan Suchal, Slovak University of Technology, Slovakia
Nenad Tomasev, J. Stefan Institute, Slovenia
Giannis Tsakonas, Ionian University, Greece
Michal Tvarozek, Slovak University of Technology, Slovakia
Dimitrios Vogiatzis, NCSR Demokritos, Greece
Alan Wecker, University of Haifa, Israel
Michael Yudelson, University of Pittsburgh, USA
Table of Contents
Full Research Papers

Analyzing User Modeling on Twitter for Personalized News Recommendations ... 1
Fabian Abel, Qi Gao, Geert-Jan Houben, and Ke Tao

Ensembling Predictions of Student Knowledge within Intelligent Tutoring Systems ... 13
Ryan S.J.D. Baker, Zachary A. Pardos, Sujith M. Gowda, Bahador B. Nooraei, and Neil T. Heffernan

Creating Personalized Digital Human Models of Perception for Visual Analytics ... 25
Mike Bennett and Aaron Quigley

Coping with Poor Advice from Peers in Peer-Based Intelligent Tutoring: The Case of Avoiding Bad Annotations of Learning Objects ... 38
John Champaign, Jie Zhang, and Robin Cohen

Modeling Mental Workload Using EEG Features for Intelligent Systems ... 50
Maher Chaouachi, Imène Jraidi, and Claude Frasson

Context-Dependent Feedback Prioritisation in Exploratory Learning Revisited ... 62
Mihaela Cocea and George D. Magoulas

Performance Comparison of Item-to-Item Skills Models with the IRT Single Latent Trait Model ... 75
Michel C. Desmarais

Hybrid User Preference Models for Second Life and OpenSimulator Virtual Worlds ... 87
Joshua Eno, Gregory Stafford, Susan Gauch, and Craig W. Thompson

Recipe Recommendation: Accuracy and Reasoning ... 99
Jill Freyne, Shlomo Berkovsky, and Gregory Smith

Tag-Based Resource Recommendation in Social Annotation Applications ... 111
Jonathan Gemmell, Thomas Schimoler, Bamshad Mobasher, and Robin Burke
The Impact of Rating Scales on User’s Rating Behavior ... 123
Cristina Gena, Roberto Brogi, Federica Cena, and Fabiana Vernero

Looking Beyond Transfer Models: Finding Other Sources of Power for Student Models ... 135
Yue Gong and Joseph E. Beck

Using Browser Interaction Data to Determine Page Reading Behavior ... 147
David Hauger, Alexandros Paramythis, and Stephan Weibelzahl

A User Interface for Semantic Competence Profiles ... 159
Martin Hochmeister and Johannes Daxböck

Open Social Student Modeling: Visualizing Student Models with Parallel Introspective Views ... 171
I-Han Hsiao, Fedor Bakalov, Peter Brusilovsky, and Birgitta König-Ries

Location-Adapted Music Recommendation Using Tags ... 183
Marius Kaminskas and Francesco Ricci

Leveraging Collaborative Filtering to Tag-based Personalized Search ... 195
Heung-Nam Kim, Majdi Rawashdeh, and Abdulmotaleb El Saddik

Modelling Symmetry of Activity as an Indicator of Collocated Group Collaboration ... 207
Roberto Martinez, Judy Kay, James R. Wallace, and Kalina Yacef

A Dynamic Sliding Window Approach for Activity Recognition ... 219
Javier Ortiz Laguna, Angel García Olaya, and Daniel Borrajo

Early Detection of Potential Experts in Question Answering Communities ... 231
Aditya Pal, Rosta Farzan, Joseph A. Konstan, and Robert E. Kraut

KT-IDEM: Introducing Item Difficulty to the Knowledge Tracing Model ... 243
Zachary A. Pardos and Neil T. Heffernan

Walk the Talk ... 255
Denis Parra and Xavier Amatriain

Finding Someone you will Like and who Won’t Reject You ... 269
Luiz Augusto Pizzato, Tomek Rej, Kalina Yacef, Irena Koprinska, and Judy Kay
Personalizing the Theme Park: Psychometric Profiling and Physiological Monitoring ... 281
Stefan Rennick-Egglestone, Amanda Whitbrook, Caroline Leygue, Julie Greensmith, Brendan Walker, Steve Benford, Holger Schnädelbach, Stuart Reeves, Joe Marshall, David Kirk, Paul Tennent, Ainoje Irune, and Duncan Rowland

Recognising and Recommending Context in Social Web Search ... 293
Zurina Saaya, Barry Smyth, Maurice Coyle, and Peter Briggs

Tags as Bridges between Domains: Improving Recommendation with Tag-Induced Cross-Domain Collaborative Filtering ... 305
Yue Shi, Martha Larson, and Alan Hanjalic

User Modeling – A Notoriously Black Art ... 317
Michael Yudelson, Philip I. Pavlik Jr., and Kenneth R. Koedinger
Short Research Papers

Selecting Items of Relevance in Social Network Feeds ... 329
Shlomo Berkovsky, Jill Freyne, Stephen Kimani, and Gregory Smith

Enhancing Traditional Local Search Recommendations with Context-Awareness ... 335
Claudio Biancalana, Andrea Flamini, Fabio Gasparetti, Alessandro Micarelli, Samuele Millevolte, and Giuseppe Sansonetti

Gender Differences and the Value of Choice in Intelligent Tutoring Systems ... 341
Derek T. Green, Thomas J. Walsh, Paul R. Cohen, Carole R. Beal, and Yu-Han Chang

Towards Understanding How Humans Teach Robots ... 347
Tasneem Kaochar, Raquel Torres Peralta, Clayton T. Morrison, Ian R. Fasel, Thomas J. Walsh, and Paul R. Cohen

Towards Open Corpus Adaptive Hypermedia: A Study of Novelty Detection Approaches ... 353
Yi-ling Lin and Peter Brusilovsky

4MALITY: Coaching Students with Different Problem-Solving Strategies Using an Online Tutoring System ... 359
Leena Razzaq, Robert W. Maloy, Sharon Edwards, David Marshall, Ivon Arroyo, and Beverly P. Woolf
Long Industry Papers

Capitalizing on Uncertainty, Diversity and Change by Online Individualization of Functionality ... 365
Reza Razavi

Prediction of Socioeconomic Levels Using Cell Phone Records ... 377
Victor Soto, Vanessa Frias-Martinez, Jesus Virseda, and Enrique Frias-Martinez

User Perceptions of Adaptivity in an Interactive Narrative ... 389
Karen Tanenbaum, Marek Hatala, and Joshua Tanenbaum
Doctoral Consortium Papers

Performance Prediction in Recommender Systems ... 401
Alejandro Bellogín

Multi-perspective Context Modelling to Augment Adaptation in Simulated Learning Environments ... 405
Dimoklis Despotakis

Situation Awareness in Neurosurgery: A User Modeling Approach ... 409
Shahram Eivazi

Adaptive Active Learning in Recommender Systems ... 414
Mehdi Elahi

Extending Sound Sample Descriptions through the Extraction of Community Knowledge ... 418
Frederic Font and Xavier Serra

Monitoring Contributions Online: A Reputation System to Model Expertise in Online Communities ... 422
Thieme Hennis

Student Procedural Knowledge Inference through Item Response Theory ... 426
Manuel Hernando

Differences on How People Organize and Think About Personal Information ... 430
Matjaž Kljun

Towards Contextual Search: Social Networks, Short Contexts and Multiple Personas ... 434
Tomáš Kramár

Using Folksonomies for Building User Interest Profile ... 438
Harshit Kumar and Hong-Gee Kim
Designing Trustworthy Adaptation on Public Displays ... 442
Ekaterina Kurdyukova
Using Adaptive Empathic Responses to Improve Long-Term Interaction with Social Robots ... 446
Iolanda Leite

A User Modeling Approach to Support Knowledge Work in Socio-computational Systems ... 450
Karin Schoefegger

Modeling Individuals with Learning Disabilities to Personalize a Pictogram-Based Instant Messaging Service ... 454
Pere Tuset-Peiró

FamCHAI: An Adaptive Calendar Dialogue System ... 458
Ramin Yaghoubzadeh

Author Index ... 463
Analyzing User Modeling on Twitter for Personalized News Recommendations Fabian Abel, Qi Gao, Geert-Jan Houben, and Ke Tao Web Information Systems, Delft University of Technology Mekelweg 4, 2628 Delft, the Netherlands {f.abel,q.gao,g.j.p.m.houben,k.tao}@tudelft.nl
Abstract. How can micro-blogging activities on Twitter be leveraged for user modeling and personalization? In this paper we investigate this question and introduce a framework for user modeling on Twitter which enriches the semantics of Twitter messages (tweets) and identifies topics and entities (e.g. persons, events, products) mentioned in tweets. We analyze how strategies for constructing hashtag-based, entity-based or topic-based user profiles benefit from semantic enrichment and explore the temporal dynamics of those profiles. We further measure and compare the performance of the user modeling strategies in context of a personalized news recommendation system. Our results reveal how semantic enrichment enhances the variety and quality of the generated user profiles. Further, we see how the different user modeling strategies impact personalization and discover that the consideration of temporal profile patterns can improve recommendation quality. Keywords: user modeling, twitter, semantics, personalization.
1   Introduction
With more than 190 million users and more than 65 million postings per day, Twitter is today the most prominent micro-blogging service available on the Web.¹ People publish short messages (tweets) about their everyday activities on Twitter, and lately researchers have been investigating the feasibility of applications such as trend analysis [1] or Twitter-based early warning systems [2]. Most research initiatives study network structures and properties of the Twitter network [3,4]. Yet, little research has been done on understanding the semantics of individual Twitter activities and inferring user interests from these activities. As tweets are limited to 140 characters, making sense of individual tweets and exploiting tweets for user modeling are non-trivial problems. In this paper we study how to leverage Twitter activities for user modeling and evaluate the quality of the user models in the context of recommending news articles. We develop a framework that enriches the semantics of individual Twitter activities and allows for the construction of different types of semantic
¹ http://techcrunch.com/2010/06/08/twitter-190-million-users/
Joseph A. Konstan et al. (Eds.): UMAP 2011, LNCS 6787, pp. 1–12, 2011. © Springer-Verlag Berlin Heidelberg 2011
user profiles. The characteristics of these user profiles are influenced by different design dimensions and design alternatives. To better understand how those factors impact the characteristics and quality of the resulting user profiles, we conduct an in-depth analysis of a large Twitter dataset of more than 2 million tweets and answer research questions such as the following: How does the semantic enrichment impact the characteristics and quality of Twitter-based profiles (see Section 4.2)? How do (different types of) profiles evolve over time? Are there any characteristic temporal patterns (see Section 4.3)? How do the different user modeling strategies impact personalization (personalized news article recommendations), and does the consideration of temporal patterns improve the accuracy of the recommendations (see Section 5)? Before studying the above research questions in Sections 4 and 5, we summarize related work in Section 2 and introduce the design dimensions of Twitter-based user modeling as well as our Twitter user modeling framework in Section 3.
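As a purely illustrative sketch of the simplest of these profile types (this is not the authors' implementation; the function name, the regular expression, and the relative-frequency weighting are our assumptions), a hashtag-based user profile can be represented as a weighted vector over the hashtags occurring in a user's tweets:

```python
import re
from collections import Counter

def hashtag_profile(tweets):
    """Build a hashtag-based profile: each hashtag is weighted by its
    relative frequency across the given tweets (weights sum to 1)."""
    tags = [t.lower() for tweet in tweets
            for t in re.findall(r"#(\w+)", tweet)]
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()} if total else {}

profile = hashtag_profile([
    "Reading about #UMAP2011 in #Girona",
    "New paper on #personalization accepted at #UMAP2011",
])
# "umap2011" occurs in 2 of the 4 extracted hashtag occurrences -> weight 0.5
```

Entity-based and topic-based profiles follow the same vector shape but replace hashtags with entities or topics obtained from the semantic enrichment step.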
2 Related Work
With the launch of Twitter in 2007, micro-blogging became highly popular, and researchers started to investigate Twitter's information propagation patterns [3] or analyzed structures of the Twitter network to identify influential users [4]. Dong et al. [5] exploit Twitter to detect and rank fresh URLs that have possibly not been indexed by Web search engines yet. Lately, Chen et al. conducted a study on recommending URLs posted in Twitter messages and compared strategies for selecting and ranking URLs that exploit the social network of a user as well as the general popularity of the URLs in Twitter [6]. Chen et al. do not investigate user modeling in detail, but represent the Twitter messages of a user by means of a bag of words. In this paper we go beyond such representations and analyze different types of profiles, such as entity-based or hashtag-based profiles. Laniado and Mika introduce metrics to describe the characteristics of hashtags – keywords starting with "#" – such as frequency, specificity or stability over time [7]. Huang et al. further characterize the temporal dynamics of hashtags via statistical measures such as the standard deviation and discover that some hashtags are used widely for a few days but then disappear quickly [8]. Recent research on collaborative filtering showed that the consideration of such temporal dynamics impacts recommendation quality significantly [9]. However, the impact of temporal characteristics of Twitter-based user profiles on recommendation performance has not been researched yet. Neither hashtag-based nor bag-of-words representations explicitly specify the semantics of tweets. To better understand the semantics of Twitter messages published during scientific conferences, Rowe et al. [10] map tweets to conference talks and exploit metadata of the corresponding research papers to enrich the semantics of tweets. Rowe et al.
mention user profiling as one of the applications that might benefit from such semantics, but do not further investigate user modeling on Twitter. In this paper we close this gap and present the first large-scale study on user modeling based on Twitter activities and, moreover, explore how different user models impact the accuracy of recommending news articles.
Analyzing User Modeling on Twitter for Personalized News Recommendations
Table 1. Design space of Twitter-based user modeling strategies

  design dimension     | design alternatives (discussed in this paper)
  ---------------------|----------------------------------------------
  profile type         | (i) hashtag-based, (ii) topic-based or (iii) entity-based
  enrichment           | (i) tweet-only-based enrichment or (ii) linkage and exploitation of external news articles (propagating entities/topics)
  temporal constraints | (i) specific time period(s), (ii) temporal patterns (weekend, night, etc.) or (iii) no constraints
3 Twitter-Based User Modeling
The user modeling strategies proposed and discussed in this paper vary in three design dimensions: (i) the type of profiles created by the strategies, (ii) the data sources exploited to further enrich the Twitter-based profiles, and (iii) the temporal constraints that are considered when constructing the profiles (see Table 1). The generic model for profiles representing users is specified in Definition 1.

Definition 1 (User Profile). The profile of a user u ∈ U is a set of weighted concepts, where the weight w(u, c) of a concept c ∈ C with respect to the given user u is computed by a certain function w:

P(u) = {(c, w(u, c)) | c ∈ C, u ∈ U}

Here, C and U denote the set of concepts and the set of users respectively. In particular, following Table 1 we analyze three types of profiles that differ with respect to the type of concepts C: entity-, topic- and hashtag-based profiles, denoted by PE(u), PT(u) and PH(u) respectively. We apply occurrence frequency as the weighting scheme w(u, c), which means that the weight of a concept is determined by the number of Twitter activities in which user u refers to concept c. For example, in a hashtag-based profile, w(u, #technology) = 5 means that u published five Twitter messages that mention "#technology". We further normalize user profiles so that the sum of all weights in a profile is equal to 1: Σ_{ci ∈ C} w(u, ci) = 1. With p(u) we refer to P(u) in its vector space model representation, where the value of the i-th dimension is w(u, ci). The user modeling strategies we analyze in this paper exploit Twitter messages posted by a user u to construct the corresponding profile P(u). When constructing entity- and topic-based user profiles, we also investigate the impact of further enrichment based on the exploitation of external data sources (see Table 1). In particular, we allow for enrichment with entities and topics extracted from news articles that are linked with Twitter messages (news-based enrichment).
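The frequency-based weighting and normalization of Definition 1 can be illustrated with a short sketch (illustrative code, not the authors' implementation; function and variable names are our own):

```python
from collections import Counter

def build_profile(concept_mentions):
    """Build a normalized, frequency-weighted user profile P(u).

    concept_mentions: list of concepts (entities, topics, or hashtags)
    referenced across a user's Twitter activities. The weight w(u, c)
    is the mention frequency, normalized so that all weights sum to 1.
    """
    counts = Counter(concept_mentions)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

profile = build_profile(["#technology", "#technology", "#sports"])
# profile["#technology"] == 2/3 and profile["#sports"] == 1/3
```

The same construction applies to all three profile types; only the concept extraction step (hashtags, entities, or topics) differs.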
In previous work [11] we presented strategies for selecting appropriate news articles for enriching users’ Twitter activities. A third dimension we investigate in the context of Twitter-based user modeling is given by temporal constraints that are considered when constructing the
profiles (see Table 1). First, we study the nature of user profiles created within specific time periods. For example, we compare profiles constructed by exploiting the complete (long-term) user history with profiles that are based only on Twitter messages published within a certain week (short-term). Second, we examine certain time frames for creating the profiles. For example, we explore the differences between user profiles created on the weekends and those created during the week to detect temporal patterns that might help to improve personalization within certain time frames. By selecting and combining the different design dimensions and alternatives we obtain a variety of different user modeling strategies that will be analyzed and evaluated in this paper.

3.1 Twitter-Based User Modeling Framework
We implemented the profiling strategies as a Twitter-based user modeling framework that is available via the supporting website of this paper [12]. Our framework features three main components:
1. Semantic Enrichment. Given the content of Twitter messages, we extract entities and topics to better understand the semantics of Twitter activities. To this end we utilize OpenCalais2, which allows for the detection and identification of 39 different types of entities such as persons, events, products or music groups, and moreover provides unique URIs for identified entities as well as for the topics, so that the meaning of such concepts is well defined.
2. Linkage. We implemented several strategies that link tweets with external Web resources, and news articles in particular. Entities and topics extracted from the articles are then propagated to the linked tweets. In [11] we showed that, for tweets which do not contain any hyperlink, the linking strategies identify related news articles with an accuracy of 70-80%.
3. User Modeling. Based on the semantic enrichment and the linkage with external news articles, our framework provides methods for generating hashtag-based, entity-based, and topic-based profiles that may adhere to specific temporal constraints (see above).
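A minimal sketch of how these three components fit together is given below. Here extract_entities and find_related_news are hypothetical stand-ins for the OpenCalais-based enrichment and the news-linkage strategies of the actual framework, which return typed, URI-identified concepts rather than plain strings:

```python
def extract_entities(text):
    # Placeholder: a real implementation would call an entity-extraction
    # service (e.g., OpenCalais) and return typed, URI-identified concepts.
    # Here we simply treat hashtags as the extracted concepts.
    return [w.strip("#.,!?").lower() for w in text.split() if w.startswith("#")]

def find_related_news(tweet, articles):
    # Placeholder linkage strategy: relate a tweet to every article that
    # shares at least one extracted concept with it.
    tweet_entities = set(extract_entities(tweet))
    return [a for a in articles if tweet_entities & set(extract_entities(a))]

def enrich(tweet, articles):
    # Propagate concepts from linked news articles back to the tweet
    # (news-based enrichment).
    entities = extract_entities(tweet)
    for article in find_related_news(tweet, articles):
        entities.extend(extract_entities(article))
    return entities
```

The enriched concept lists produced this way feed directly into the profile construction of Definition 1.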
4 Analysis of Twitter-Based User Profiles
To understand how the different user modeling design choices influence the characteristics of the generated user profiles, we applied our framework to conduct an in-depth analysis on a large Twitter dataset. The main research questions to be answered in this analysis can be summarized as follows.
1. How do the different user modeling strategies impact the characteristics of Twitter-based user profiles?
2. Which temporal characteristics do Twitter-based user profiles feature?
2 http://www.opencalais.com
Fig. 1. Comparison between different user modeling strategies with tweet-only-based or news-based enrichment: (a) entity-based profiles (distinct entities per user profile), (b) topic-based profiles (distinct topics per user profile), (c) comparison of different types of profiles (hashtag-based, entity-based (enriched), topic-based (enriched)).
4.1 Data Collection and Data Set Characteristics
Over a period of more than two months we crawled the Twitter information streams of more than 20,000 users. Together, these people published more than 10 million tweets. To allow for linkage of tweets with news articles we also monitored more than 60 RSS feeds of prominent news media such as BBC, CNN or New York Times and aggregated the content of 77,544 news articles. The number of Twitter messages posted per user follows a power-law distribution. The majority of users published fewer than 100 messages during our observation period, while only a small fraction of users wrote more than 10,000 Twitter messages, and one user even produced slightly more than 20,000 tweets (not spam). As we were also interested in analyzing temporal characteristics of the user profiles, we created a sample of 1619 users who contributed at least 20 tweets in total and at least one tweet in each month of our observation period. This sample dataset contained 2,316,204 tweets in total. We processed each Twitter message and each news article via the semantic enrichment component of our user modeling framework to identify topics and entities mentioned in the tweets and articles (see Section 3.1). Further, we applied two different linking strategies and connected 458,566 Twitter messages with news articles, of which 98,189 relations were explicitly given in the tweets by URLs that pointed to the corresponding news article. The remaining 360,377 relations were obtained by comparing the entities that were mentioned in both news articles and tweets as well as by comparing the timestamps. In previous work we showed that this method correlates news and tweets with an accuracy of more than 70% [11]. Our hypothesis is that – regardless of whether this enrichment method might introduce a certain degree of noise – it impacts the quality of user modeling and personalization positively.

4.2 Structural Analysis of Twitter-Based Profiles
To validate our hypothesis and explore how the exploitation of linked external sources influences the characteristics of the profiles generated by the different user modeling strategies, we analyzed the corresponding profiles of the 1619 users
from our sample. In Figure 1 we plot the number of distinct (types of) concepts in the topic- and entity-based profiles and show how this number is influenced by the additional news-based enrichment. For both types of profiles the enrichment with entities and topics obtained from linked news articles results in a higher number of distinct concepts per profile (see Fig. 1(a) and 1(b)). Topic-based profiles abstract much more strongly from the concrete Twitter activities than entity-based profiles. In our analysis we utilized the OpenCalais taxonomy consisting of 18 topics such as politics, entertainment or culture. The tweet-only-based user modeling strategy, which exploits merely the semantics attached to tweets, fails to create profiles for nearly 100 users (6.2%, topic-based), as for these users none of the tweets can be categorized into a topic. By enriching the tweets with topics inferred from the linked news articles we better understand the semantics of Twitter messages and succeed in creating more valuable topic-based profiles for 99.4% of the users. Further, the number of profile facets, i.e., the types of entities (e.g., person, location or event) that occur in the entity-based profiles, increases with the news-based semantic enrichment. While more than 400 Twitter-based profiles (more than 25%) feature fewer than 10 profile facets and often miss entities such as movies or products a user is concerned with, the news-based enrichment detects a greater variety of entity types. For more than 99% of the entity-based profiles enriched via news articles, the number of distinct profile facets is higher than 10. A comparison of the entity- and topic-based user modeling strategies with the hashtag-based strategy (see Fig. 1(c)) shows that the variety of entity-based profiles is much higher than that of hashtag-based profiles.
While the entity-based strategy succeeds in creating profiles for all users in our dataset, the hashtag-based approach fails for approximately 90 users (5.5%), as the corresponding people neither made use of hashtags nor re-tweeted messages that contain hashtags. Entity-based as well as topic-based profiles moreover make the semantics more explicit than hashtag-based profiles: each entity and topic has a URI which defines the meaning of the entity and topic respectively. The advantages of well-defined semantics as exposed by the topic- and entity-based profiles also depend on the application context in which these profiles are used. The results of the quantitative analysis depicted in Fig. 1 show that entity- and topic-based strategies allow for higher coverage regarding the number of users for whom profiles can be generated than the hashtag-based strategy. Further, semantic enrichment by exploiting news articles (implicitly) linked with tweets significantly increases the number of entities and topics available in the profiles and improves the variety of the profiles (the number of profile facets).

4.3 Temporal Analysis of Twitter-Based Profiles
In the temporal analysis we investigate (1) how the different types of user profiles evolve over time and (2) which temporal patterns occur in the profiles. Regarding temporal patterns we, for example, examine whether profiles generated on the weekends differ from those generated during the week.

Fig. 2. Temporal evolution of user profiles: average d1-distance of current individual user profiles with corresponding profiles in the past. (a) Different profile types over time; (b) profiles with/without news enrichment.

Similar to the click-behavior analysis by Liu et al. [13], we apply the so-called d1-distance for measuring the difference between profiles in vector representation: d1(px(u), py(u)) = Σi |px,i − py,i|. The higher d1(px(u), py(u)) ∈ [0, 2], the greater the difference between the two profiles px(u) and py(u); if two profiles are identical then d1(px(u), py(u)) = 0. Figure 2 depicts the evolution of profiles over time. It shows the average d1-distance of the current user profiles to the profiles of the same users created based on Twitter activities performed in a certain week in the past. As suggested in [13], we also plotted the distance of the current user-specific profile to the public trend (see Fig. 2(a)), i.e., the average profile of the corresponding weeks. For the three different profile types we observe that the d1-distance slightly decreases over time. For example, the difference of current profiles (first week of January 2011) to the corresponding profiles generated at the beginning of our observation period (in the week around November 18, 2010) is the highest, while the distance of current profiles to profiles computed one week before (December 30, 2010) is the lowest. It is interesting to see that the distance of the current profiles to the public trend (i) is present for all types of profiles and (ii) is rather constant over time. This suggests (i) a certain degree of individualism in Twitter and (ii) reveals that the people in our sample follow different trends rather than being influenced by the same trends. Hashtag-based profiles exhibit the strongest changes over time, as the average d1-distance to the current profile is constantly higher than for the topic- and entity-based profiles. Figure 2(b) discloses that entity-based profiles change more strongly over time than topic-based profiles when news-based enrichment is enabled. When merely analyzing Twitter messages one would come to a different (possibly wrong) conclusion (see Fig. 2(a)).
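The d1-distance can be computed directly over the dictionary representation of two normalized profiles (an illustrative sketch, not the authors' code):

```python
def d1_distance(px, py):
    """d1(px(u), py(u)) = sum over i of |px,i - py,i|.

    px, py: dicts mapping concepts to normalized weights (summing to 1).
    For normalized profiles the result lies in [0, 2]: 0 for identical
    profiles and 2 for profiles with no concept in common.
    """
    concepts = set(px) | set(py)
    return sum(abs(px.get(c, 0.0) - py.get(c, 0.0)) for c in concepts)
```

For example, d1_distance({"politics": 1.0}, {"sports": 1.0}) yields the maximum value 2, while comparing a profile with itself yields 0.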
Fig. 3. Temporal patterns: comparison between weekend and weekday profiles by means of d1-distance ((a)-(c): topic-based profiles). (a) Weekday vs. weekend profiles; (b) weekday/weekend difference vs. the difference between arbitrarily chosen days (Mon/Wed/Fri vs. Tue/Thu/Sat); (c) weekday/weekend vs. day/night difference; (d) different types of profiles (entity-based, topic-based, entity-type-based, hashtag-based).

Figure 3 illustrates temporal patterns we detected when analyzing the individual user profiles. In particular, we investigate how profiles created on the weekends differ from profiles (of the same user) created during the week. For topic-based profiles generated solely based on Twitter messages, it seems that for some users the weekend and weekday profiles differ just slightly, while for 24.9% of the users the d1-distance between the weekend and weekday profile is maximal (2 is the maximum possible value, see Fig. 3(a)). The news-based enrichment reveals, however, that the difference between weekend and weekday profiles is a rather common phenomenon: the curve draws nearer to the average difference (see dotted line); there are fewer extrema, i.e., users for whom the d1-difference is either very low or very high. Hence, it rather seems that the tweets alone are not sufficient to get a clear understanding of the users' concerns and interests. Fig. 3(b) further supports the hypothesis that weekend profiles differ significantly from weekday profiles. The corresponding distances d1(pweekend(u), pweekday(u)) are consistently higher than the differences of profiles generated on arbitrarily chosen days during the week. This weekend pattern is more pronounced than the differences between topic-based profiles generated from Twitter messages posted either during the evening (6pm-3am) or during the day (9am-5pm), as shown in Fig. 3(c). Hence, the individual topic drift – i.e., the change of topics individual users are concerned with – between day and evening/night seems to be smaller than between weekdays and weekends. The weekend pattern is coherent across the different types of profiles. Different profile types, however, imply different drifts of interests or concerns between weekend and weekdays (see Fig. 3(d)). Hashtag-based and entity-based profiles change most, while the types of entities people refer to (persons, products, etc.)
do not differ that strongly. When zooming into the individual entity-based profiles we see that entities related to leisure time and entertainment become more important on the weekends. The temporal analysis thus revealed two important observations. First, user profiles change over time: the older a profile the more it differs from the current profile of the user. The actual profile distance varies between the different types of profiles. Second, weekend profiles differ significantly from weekday profiles.
5 Exploitation of User Profiles for Personalized News Recommendations
In this section, we investigate the impact of the different user modeling strategies on recommending news articles:
1. To which degree are the profiles created by the different user modeling strategies appropriate for recommending news?
2. Can the identified (temporal) patterns be applied to improve recommendation accuracy?

5.1 News Recommender System and Evaluation Methodology
Recommending news articles is a non-trivial task, as the news items to be recommended are new by their very nature, which makes it difficult to apply collaborative filtering methods and rather calls for content-based or hybrid approaches [13]. Our main goal is to analyze and compare the applicability of the different user modeling strategies in the context of news recommendations. We do not aim to optimize recommendation quality, but are interested in comparing the quality achieved by the same recommendation algorithm when inputting different types of user profiles. Therefore we apply a lightweight content-based algorithm that recommends items according to their cosine similarity with a given user profile. We thus cast the recommendation problem into a search and ranking problem where the given user profile, which is constructed by a specific user modeling strategy, is interpreted as a query.

Definition 2 (Recommendation Algorithm). Given a user profile vector p(u) and a set of candidate news items N = {p(n1), ..., p(nn)}, which are represented via profiles using the same vector representation, the recommendation algorithm ranks the candidate items according to their cosine similarity to p(u).

Given the Twitter and news media dataset described in Section 4.1, we considered the last week of our observation period as the time frame for computing recommendations. The ground truth of news articles, which we consider as relevant for a specific user u, is obtained via the Twitter messages (including re-tweets) posted by u in this week that explicitly link to a news article published by BBC, CNN or New York Times. We thereby identified, on average, 5.5 relevant news articles for each of the 1619 users from our sample. For less than 10% of the users we found more than 20 relevant articles. The candidate set of news articles, which were published within the recommendation time frame, contained 5529 items. We then applied the different user modeling strategies together with the above algorithm (see Def. 2) and the set of candidate items to compute news recommendations for each user. The user modeling strategies were only allowed to exploit tweets published before the recommendation period. The quality of the recommendations was measured by means of MRR (Mean Reciprocal Rank), which indicates at which rank the first item relevant to the user occurs on average, and S@k (Success at rank k), which stands for the mean probability that a relevant item occurs within the top k of the ranking. In particular, we focus on S@10, as our recommendation system lists 10 recommended news articles to a user. We tested the statistical significance of our results with a two-tailed t-test, with the significance level set to α = 0.01 unless otherwise noted.

Fig. 4. Results of news recommendation experiment: (a) type of profile; (b) impact of news-based enrichment; (c) fresh vs. complete profile history; (d) weekend recommendations.
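The cosine ranking of Definition 2 and the per-user contributions to MRR and S@k can be sketched as follows (illustrative code with invented example profiles; the real system operates on the profiles constructed in Section 3, and MRR/S@k are averaged over all users):

```python
import math

def cosine(p, q):
    # Cosine similarity between two sparse profile vectors (dicts).
    dot = sum(p.get(c, 0.0) * q.get(c, 0.0) for c in p)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

def recommend(user_profile, candidates):
    # Rank candidate news items by cosine similarity to the user profile.
    # candidates: dict mapping item id -> item profile vector.
    return sorted(candidates, key=lambda n: cosine(user_profile, candidates[n]),
                  reverse=True)

def rr_and_success_at_k(ranking, relevant, k=10):
    # Reciprocal rank of the first relevant item, and whether a relevant
    # item appears in the top k (single-user contributions to MRR / S@k).
    ranks = [i + 1 for i, item in enumerate(ranking) if item in relevant]
    rr = 1.0 / ranks[0] if ranks else 0.0
    s_at_k = 1.0 if ranks and ranks[0] <= k else 0.0
    return rr, s_at_k
```

With a user profile weighted towards politics and three candidate articles on politics, sports, and music, recommend ranks the politics article first.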
5.2 Results
The results of the news recommendation experiment are summarized in Fig. 4 and validate the findings of our analysis presented in Section 4. Entity-based user modeling (with news-based enrichment), which according to the quantitative analysis (see Fig. 1) produces the most valuable profiles, allowed for the best recommendation quality and performed significantly better than hashtag-based user modeling (see Fig. 4(a)). Topic-based user modeling also performed better than the hashtag-based strategy – regarding S@10 the performance difference is significant. Since the topic-based strategy models user interests within a space of 18 different topics (e.g., politics or sports), it further required much less run-time and memory for computing user profiles and recommendations than the hashtag- and entity-based strategies, for which we limited the dimensions to the 10,000 most prominent hashtags and entities respectively. Further enrichment of topic- and entity-based profiles with topics and entities extracted from linked news articles, which results in profiles that feature more
facets and information about users’ concerns (cf. Section 4.2), also results in a higher recommendation quality (see Fig. 4(b)). Exploiting both tweets and linked news articles for creating user profiles improves MRR significantly (α = 0.05). In Section 4.3 we observed that user profiles change over time and that recent profile information approximates future profiles slightly better than old profile information. We thus compared strategies that exploited just recent Twitter activities (two weeks before the recommendation period) with the strategies that exploit the entire user history (see Fig. 4(c)). For the topic-based strategy we see that fresh user profiles are more applicable for recommending news articles than profiles that were built based on the entire user history. However, entity-based user modeling enables better recommendation quality when the complete user history is applied. Results of additional experiments [12] suggest that this is due to the number of distinct entities that occur in entity-based profiles (cf. Fig. 1): long-term profiles seem to refine preferences regarding entities (e.g. persons or events) better than short-term profiles. In Section 4.3 we further observed the so-called weekend pattern, i.e. user profiles created based on Twitter messages published on the weekends significantly differ from profiles created during the week. To examine the impact of this pattern on the accuracy of the recommendations we focused on recommending news articles during the weekend and compared the performance of user profiles created just by exploiting weekend activities with profiles created based on the complete set of Twitter activities (see Fig. 4(d)). Similarly to Fig. 4(c) we see again that the entity-based strategy performs better when exploiting the entire user history while the topic-based strategy benefits from considering the weekend pattern. 
For the topic-based strategy recommendation quality with respect to MRR improves significantly when profiles from the weekend are applied to make recommendations during the weekend.
6 Conclusions
In this paper we developed a user modeling framework for Twitter and investigated how the different design alternatives influence the characteristics of the generated user profiles. Given a large dataset consisting of more than 2 million tweets we created user profiles and revealed several advantages of semantic entity- and topic-based user modeling strategies, which exploit the full functionality of our framework, over hashtag-based user modeling. We saw that further enrichment with semantics extracted from news articles, which we correlated with the users’ Twitter activities, enhanced the variety of the constructed profiles and improved accuracy of news article recommendations significantly. Further, we analyzed the temporal dynamics of the different types of profiles. We observed how profiles change over time and discovered temporal patterns such as characteristic differences between weekend and weekday profiles. We also showed that the consideration of such temporal characteristics is beneficial to recommending news articles when dealing with topic-based profiles while for entity-based profiles we achieve better performance when incorporating the entire user history. In future work, we will further research the temporal specifics
of entity-based profiles. First results [12] suggest that users refer to certain types of entities (e.g., persons) more consistently over time than to others (e.g., movies or events). Acknowledgements. This work is partially sponsored by the EU FP7 project ImREAL (http://imreal-project.eu).
References

1. Lerman, K., Ghosh, R.: Information contagion: an empirical study of spread of news on Digg and Twitter social networks. In: Cohen, Gosling (eds.) Proc. of 4th Int. Conf. on Weblogs and Social Media (ICWSM). AAAI Press, Menlo Park (2010)
2. Sakaki, T., Okazaki, M., Matsuo, Y.: Earthquake shakes Twitter users: real-time event detection by social sensors. In: Rappa, et al. (eds.) Proc. of 19th Int. Conf. on World Wide Web (WWW), pp. 851–860. ACM, New York (2010)
3. Kwak, H., Lee, C., Park, H., Moon, S.: What is Twitter, a social network or a news media? In: Rappa, et al. (eds.) Proc. of 19th Int. Conf. on World Wide Web (WWW), pp. 591–600. ACM, New York (2010)
4. Weng, J., Lim, E.P., Jiang, J., He, Q.: TwitterRank: finding topic-sensitive influential Twitterers. In: Davison, et al. (eds.) Proc. of 3rd Int. Conf. on Web Search and Web Data Mining (WSDM), pp. 261–270. ACM, New York (2010)
5. Dong, A., Zhang, R., Kolari, P., Bai, J., Diaz, F., Chang, Y., Zheng, Z., Zha, H.: Time is of the essence: improving recency ranking using Twitter data. In: Rappa, et al. (eds.) Proc. of 19th Int. Conf. on World Wide Web (WWW), pp. 331–340. ACM, New York (2010)
6. Chen, J., Nairn, R., Nelson, L., Bernstein, M., Chi, E.: Short and tweet: experiments on recommending content from information streams. In: Mynatt, et al. (eds.) Proc. of 28th Int. Conf. on Human Factors in Computing Systems (CHI), pp. 1185–1194. ACM, New York (2010)
7. Laniado, D., Mika, P.: Making sense of Twitter. In: Patel-Schneider, P.F., Pan, Y., Hitzler, P., Mika, P., Zhang, L., Pan, J.Z., Horrocks, I., Glimm, B. (eds.) ISWC 2010, Part I. LNCS, vol. 6496, pp. 470–485. Springer, Heidelberg (2010)
8. Huang, J., Thornton, K.M., Efthimiadis, E.N.: Conversational tagging in Twitter. In: Chignell, M.H., Toms, E. (eds.) Proc. of 21st Conf. on Hypertext and Hypermedia (HT), pp. 173–178. ACM, New York (2010)
9. Koren, Y.: Collaborative filtering with temporal dynamics. In: Elder, et al. (eds.) Proc. of 15th Int. Conf. on Knowledge Discovery and Data Mining (KDD), pp. 447–456. ACM, Paris (2009)
10. Rowe, M., Stankovic, M., Laublet, P.: Mapping tweets to conference talks: a goldmine for semantics. In: Passant, et al. (eds.) Workshop on Social Data on the Web (SDoW), co-located with ISWC 2010, Shanghai, China, vol. 664, CEUR-WS.org (2010)
11. Abel, F., Gao, Q., Houben, G.J., Tao, K.: Semantic enrichment of Twitter posts for user profile construction on the Social Web. In: Antoniou, et al. (eds.) Extended Semantic Web Conference (ESWC). Springer, Heraklion (2011)
12. Abel, F., Gao, Q., Houben, G.J., Tao, K.: Supporting website: code, datasets and additional findings (2011), http://wis.ewi.tudelft.nl/umap2011/
13. Liu, J., Dolan, P., Pedersen, E.R.: Personalized news recommendation based on click behavior. In: Rich, et al. (eds.) Proc. of 14th Int. Conf. on Intelligent User Interfaces (IUI), pp. 31–40. ACM, New York (2010)
Ensembling Predictions of Student Knowledge within Intelligent Tutoring Systems

Ryan S.J.d. Baker1, Zachary A. Pardos2, Sujith M. Gowda1, Bahador B. Nooraei2, and Neil T. Heffernan2

1
Department of Social Science and Policy Studies, Worcester Polytechnic Institute, 100 Institute Road, Worcester, MA 01609 USA
[email protected],
[email protected] 2 Department of Computer Science, Worcester Polytechnic Institute, 100 Institute Road, Worcester, MA 01609 USA {zpardos,bahador,nth}@wpi.edu
Abstract. Over the last decades, there have been a rich variety of approaches towards modeling student knowledge and skill within interactive learning environments. There have recently been several empirical comparisons as to which types of student models are better at predicting future performance, both within and outside of the interactive learning environment. However, these comparisons have produced contradictory results. Within this paper, we examine whether ensemble methods, which integrate multiple models, can produce prediction results comparable to or better than the best of nine student modeling frameworks, taken individually. We ensemble model predictions within a Cognitive Tutor for Genetics, at the level of predicting knowledge action-by-action within the tutor. We evaluate the predictions in terms of future performance within the tutor and on a paper post-test. Within this data set, we do not find evidence that ensembles of models are significantly better. Ensembles of models perform comparably to or slightly better than the best individual models at predicting future performance within the tutor software. However, the ensembles of models perform marginally significantly worse than the best individual models at predicting post-test performance. Keywords: student modeling, ensemble methods, Bayesian Knowledge-Tracing, Performance Factors Analysis, Cognitive Tutor.
1 Introduction

Over the last decades, there has been a rich variety of approaches towards modeling student knowledge and skill within interactive learning environments, from Overlay Models, to Bayes Nets, to Bayesian Knowledge Tracing [6], to models based on Item-Response Theory such as Performance Factors Analysis (PFA) [cf. 13]. Multiple variants within each of these paradigms have also been created – for instance, within Bayesian Knowledge Tracing (BKT), BKT models can be fit using curve-fitting [6], expectation maximization (EM) [cf. 4, 9], Dirichlet priors on EM [14], grid search/brute force [cf. 2, 10], and BKT has been extended with contextualization of

Joseph A. Konstan et al. (Eds.): UMAP 2011, LNCS 6787, pp. 13–24, 2011.
© Springer-Verlag Berlin Heidelberg 2011
guess and slip [cf. 1, 2] and student priors [9, 10]. Student models have been compared in several fashions, both within and across paradigms, including both theoretical comparisons [1, 3, 15] and empirical comparisons at predicting future student performance [1, 2, 7, 13], as a proxy for the models’ ability to infer latent student knowledge/skills. These empirical comparisons have typically demonstrated that there are significant differences between different modeling approaches, an important finding, as increased model accuracy can improve optimization of how much practice each student receives [6]. However, different comparisons have in many cases produced contradictory findings. For instance, Pavlik and colleagues [13] found that Performance Factors Analysis predicts future student performance within the tutoring software better than Bayesian Knowledge Tracing, whether BKT is fit using expectation maximization or brute force, and that brute force performs comparably to or better than expectation maximization. By contrast, Gong et al. [7] found that BKT fit with expectation maximization performed equally to PFA and better than BKT fit with brute force. In other comparisons, Baker, Corbett, & Aleven [1] found that BKT fit with expectation maximization performed worse than BKT fit with curve-fitting, which in turn performed worse than BKT fit with brute force [2]. These comparisons have often differed in multiple fashions, including the data set used, and the type (or presence) of cross-validation, possibly explaining these differences in results. However, thus far it has been unclear which modeling approach is “best” at predicting future student performance. Within this paper, we ask whether the paradigm of asking which modeling approach is “best” is a fruitful approach at all. An alternative is to use all of the paradigms at the same time, rather than trying to isolate a single best approach. 
One popular approach for doing so is ensemble selection [16], where multiple models are selected in a stepwise fashion and integrated into a single predictor using weighted averaging or voting. Up until the recent KDD2010 student modeling competition [11, 18], ensemble methods had not been used in student modeling for intelligent tutoring systems. In this paper, we take a set of potential student knowledge/performance models and ensemble them, including approaches well-known within the student modeling community [e.g. 7, 16] and approaches tried during the recent KDD2010 student modeling competition [cf. 11, 18]. Rather than selecting from a very large set of potential models [e.g. 16], a popular approach to ensemble selection, we ensemble existing models of student knowledge, in order to specifically investigate whether combining several current approaches to student knowledge modeling is better than using the best of the current approaches, by itself. We examine the predictive power of ensemble models and original models, under cross-validation.
2 Student Models Used

2.1 Bayesian Knowledge-Tracing

Corbett & Anderson's [6] Bayesian Knowledge Tracing model is one of the most popular methods for estimating students' knowledge. It underlies the Cognitive Mastery Learning algorithm used in Cognitive Tutors for Algebra, Geometry, Genetics, and other domains [8].
The canonical Bayesian Knowledge Tracing (BKT) model assumes a two-state learning model: for each skill/knowledge component the student is either in the learned state or the unlearned state. At each opportunity to apply that skill, regardless of their performance, the student may make the transition from the unlearned to the learned state with learning probability P(T). The probability of a student going from the learned state to the unlearned state (i.e. forgetting a skill) is fixed at zero. A student who knows a skill can either give a correct performance, or slip and give an incorrect answer with probability P(S). Similarly, a student who does not know the skill may guess the correct response with probability P(G). The model has another parameter, P(L0), which is the probability of a student knowing the skill from the start. After each opportunity to apply the rule, the system updates its estimate of the student's knowledge state, P(Ln), using the evidence from the current action's correctness and the probability of learning. The equations are as follows:

P(L_{n-1} | Correct_n) = [P(L_{n-1}) · (1 − P(S))] / [P(L_{n-1}) · (1 − P(S)) + (1 − P(L_{n-1})) · P(G)]    (1)

P(L_{n-1} | Incorrect_n) = [P(L_{n-1}) · P(S)] / [P(L_{n-1}) · P(S) + (1 − P(L_{n-1})) · (1 − P(G))]    (2)

P(L_n) = P(L_{n-1} | evidence_n) + (1 − P(L_{n-1} | evidence_n)) · P(T)    (3)
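The update in equations (1)–(3) can be sketched in a few lines of Python. This is an illustrative implementation only, not the authors' code; the function and parameter names are ours.

```python
def bkt_update(p_ln, correct, p_t, p_g, p_s):
    """One Bayesian Knowledge Tracing update step.

    p_ln: current estimate P(L_{n-1}) that the skill is known.
    correct: whether the student's first attempt was correct.
    p_t, p_g, p_s: learn, guess, and slip probabilities.
    """
    if correct:
        # Eq. (1): posterior probability of knowing the skill given a correct response
        evidence = p_ln * (1 - p_s) / (p_ln * (1 - p_s) + (1 - p_ln) * p_g)
    else:
        # Eq. (2): posterior given an incorrect response
        evidence = p_ln * p_s / (p_ln * p_s + (1 - p_ln) * (1 - p_g))
    # Eq. (3): account for the chance of learning at this opportunity
    return evidence + (1 - evidence) * p_t

def predict_correct(p_ln, p_g, p_s):
    """Predicted probability that the next first attempt is correct."""
    return p_ln * (1 - p_s) + (1 - p_ln) * p_g
```

Note that a correct response raises the knowledge estimate and an incorrect response lowers it, with P(T) then pulling the estimate upward regardless of the observation.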
, , , and , are learned from The four parameters of BKT, ( existing data, historically using curve-fitting [6], but more recently using expectation maximization (BKT-EM) [5] or brute force/grid search (BKT-BF) [cf. 2, 10]. Within this paper we use BKT-EM and BKT-BF as two different models in this study. Within BKT-BF, for each of the 4 parameters all potential values at a grain-size of 0.01 are tried across all the students (for e.g.: 0.01 0.01 0.01 0.01, 0.01 0.01 0.01 0.02, 0.01 0.01 0.01 0.03…… 0.99 0.99 0.3 0.1). The sum of squared residuals (SSR) is minimized. For BKT-BF, the values for Guess and Slip are bounded in order to avoid the “model degeneracy” problems that arise when performance parameter estimates rise above 0.5 [1]. For BKT-EM the parameters were unbounded and initial parameters were set to a of 0.14, of 0.09, of 0.50, and of 0.14, a set of parameters previously found to be the average parameter values across all skills in modeling work conducted within a different tutoring system. In addition, we include three other variants on BKT. The first variant changes the data set used during fitting. BKT parameters are typically fit to all available students’ performance data for a skill. It has been argued that if fitting is conducted using only the most recent student performance data, more accurate future performance prediction can be achieved than when fitting the model with all of the data [11]. In this study, we included a BKT model trained only on a maximum of the 15 most recent student responses on the current skill, BKT-Less Data. The second variant, the BKT-CGS (Contextual Guess and Slip) model, is an extension of BKT [1]. In this approach, Guess and Slip probabilities are no longer estimated for each skill; instead, they are computed each time a student attempts to answer a new problem step, based on machine-learned models of guess and slip response properties in context (for instance, longer responses and help requests are
less likely to be slips). The same approach as in [1] is used to create the model, where 1) a four-parameter BKT model is obtained (in this case BKT-BF), 2) the four-parameter model is used to generate labels of the probability of slipping and guessing for each action within the data set, 3) machine learning is used to fit models predicting these labels, 4) the machine-learned models of guess and slip are substituted into Bayesian Knowledge Tracing in lieu of skill-by-skill labels for guess and slip, and finally 5) parameters for P(T) and P(L0) are fit. Recent research has suggested that the average Contextual Slip values from this model, combined in linear regression with standard BKT, improve prediction of post-test performance compared to BKT alone [2]. Hence, we include average Contextual Slip so far as an additional potential model. The third BKT variant, the BKT-PPS (Prior Per Student) model [9], breaks from the standard BKT assumption that each student has the same incoming knowledge, P(L0). This individualization is accomplished by modifying the prior parameter for each student with the addition of a single node and arc to the standard BKT model. The model can be simplified to only model two different student knowledge priors, a high and a low prior. No pre-test needs to be administered to determine which prior the student belongs to; instead, their first response is used. If a student answers their first question of the skill incorrectly, they are assumed to be in the low prior group. If they answer correctly, they are assumed to be in the high prior group. The prior of each group can be learned or it can be set ad hoc. The intuition behind the ad-hoc high prior, conditioned upon first response, is that it should be roughly 1 minus the probability of guess. Similarly, the low prior should be equivalent to the probability of slip.
Using PPS with a low prior value of 0.10 and a high prior value of 0.85 has been shown to lead to improved accuracy at predicting student performance [11].

2.2 Tabling

A very simple baseline approach to predicting a student's performance, given his or her past performance data, is to check what percentage of students with that same pattern of performance gave a correct answer to the next question. That is the key idea behind the student performance prediction model called Tabling [17]. In the training phase, a table is constructed for each skill: each row in that table represents a possible pattern of student performance in the n most recent data points. For n = 3 (which is the table size used in this study), we have 8 rows: 000, 001, 010, 011, 100, 101, 110, 111 (0 and 1 representing incorrect and correct responses, respectively). For each of those patterns we calculate the percentage of correct responses immediately following the pattern. For example, if we have 47 students that answered 4 questions in a row correctly (111 followed by a correct response), and 3 students that, after answering 3 correct responses, failed on the 4th one, the value calculated for row 111 is going to be 0.94 (47/(47+3)). When predicting a student's performance, this method simply looks up the row corresponding to the 3 preceding performance data points, and uses the percentage value as its prediction.

2.3 Performance Factors Analysis

Performance Factors Analysis (PFA) [12, 13] is a logistic regression model, an elaboration of the Rasch model from Item Response Theory. PFA predicts student
correctness based on the student's number of prior failures on that skill (weighted by a parameter ρ fit for each skill) and the student's number of prior successes on that skill (weighted by a parameter γ fit for each skill). An overall difficulty parameter β is also fit for each skill [13] or each item [12] – in this paper we use the variant of PFA that fits β for each skill. The PFA equation is:

m(i, j ∈ skills, s, f) = Σ_j (β_j + γ_j · s_{i,j} + ρ_j · f_{i,j})    (4)

where s_{i,j} and f_{i,j} are student i's prior successes and failures on skill j, and the probability of a correct response is the logistic function of m, 1 / (1 + e^(−m)).
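The PFA prediction of equation (4) can be sketched as follows. This is an illustrative Python sketch, not the authors' implementation; parameter fitting (logistic regression over the training data) is not shown, and all names are ours.

```python
import math

def pfa_predict(skills, successes, failures, beta, gamma, rho):
    """PFA prediction for one student response (equation 4).

    skills: the knowledge components involved in the step.
    successes/failures: dicts of the student's prior success and
    failure counts on each skill.
    beta, gamma, rho: per-skill difficulty, success, and failure
    weights (dicts keyed by skill), assumed already fit.
    """
    m = sum(beta[j] + gamma[j] * successes[j] + rho[j] * failures[j]
            for j in skills)
    # Logistic link: map the linear predictor m to a probability.
    return 1.0 / (1.0 + math.exp(-m))
```

With all counts and parameters at zero, the prediction is 0.5; prior successes raise it (positive γ) and prior failures typically lower it (negative ρ).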
2.4 CFAR

CFAR, which stands for "Correct First Attempt Rate", is an extremely simple algorithm for predicting student knowledge and future performance, utilized by the winners of the educational data KDD Cup in 2010 [18]. The prediction of student performance on a given skill is the student's average correctness on that skill, up until the current point.
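Both of these baselines, Tabling (Section 2.2) and CFAR, are simple enough to sketch directly. The Python below is illustrative only, not the authors' code; the default prediction for unseen patterns or empty histories is our assumption.

```python
from collections import defaultdict

def train_tabling(sequences, n=3):
    """Tabling: map each length-n response pattern (tuple of 0/1) to the
    fraction of correct responses immediately following that pattern.
    `sequences` is a list of per-student 0/1 response sequences for one skill."""
    counts = defaultdict(lambda: [0, 0])  # pattern -> [correct, total]
    for seq in sequences:
        for i in range(len(seq) - n):
            pattern = tuple(seq[i:i + n])
            counts[pattern][0] += seq[i + n]
            counts[pattern][1] += 1
    return {p: c / t for p, (c, t) in counts.items()}

def predict_tabling(table, recent, default=0.5):
    """Predict P(correct) from the student's n most recent responses."""
    return table.get(tuple(recent), default)

def cfar_predict(prior_responses, default=0.5):
    """CFAR: the student's average correctness on the skill so far."""
    if not prior_responses:
        return default
    return sum(prior_responses) / len(prior_responses)
```

Running the Tabling example from Section 2.2 (47 students with 1111, 3 students with 1110) yields the stated value of 0.94 for row 111.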
3 Genetics Dataset

The dataset contains in-tutor performance data of 76 students on 9 different skills, with data from a total of 23,706 student actions (entering an answer or requesting help). This data was taken from a Cognitive Tutor for Genetics [5]. This tutor consists of 19 modules that support problem solving across a wide range of topics in genetics (Mendelian transmission, pedigree analysis, gene mapping, gene regulation and population genetics). Various subsets of the 19 modules have been piloted at 15 universities in North America.
Fig. 1. The Three-Factor Cross lesson of the Genetics Cognitive Tutor
This data set is drawn from a Cognitive Tutor lesson on three-factor cross, shown in Figure 1. In three-factor cross problems, two organisms are bred together, and then the patterns of phenotypes and genotypes on a chromosome are studied. In particular, the interactions between three genes on the same chromosome are studied. During meiosis, segments of the chromosome can "cross over", going from one paired chromosome to the other, resulting in a different phenotype in the offspring than if the crossover did not occur. Within this tutor lesson, the student identifies, within the interface, the order and distance between the genes on the chromosome, by looking at the relative frequency of each pattern of phenotypes in the offspring. The student also categorizes each phenotype in terms of whether it represents the same genotype as the parents (e.g. no crossovers during meiosis), whether it represents a single crossover during meiosis, or whether it represents two crossovers during meiosis.

In this study, 76 undergraduates enrolled in a genetics course at Carnegie Mellon University used the three-factor cross module as an assignment conducted in two lab sessions lasting an hour apiece. The 76 students completed a total of 23,706 problem-solving attempts across 11,582 problem steps in the tutor. On average, each student completed 152 problem steps (SD=50). In the first session, students were split into four groups with a 2x2 design; half of students spent half their time in the first session self-explaining worked examples; half of students spent half their time in a forward modeling activity. Within this paper, we focus solely on behavior logged within the problem-solving activities, and we collapse across the original four conditions.
The problem-solving pre-test and post-test consisted of two problems (counterbalanced across tests), each consisting of 11 steps involving 7 of the 9 skills in the Three-Factor Cross tutor lesson, with two skills applied twice in each problem and one skill applied three times. The average performance on the pre-test was 0.33, with a standard deviation of 0.2. The average performance on the post-test was 0.83, with a standard deviation of 0.19. This provides evidence for substantial learning within the tutor, with an average pre-post gain of 0.50.
4 Evaluation of Models

4.1 In-Tutor Performance of Models, at Student Level

To evaluate each of the student models mentioned in Section 2, we conducted 5-fold cross-validation, at the student level. By cross-validating at the student level rather than the action level, we can have greater confidence that the resultant models will generalize to new groups of students. The variable fit to and predicted was whether each student's first attempt on a problem step was Correct or Not Correct. We used A' as the goodness metric since it is a suitable metric when the predicted variable is binary and the predictions are numerical (predictions of knowledge for each model). To facilitate statistical comparison of A' without violating statistical independence, A' values were calculated for each student separately and then averaged across students (see [2] for more detail on this statistical method). The performance of each model is given in Table 1. As can be seen, the best single model was BKT-PPS (A'=0.7029), with the second-best single model BKT-BF (A'=0.6969) and the third-best single model BKT-EM (A'=0.6957). None of these
three BKT models was significantly different than each other (the difference closest to significance was between BKT-PPS and BKT-BF, Z=0.11, p=0.91). Interestingly, in light of previous results [e.g. 16], each of these three models was significantly better than PFA (A'=0.6629) (the least significant difference was between BKT-PPS and PFA, Z=3.21, p=0.01). The worst single model was BKT-CGS (A'=0.4857), and the second-worst single model was CFAR (A'=0.5705).

Table 1. A' values averaged across students for each of the models

Model                                                                      Average A'
BKT-PPS                                                                    0.7029
Ensemble: linear regression without feature selection
  (BKT-PPS, BKT-EM, Contextual Slip)                                       0.7028
Ensemble: linear regression without feature selection (BKT-PPS, BKT-EM)    0.6973
BKT-BF                                                                     0.6969
BKT-EM                                                                     0.6957
Ensemble: linear regression without feature selection                      0.6945
Ensemble: stepwise linear regression                                       0.6943
Ensemble: logistic regression without feature selection                    0.6854
BKT-LessData (maximum 15 data points per student, per skill)               0.6839
PFA                                                                        0.6629
Tabling                                                                    0.6476
Contextual Slip                                                            0.6149
CFAR                                                                       0.5705
BKT-CGS                                                                    0.4857
These models' predictions were ensembled using three algorithms: linear regression without feature selection (i.e. including all models), stepwise linear regression (i.e. starting with an empty model, and repeatedly adding the model that most improves fit, until no model significantly improves fit), and logistic regression without feature selection (i.e. including all models). When using stepwise regression, we discovered that for each fold, the first three models added to the ensemble were BKT-PPS, BKT-EM, and Contextual Slip. In order to test these features alone, we turned off feature selection and tried linear regression ensembling using only these three features, and linear regression ensembling using only BKT-PPS and BKT-EM (the first two models added). Interestingly, these restricted ensembles appeared to result in better A' than the full-model ensembles, although the difference was not statistically significant (comparing the 3-model linear regression vs. the full linear regression without feature selection – the best of the full-model ensembles – gives Z=0.87, p=0.39). The ensembling models appeared to perform worse than BKT-PPS, the best single model. However, the difference between BKT-PPS and the worst ensembling model, logistic regression, was not statistically significant, Z=0.90, p=0.37.
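The simplest of the three ensembling schemes, linear regression over the component models' predictions, can be sketched as below. This is a minimal illustrative implementation (the cross-validation folds, the stepwise variant, and the logistic variant are omitted, and all names are ours); it solves the least-squares normal equations directly.

```python
def fit_linear_ensemble(preds, labels):
    """Least-squares weights (with an intercept) for linearly combining
    model predictions. preds: one row per student action, each row holding
    the component models' predicted P(correct). labels: 0/1 correctness."""
    rows = [[1.0] + list(r) for r in preds]          # prepend intercept column
    k = len(rows[0])
    # Normal equations A w = b, with A = X^T X and b = X^T y.
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * y for r, y in zip(rows, labels)) for i in range(k)]
    for col in range(k):                             # Gaussian elimination
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            A[r] = [a - f * c for a, c in zip(A[r], A[col])]
            b[r] -= f * b[col]
    w = [0.0] * k
    for i in reversed(range(k)):                     # back-substitution
        w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, k))) / A[i][i]
    return w

def ensemble_predict(weights, pred_row):
    """Weighted combination of one action's component-model predictions."""
    return weights[0] + sum(w * p for w, p in zip(weights[1:], pred_row))
```

In practice one would fit the weights on the training folds only and evaluate the combined predictions on the held-out students.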
In conclusion, contrary to the original hypothesis, ensembling of multiple student models using regression does not appear to improve the ability to predict student performance, when considered at the level of predicting student correctness in the tutor, cross-validated at the student level.

4.2 In-Tutor Performance of Models at Action Level

In the KDD Cup, a well-known Data Mining and Knowledge Discovery competition, the prediction ability of different models is compared based on how well each model predicts each first attempt at each problem step in the data set, instead of averaging within students and then across students. This is a more straightforward approach, although it has multiple limitations: it is less powerful for identifying individual students' learning, less usable in statistical analyses (analyses conducted at this level violate statistical independence assumptions [cf. 2]), and may bias in favor of predicting students who contribute more data. Note that we do not re-fit the models in this section; we simply re-analyze the models with a different goodness metric. When we do so, we obtain the results shown in Table 2. For this estimation method, ensembling appears to generally perform better than single models, although the difference between the best ensembling method and best single model is quite small (A'=0.7451 versus A'=0.7348). (Note that statistical results are not given, because conducting known statistical tests for A' at this level violates independence assumptions [cf. 2].) This finding suggests that how data is organized can make a difference in findings on goodness. However, once again, ensembling does not appear to make a substantial difference in predictive power.

Table 2. A' computed at the action level for each of the models

Model                                                                      A' (calculated for the whole dataset)
Ensemble: linear regression without feature selection
  (BKT-PPS, BKT-EM, Contextual Slip)                                       0.7451
Ensemble: linear regression without feature selection                      0.7428
Ensemble: stepwise linear regression                                       0.7423
Ensemble: logistic regression without feature selection                    0.7359
Ensemble: linear regression without feature selection (BKT-PPS, BKT-EM)    0.7348
BKT-EM                                                                     0.7348
BKT-BF                                                                     0.7330
BKT-PPS                                                                    0.7310
PFA                                                                        0.7277
BKT-LessData (maximum 15 data points per student, per skill)               0.7220
CFAR                                                                       0.6723
Tabling                                                                    0.6712
Contextual Slip                                                            0.6396
BKT-CGS                                                                    0.4917
4.3 Models Predicting Post-Test

Another possible level where ensembling may be beneficial is at predicting the post-test; for example, if individual models over-fit to specific details of in-tutor behavior, a multiple-model ensemble may avoid this over-fit. In predicting the post-test, we account for the number of times each skill will be utilized on the test (assuming perfect performance). Of the eight skills in the tutor lesson, one is not exercised on the test, and is eliminated from post-test prediction. Of the remaining seven skills, four are exercised once, two are exercised twice and one is exercised three times, in each of the two post-test problems. The two skills exercised twice are each counted twice and the skill exercised three times is counted three times in our attempts to predict the post-test. We utilize this approach in all attempts to predict the post-test in this paper. We use Pearson's correlation as the goodness metric since the model estimates and the post-test scores are both numerical. The correlation between each model and the post-test is given in Table 3. From the table we can see that BKT-LessData does better than all other individual models and ensemble models, achieving a correlation of 0.565 to the post-test. BKT-EM and BKT-BF perform only slightly worse than BKT-LessData, respectively achieving correlations of 0.552 and 0.548. Next, the ensemble involving just BKT-PPS and BKT-EM achieves a correlation of 0.540. The difference between BKT-LessData (the best individual model) and the best ensemble was marginally statistically significant, t(69)=1.87, p=0.07, for a two-tailed test of the significance of the difference between correlations for the same sample. At the bottom of the pack are BKT-CGS and Contextual Slip.

Table 3. Correlations between model predictions and post-test

Model                                                                      Correlation to post-test
BKT-LessData (maximum 15 data points per student, per skill)               0.565
BKT-EM                                                                     0.552
BKT-BF                                                                     0.548
Ensemble: linear regression without feature selection (BKT-PPS, BKT-EM)    0.540
CFAR                                                                       0.533
BKT-PPS                                                                    0.499
Ensemble: logistic regression without feature selection                    0.480
Ensemble: linear regression without feature selection
  (BKT-PPS, BKT-EM, Contextual Slip)                                       0.438
Ensemble: linear regression without feature selection                      0.342
PFA                                                                        0.324
Tabling                                                                    0.272
Ensemble: stepwise linear regression                                       0.254
Contextual Slip                                                            0.057
BKT-CGS                                                                    -0.237
5 Discussion and Conclusions

Within this paper, we have compared several different models for tracking student knowledge within intelligent tutoring systems, as well as some simple approaches for ensembling multiple student models at the action level. We have compared these models in terms of their power to predict student behavior in the tutor (cross-validated) and on a paper post-test. Contrary to our original hypothesis, ensembling at the action level did not result in unambiguously better predictive power across analyses than the best of the models taken individually. Ensembling appeared slightly better for flat (i.e. ignoring student) assessment of within-tutor behavior, but was equivalent to a variant of Bayesian Knowledge Tracing (BKT-PPS) for student-level cross-validation of within-tutor behavior, and marginally or non-significantly worse than other variants of Bayesian Knowledge Tracing for predicting the post-test. One possible explanation for the lack of a positive finding for ensembling is that the models may have been (overall) too similar for ensembling to function well. Another possible explanation is that the differing number of problem steps per student may have caused the current ensembling method to over-fit to students contributing larger amounts of data. Thirdly, it may be that the overall data set was too small for ensembling to perform effectively, suggesting that attempts to replicate these results should be conducted on larger data sets, in order to test this possibility. A second interesting finding was the overall strong performance of Bayesian Knowledge Tracing variants for all comparisons, with relatively little difference between different ways of fitting the classic BKT model (BKT-EM and BKT-BF) or a recent variant, BKT-PPS. More recent approaches (e.g. PFA, CFAR, Tabling) performed substantially worse than BKT variants on all comparisons.
In the case of PFA, these findings contradict other recent research [7, 13] which found that PFA performed better than BKT. However, as in that previous research, the differences between PFA and BKT were relatively small, suggesting that either of these approaches (or for that matter, most variants of BKT) are acceptable methods for student modeling. It may be of greater value for future student modeling research to attempt to investigate the question of when and why different student model frameworks have greater predictive power, rather than attempting to answer which framework is best overall. Interestingly, among BKT variants, BKT-CGS performed quite poorly. One possible explanation is that this data set had relatively little data and relatively few skills, compared to the data sets previously studied with this method [e.g. 1], another potential reason why it may make sense to study whether these results replicate within a larger data set. BKT-CGS has previously performed poorly on other data sets from this same tutor [2], perhaps for the same reason. However, the low predictive power of average contextual slip for the post-test does not contradict the finding in [2] that average contextual slip plus BKT predicts the post-test better than BKT alone; in that research, these two models were combined at the post-test level rather than within the tutor. In general, average contextual slip was a productive component of ensembling models (as the third feature selected in each fold) despite its poor individual performance, suggesting it may be a useful future component of student models.
Overall, this paper suggests that Bayesian Knowledge-Tracing remains a highly effective approach for predicting student knowledge. Our first attempts to utilize ensembling did not perform substantially better than BKT overall; however, it may be that other methods of ensembling will in the future prove more effective.

Acknowledgements. This research was supported by the National Science Foundation via grant "Empirical Research: Emerging Research: Robust and Efficient Learning: Modeling and Remediating Students' Domain Knowledge", award number DRL0910188, and by a "Graduates in K-12 Education" (GK-12) Fellowship, award number DGE0742503. We would like to thank Albert Corbett for providing the data set used in this paper, and for comments and suggestions.
References 1. Baker, R.S.J.d., Corbett, A.T., Aleven, V.: More Accurate Student Modeling through Contextual Estimation of Slip and Guess Probabilities in Bayesian Knowledge Tracing. In: Woolf, B.P., Aïmeur, E., Nkambou, R., Lajoie, S. (eds.) ITS 2008. LNCS, vol. 5091, pp. 406–415. Springer, Heidelberg (2008) 2. Baker, R.S.J.d., Corbett, A.T., Gowda, S.M., Wagner, A.Z., MacLaren, B.A., Kauffman, L.R., Mitchell, A.P., Giguere, S.: Contextual Slip and Prediction of Student Performance after Use of an Intelligent Tutor. In: De Bra, P., Kobsa, A., Chin, D. (eds.) UMAP 2010. LNCS, vol. 6075, pp. 52–63. Springer, Heidelberg (2010) 3. Brusilovsky, P., Millán, E.: User models for adaptive hypermedia and adaptive educational systems. In: Brusilovsky, P., Kobsa, A., Nejdl, W. (eds.) Adaptive Web 2007. LNCS, vol. 4321, pp. 3–53. Springer, Heidelberg (2007) 4. Chang, K.-m., Beck, J.E., Mostow, J., Corbett, A.T.: A bayes net toolkit for student modeling in intelligent tutoring systems. In: Ikeda, M., Ashley, K.D., Chan, T.-W. (eds.) ITS 2006. LNCS, vol. 4053, pp. 104–113. Springer, Heidelberg (2006) 5. Corbett, A., Kauffman, L., Maclaren, B., Wagner, A., Jones, E.: A Cognitive Tutor for Genetics Problem Solving: Learning Gains and Student Modeling. Journal of Educational Computing Research 42, 219–239 (2010) 6. Corbett, A.T., Anderson, J.R.: Knowledge Tracing: Modeling the Acquisition of Procedural Knowledge. User Modeling and User-Adapted Interaction 4, 253–278 (1995) 7. Gong, Y., Beck, J.E., Heffernan, N.T.: Comparing Knowledge Tracing and Performance Factor Analysis by Using Multiple Model Fitting Procedures. In: Aleven, V., Kay, J., Mostow, J. (eds.) ITS 2010. LNCS, vol. 6094, pp. 35–44. Springer, Heidelberg (2010) 8. Koedinger, K.R., Corbett, A.T.: Cognitive tutors: Technology bringing learning science to the classroom. In: Sawyer, K. (ed.) The Cambridge Handbook of the Learning Sciences, pp. 61–78. Cambridge University Press, New York (2006) 9. 
Pardos, Z.A., Heffernan, N.T.: Modeling Individualization in a Bayesian Networks Implementation of Knowledge Tracing. In: De Bra, P., Kobsa, A., Chin, D. (eds.) UMAP 2010. LNCS, vol. 6075, pp. 255–266. Springer, Heidelberg (2010) 10. Pardos, Z.A., Heffernan, N.T.: Navigating the parameter space of Bayesian Knowledge Tracing models: Visualizations of the convergence of the Expectation Maximization algorithm. In: Proceedings of the 3rd International Conference on Educational Data Mining, pp. 161–170 (2010)
11. Pardos, Z.A., Heffernan, N.T.: Using HMMs and bagged decision trees to leverage rich features of user and skill from an intelligent tutoring system dataset. To appear in Journal of Machine Learning Research W & CP 12. Pavlik, P.I., Cen, H., Koedinger, K.R.: Learning Factors Transfer Analysis: Using Learning Curve Analysis to Automatically Generate Domain Models. In: Proceedings of the 2nd International Conference on Educational Data Mining, pp. 121–130 (2009) 13. Pavlik, P.I., Cen, H., Koedinger, K.R.: Performance Factors Analysis – A New Alternative to Knowledge Tracing. In: Proceedings of the 14th International Conference on Artificial Intelligence in Education, pp. 531–538 (2009), Version of paper used is online at http://eric.ed.gov/PDFS/ED506305.pdf (retrieved January 26, 2011); This version has minor differences from the printed version of this paper 14. Rai, D., Gong, Y., Beck, J.E.: Using Dirichlet priors to improve model parameter plausibility. In: Proceedings of the 2nd International Conference on Educational Data Mining, Cordoba, Spain, pp. 141–148 (2009) 15. Reye, J.: Student modeling based on belief networks. International Journal of Artificial Intelligence in Education 14, 1–33 (2004) 16. Caruana, R., Niculescu-Mizil, A.: Ensemble selection from libraries of models. In: Proceedings of the 21st International Conference on Machine Learning, ICML 2004 (2004) 17. Wang, Q.Y., Pardos, Z.A., Heffernan, N.T.: Fold Tabling Method: A New Alternative and Complement to Knowledge Tracing (manuscript under review) 18. Yu, H.-F., Lo, H.-Y., Hsieh, H.-P., Lou, J.-K., McKenzie, T.G., Chou, J.-W., et al.: Feature Engineering and Classifier Ensemble for KDD Cup 2010. In: Proceedings of the KDD Cup 2010 Workshop, pp. 1–16 (2010)
Creating Personalized Digital Human Models of Perception for Visual Analytics

Mike Bennett1 and Aaron Quigley2

1 SCIEN, Department of Psychology, Jordan Hall, Building 01-420, Stanford University, CA 94305, United States
[email protected]
2 SACHI, School of Computer Science, North Haugh, University of St Andrews, KY16 9SX, UK
[email protected]

Abstract. Our bodies shape our experience of the world, and our bodies influence what we design. How important are the physical differences between people? Can we model the physiological differences and use the models to adapt and personalize designs, user interfaces and artifacts? Within many disciplines Digital Human Models and Standard Observer Models are widely used and have proven to be very useful for modeling users and simulating humans. In this paper, we create personalized digital human models of perception (Individual Observer Models), particularly focused on how humans see. Individual Observer Models capture how our bodies shape our perceptions. Individual Observer Models are useful for adapting and personalizing user interfaces and artifacts to suit individual users' bodies and perceptions. We introduce and demonstrate an Individual Observer Model of human eyesight, which we use to simulate 3600 biologically valid human eyes. An evaluation of the simulated eyes finds that they see eye charts the same way humans do. Also demonstrated is the Individual Observer Model successfully making predictions about how easy or hard it is to see visual information and visual designs. The ability to predict and adapt visual information to maximize its effectiveness is an important problem in visual design and analytics.

Keywords: virtual humans, physiology modeling, computational user model, individual differences, human vision, digital human model.
1 Introduction
Our bodies shape our experience of the world, and our bodies influence what we design. For example, clothes are not designed for people with three arms because designers implicitly model standard human physiology. Yet human bodies differ: some people are born with small bodies, others with bodies that see colors differently (colorblindness). How important are these physical differences between people? Can we model the physiological differences and use the models to adapt and personalize designs, user interfaces and artifacts?

Joseph A. Konstan et al. (Eds.): UMAP 2011, LNCS 6787, pp. 25–37, 2011. © Springer-Verlag Berlin Heidelberg 2011
Fig. 1. To adapt a visualization or visual design we use Individual Observer Models of eyesight. The models integrate with predictors, which feed into adaption techniques for improving the layout and presentation of visualizations and visual designs.
Many domains, such as medicine, health, sports science, and car safety, are creating digital human models [7]. These digital human models are very useful for identifying and evaluating the strengths and weaknesses in prototype artifacts and novel tools. Initially, the majority of digital human models were primarily concerned with modeling humans' physical bodies and biomechanics [3]. Recently, there has been a move to richer and multifaceted digital human models, which are capable of modeling many aspects of being human, including modeling aspects of cognition, simulating affect (emotion) [12], modeling group and social dynamics, and simulating aesthetics and taste [6]. Numerous challenges and research opportunities exist for creating and integrating biomechanical models with cognitive and perceptual models [10,16,7]. In this work, we create personalized digital human models of perception, particularly focused on how humans see (Figure 1). With digital human models of eyesight a visual design can be evaluated to establish what parts of the design are easy or difficult to see. For example, when viewing an information visualization on a wall-sized display from far away, how small can the visual features of the visualization be before they are impossible to clearly and easily see? Individual differences in human bodies often cause differences in how humans perceive and experience the world, e.g. colorblindness. Introduced in this paper are Individual Observer Models, which are user models of individual bodies and perceptions. Individual Observer Models capture how our bodies shape our perceptions, and they are useful for adapting and personalizing user interfaces to suit individual users' bodies and perceptions. We introduce and demonstrate an Individual Observer Model of human eyesight, which we use to simulate 3600 different biologically valid human eyes.
An evaluation of the simulated eyes finds that they see eye charts the same way humans do. Also demonstrated is the Individual Observer Model successfully making predictions about how easy or hard it is to see visual information. The ability to predict and adapt visual information to maximize its effectiveness is an important problem in visual design and analytics.
2 Modeling and Creating Individual Virtual Eyes
To build the Individual Observer Model of human eyesight we create a simplified optical model of how the human eye works. The model has parameters for
Fig. 2. Example of ideal and aberrated wavefronts generated by rays of light travelling through an optical system (eye)
controlling the amount of individual differences in eyesight. The eyesight model is built on research from vision science [14], optometry [13] and ophthalmology [9]. Fortunately, modeling individual differences in eyesight is extensively studied in optometry and ophthalmology research [4,21,8,13]. Building models of human eyesight is challenging, both technically and because many questions remain unsolved about how human vision works. In order to build a useful human vision model, we limit how much of human vision we model. Specifically, we focus on modeling how light travels through the human eye. Individual differences between people's eyes are accounted for by modeling individual differences in the physical structure of human eyes. Depending on the physical structure of their eyes, some people's eyes are very good at focusing light on the back of the eye, while others' are bad at focusing light. This difference in how well the eye does or does not focus light is due to the amount of aberrations in the eye. People whose eyes have high amounts of aberrations usually have worse eyesight than those with low amounts of eye aberrations. Nobody has aberration-free eyesight, but there are normal amounts and types of aberrations. Differences in the amount of eye aberrations have a large impact on how easily people can or cannot see visual information. In particular, modeling eye aberrations is good for predicting the amount of visual detail people can see. The ability to see visual detail is called visual acuity. Good visual acuity commonly implies low amounts of eye aberrations, or that an eye has been corrected to reduce the impact of the aberrations; correction is done either with eye glasses or with various kinds of eye surgery. Visual acuity is known to significantly differ between people. In some cases this difference is genetic in origin, in other cases it is due to age-related changes, and other times it is due to medical issues [21].
The ability to model individual human eyes lets us measure how individual eyes transform visual information. We can take a visual design, pass it through a virtual eye, then capture the visual design as it is seen at the back of the virtual eye.

2.1 Modeling the Flaws and Aberrations in a Human Eye
In this and the following subsections we briefly describe our model of the human eye and how it works. Background on the particular approach we adopt for
modeling eye aberrations can be found in Krueger et al.'s work on human vision [9]. Far more vision science detail and background on our approach to modeling individual eyes and human vision can be found in [1]. Our eye model accounts for how rays of light travel through the human eye. Looking at Figure 2 you can see that multiple rays of light are entering a lens (eye). After the rays of light pass through the lens they are not aligned with each other, and in some cases mistakenly cross. In a perfect eye the rays of light are focused on a single spot (fovea), while in an aberrated eye the light rays are imperfectly focused. Depending on the location at which a ray of light passes through the lens, it will get aberrated in different ways and by different amounts. In order to model how different parts of the lens affect light rays, we use wavefronts. Wavefronts describe how numerous light rays simultaneously behave over many points of a lens [9]. A wavefront is perpendicular to the light ray paths. For example, in Figure 2 we have an ideal wavefront and an aberrated wavefront. The ideal wavefront is the dotted line, and it represents all the light rays emerging from the lens in parallel. Unfortunately, not all the light rays are parallel, so the wavefront is distorted and aberrated. Wavefronts are widely used by ophthalmologists when planning eye surgery to correct human vision, such as LASIK eye surgery [9].

2.2 Simulating Individual Differences in Eyesight
Wavefronts enable us to model individual eyes, because we can create and simulate wavefronts (W_eye), then use the wavefronts to transform a visual design into what is seen at the back of the human eye. Equation 1 gives the wavefront aberration function for modeling wavefronts; for details on using Zernike polynomials to model human eyes see [1,18,20,19,9].

W_eye(p, θ) = Σ_{n,m} C_n^m Z_n^m(p, θ)    (1)

Equation 1. Wavefront aberration function as a weighted sum of Zernike polynomials [20], where C_n^m is the Zernike coefficient in microns (μm), Z_n^m is the double-indexed Zernike mode (see [19]), p is the normalized pupil radius, and θ is the azimuthal component from 0 to 2π radians.

The important thing to realize from the wavefront aberration function is that the Zernike coefficients (C_n^m) weight the Zernike modes (Z_n^m). Each Zernike mode (roughly) corresponds to a particular type of aberration commonly found in the human eye, such as astigmatism, defocus or coma. Each Zernike coefficient describes how much of each particular kind of aberration occurs. When you sum up all the aberrations you end up with a virtual wavefront (W_eye) that describes how light is altered as it passes through the human eye. To simulate the wavefront of an individual's eye, we sum the first fourteen aberrations (Zernike modes), and for each aberration set the amount of aberration (Zernike coefficient) by randomly picking a value from within the normal range for that type of aberration. Elsewhere [20], it has been established what the normal values and ranges of each Zernike coefficient are; this was achieved by measuring and analysing over 2560 wavefronts of healthy human eyes [18]. We use the first fourteen aberrations as it has also been previously established that they matter the most. The ranges of Zernike coefficients we use are provided in [1] (page 86, Table 3.2).
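To make Equation 1 and the coefficient-sampling step concrete, the following Python sketch evaluates a wavefront from a handful of low-order Zernike modes (OSA/ANSI double-index normalization). Note the mode set and the uniform ±0.3 μm coefficient range below are illustrative placeholders, not the fourteen modes and normative ranges used in the paper [1,20].

```python
import numpy as np

# A few low-order Zernike modes Z_n^m on the unit pupil (p = normalized
# radius in [0, 1], t = azimuth in radians), OSA/ANSI normalization [19].
ZERNIKE_MODES = {
    (2, -2): lambda p, t: np.sqrt(6) * p**2 * np.sin(2 * t),        # oblique astigmatism
    (2,  0): lambda p, t: np.sqrt(3) * (2 * p**2 - 1),              # defocus
    (2,  2): lambda p, t: np.sqrt(6) * p**2 * np.cos(2 * t),        # vertical astigmatism
    (3, -1): lambda p, t: np.sqrt(8) * (3 * p**3 - 2 * p) * np.sin(t),  # vertical coma
    (3,  1): lambda p, t: np.sqrt(8) * (3 * p**3 - 2 * p) * np.cos(t),  # horizontal coma
}

def wavefront(coeffs, p, theta):
    """Equation 1: W_eye(p, theta) = sum over (n, m) of C_n^m * Z_n^m(p, theta)."""
    return sum(c * ZERNIKE_MODES[nm](p, theta) for nm, c in coeffs.items())

def sample_individual_eye(rng, ranges):
    """Simulate one eye: draw each Zernike coefficient (in microns)
    uniformly from its population range."""
    return {nm: rng.uniform(lo, hi) for nm, (lo, hi) in ranges.items()}

# Placeholder coefficient ranges in microns (the real per-mode ranges
# come from [1, 20]).
RANGES = {nm: (-0.3, 0.3) for nm in ZERNIKE_MODES}

rng = np.random.default_rng(0)
eye = sample_individual_eye(rng, RANGES)
w = wavefront(eye, p=0.5, theta=np.pi / 4)  # wavefront error at one pupil point
```

Each call to `sample_individual_eye` yields one biologically plausible (under the assumed ranges) virtual eye, which is how a population such as the paper's 3600 eyes could be generated.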
2.3 Simulating What Is Seen at the Back of an Eye
Once we have a virtual eye wavefront (W_eye), we use it to transform the original design into the design as seen at the back of the human eye. To do this, we convert W_eye to an image convolution kernel (widely used in image filtering) and then apply the kernel to the original design. The resulting image is the design as it is seen at the back of the eye. Shown in Figure 3 are examples of how a photograph of a pair of shoes on grass is seen by three different eyes; the amount of individual difference between the eyes is small. A limitation of our current eye simulation is that it is restricted to grayscale images. This restriction exists because vision science does not yet know the normal aberration amounts for color eyesight.
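The wavefront-to-kernel step can be sketched with a standard Fourier-optics construction (a hedged illustration: the paper does not spell out its exact kernel computation, and the pupil sampling, wavelength, and grid size below are assumptions):

```python
import numpy as np

def psf_from_wavefront(W, size=64, wavelength_um=0.55):
    """Turn a wavefront function W(p, theta) into a point spread function:
    PSF = |FFT(pupil * exp(i * 2*pi * W / lambda))|^2, normalized to unit energy."""
    y, x = np.mgrid[-1:1:size * 1j, -1:1:size * 1j]
    p = np.hypot(x, y)
    pupil = p <= 1.0                                      # circular aperture
    phase = 2 * np.pi * W(p, np.arctan2(y, x)) / wavelength_um
    field = np.where(pupil, np.exp(1j * phase), 0.0)
    psf = np.abs(np.fft.fftshift(np.fft.fft2(field))) ** 2
    return psf / psf.sum()

def see_through_eye(image, psf):
    """'Look' at a square grayscale image through the eye: circular
    convolution with the PSF in the Fourier domain (shapes must match)."""
    kernel = np.fft.fft2(np.fft.ifftshift(psf))
    return np.real(np.fft.ifft2(np.fft.fft2(image) * kernel))

# An aberration-free wavefront (W = 0) gives a diffraction-limited eye.
image = np.zeros((64, 64))
image[28:36, 28:36] = 1.0                                 # a white square
perfect = psf_from_wavefront(lambda p, t: 0.0 * p)
seen = see_through_eye(image, perfect)
```

Substituting a wavefront built from sampled Zernike coefficients in place of the zero wavefront produces the individually blurred views like those in Figure 3.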
2.4 Predicting What Users Can or Cannot See
To predict what a user can or cannot see, we use the virtual eyes in a predictor (Figure 1). The predictor quantifies how differently individual eyes see the same visual information. Quantifying the impact of individual differences in eyesight enables us to improve the layout and presentation of visual information by adapting it to suit individual eyes and groups of eyes. The predictor works by looking at the original visual design through a virtual eye, then comparing the original design against the design as seen at the back of the eye. The difference between the original design and the perceived design gives a measure of how individual differences in people's eyesight impact the perception of visual information.
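The compare-and-score idea can be illustrated as follows. This is NOT the paper's PERS_va measure (which is defined in [1]); it is a deliberately simple stand-in that shares the same interface and semantics: a higher score means the eye changed the design more, i.e. the design is harder to see.

```python
import numpy as np

def difficulty_score(original, perceived):
    """Illustrative predictor score (not the actual PERS_va measure from [1]):
    the fraction of gradient (edge/detail) energy the eye destroyed.
    0.0 = design unchanged by the eye, 1.0 = all detail lost."""
    def detail(img):
        gy, gx = np.gradient(img.astype(float))  # per-axis finite differences
        return np.hypot(gx, gy).sum()            # total gradient magnitude
    d0, d1 = detail(original), detail(perceived)
    return 0.0 if d0 == 0 else max(0.0, (d0 - d1) / d0)

chart = np.zeros((32, 32))
chart[::2] = 1.0                                  # high-detail striped pattern
blurry = np.full(chart.shape, chart.mean())       # all detail gone
```

With a scorer of this shape, two designs can be compared through the same virtual eye, mirroring the Figure 4 example where the C pattern scores lower (easier) than the alternating bars.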
Fig. 3. Example of how three different simulated eyes see a photograph of shoes on grass. First photograph is the original version.
Fig. 4. Examples of how increasing amounts of aberration affect two different visual patterns. On the left is the letter C, while on the right is a pattern of alternating bars. The amount of aberration increases from left to right, top to bottom.
In this work Perceptual Stability For Visual Acuity (PERS_va) [1] is used to measure the differences between the original and perceived design. PERS_va is a space-scale approach to image analysis, which uses an information-theoretic measure of image content. Further details on the vision science motivation and equations for PERS_va can be found elsewhere [1]. When used, PERS_va gives a single-value score, which indicates how differently the original and perceived designs look. A high PERS_va score indicates that the visual design changes a lot and is harder to see due to passing through an aberrated eye, while a low score indicates that the visual design is easier to see and does not change much due to the aberrations. For example, suppose the same virtual eye looks at the two visual patterns shown in Figure 4. For that eye, the C pattern has a lower PERS_va score than the alternating bars, which indicates that the aberrations in the eye affect the perception of the alternating bars more than the perception of the C.
3 Evaluation
In order to evaluate the Individual Observer Model of human eyesight, we test whether the virtual eyes see the same as human eyesight. To establish whether the virtual eyes see the same as human observers, we generate 3600 different biologically valid eyes. Each of the 3600 virtual eyes looks at various visual patterns, such as eye charts. If the virtual eyes are valid, then they should agree with human judgements about what parts of the eye charts are easy or hard to see. For example, shown in Figure 5 is an eye chart that is commonly used to measure how well people can or cannot see. It is known that a person looking at the eye chart will find letters at the top of the chart easier to see than at the bottom of the chart. Figure 8 shows how 300 virtual eyes judge what parts of the eye chart are easier or harder to see (the red text is the normalized PERS_va scores). The top of the eye chart is white, indicating it is seen easiest, while the bottom of the eye chart is black, indicating it is the hardest part of the eye chart to see.
3.1 Eye Charts Are Gold Standard Measure of Human Eyesight
When creating models and simulations of human perception a significant challenge is figuring out how to test whether the simulation agrees with human judgements. For testing the virtual eyes it would be easy to create a stimulus, and then test how well the virtual eyes perceive the stimulus. Unfortunately, that could easily lead to cases where the eye model and simulation are optimized for properties of the stimulus. An additional concern is that subtle variants in a stimulus are known to lead to significant differences in human perception. An important criterion for testing digital human models of human perception is: before testing a virtual human with a stimulus, the perceptual properties of the stimulus need to be well understood and established for humans. For testing the digital human models of eyesight we use three different eye charts, which test different aspects of human vision. The creation and validation of eye charts is a sub-field in optometry and vision science. Designing eye charts that do not have any subtle perceptual flaws is tricky, as subtle flaws do lead to incorrect evaluations of people's eyesight. We tested the Individual Observer Model of eyesight with the three different eye charts shown in Figures 5, 6 & 7. These eye charts test different related facets of human vision that are known to affect people's ability to see visual information. Even though the range of visual features on the eye charts is limited (varying letters, size, contrast & lines), it is well established that human performance on these eye charts is a good predictor of human performance at many visual tasks [13,8]. The chart in Figure 5 is called the ETDRS Chart [5] and it tests how well people can see increasingly smaller letters. People can see the top of the chart more easily than the bottom.
Shown in Figure 6 is the Pelli-Robson Contrast Sensitivity Chart [15], which shows letters that become increasingly hard to see, from left to right going downwards. The letters become harder to see because of reduced contrast between letter color and background color. Figure 7 shows the Campbell-Robson Contrast Sensitivity Chart [2], which tests the combination of varying size and contrast. When looking at the Campbell-Robson Chart observers see the visual detail in the region on the right of the chart. As an observer's visual acuity increases or decreases, the region in which they can see detail moves up (better vision) or down (worse vision) and gets bigger (better vision) or smaller (worse vision).

3.2 Results of Evaluation of 3600 Virtual Eyes
Each eye chart is independently used as a stimulus, and each eye chart is divided into a range of equal-sized regions. Depending on how each eye chart is used to test human vision, we expect that the predictor (PERS_va) correctly identifies which regions are easier to see when compared to other regions within the same eye chart. For the evaluation, 3600 virtual eyes viewed the eye charts and the resulting PERS_va scores were averaged over all the eyes. If the virtual eyes are effective, then they should agree with human judgements for the eye charts.
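The region-by-region evaluation can be sketched as follows. Here `score_fn` stands in for the per-eye PERS_va predictor (an assumed interface), and the final normalization mirrors the heatmap convention of Figures 8–13: 0 (white) = easiest region, 1 (black) = hardest.

```python
import numpy as np

def region_heatmap(score_fn, chart, eyes, rows, cols):
    """Split a chart into rows x cols equal regions, score each region
    through every virtual eye, average over eyes, then normalize to [0, 1]."""
    h, w = chart.shape[0] // rows, chart.shape[1] // cols
    heat = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            region = chart[r * h:(r + 1) * h, c * w:(c + 1) * w]
            heat[r, c] = np.mean([score_fn(eye, region) for eye in eyes])
    lo, hi = heat.min(), heat.max()
    return (heat - lo) / (hi - lo) if hi > lo else np.zeros_like(heat)

# Toy check with a dummy scorer (region mean) on a top-to-bottom gradient:
chart = np.tile(np.arange(4.0).reshape(4, 1), (1, 4))
heat = region_heatmap(lambda eye, region: region.mean(), chart,
                      eyes=[None], rows=2, cols=2)
```

Running the same loop for the 1×2, 1×4, 2×2, 4×4 (and so on) subdivisions reproduces the family of heatmaps reported in the results.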
Fig. 5. ETDRS Eye Chart
Fig. 6. Pelli-Robson Contrast Sensitivity Chart
Fig. 7. Campbell-Robson Contrast Sensitivity Chart
Fig. 8. Heatmap of ETDRS eye chart when divided into 1 by 4 regions
Fig. 9. Heatmap of Pelli-Robson eye chart when divided into 2 by 2 regions
Fig. 10. Heatmap of Campbell-Robson chart when divided into 4 by 4 regions
For the eye charts we find that the predictions of what the virtual eyes see agree with what humans see. Results are shown in Figure 8 to Figure 13. These figures show how the predictor scores (normalized PERS_va) compare between regions. When a region is colored white, it is the easiest to see (lowest normalized PERS_va score), black indicates the hardest to see (highest PERS_va score), and grayscale indicates intermediate hardness / PERS_va score. ETDRS Chart. In Figure 8 we see that the top of the eye chart is white, indicating it is seen easiest, while the bottom of the eye chart is black, indicating it is the hardest part of the eye chart to see. These results are correct. Are the results due to how the eye chart is divided into 1 by 4 regions? That is addressed by also analysing the eye chart divided into 1 by 2 (Figure 11) and 1 by 3 regions. The results are also in agreement with human vision. That is, the virtual eyes find the top of the eye chart easier to see, with it becoming increasingly harder to see visual information towards the bottom of the eye chart. Pelli-Robson Chart. We find that the virtual eyes see the Pelli-Robson Chart correctly. Shown in Figure 9 are the results when the eye chart is divided into 2 by 2 regions. The eye chart gets harder to see from left to right going downwards,
Fig. 11. Heatmap of ETDRS chart divided into 1 by 2 regions
Fig. 12. Heatmap of Pelli-Robson chart divided into 2 by 1 regions
Fig. 13. Heatmap of Campbell-Robson chart when divided into 2 by 2 regions
Fig. 14. Virtual eye comparison of font styles & sizes. Shown is mean & standard deviation of PERS_va scores.
Fig. 15. Scatterplot of Courier-Bold PERS_va scores against Times-Roman PERS_va scores, showing how each virtual eye sees each font & font size. Each data point is a virtual eye.
where the lower right corner of the chart is the hardest to see. When the eye chart is divided into 2 by 1 (Figure 12) and 2 by 3 regions the results are also in agreement with human vision. Campbell-Robson Chart. An especially interesting result is how the virtual eyes see the Campbell-Robson Chart. It is especially interesting because the eye chart does not use letters, rather it uses a more visually complex pattern. Our virtual eye models do see the Campbell-Robson Chart correctly. When the chart is divided into 4 by 4 regions, results shown in Figure 10, we find the top right of the chart is hardest to see, while the lower right is easiest to see. Nothing is seen on the left side of the chart, as we would expect from human judgements. When the chart is divided into 1 by 2, 1 by 3, 1 by 4, 2 by 2 (Figure 13) and 3 by 3 the virtual eyes are in agreement with how humans see.
4 Demonstration: Looking at Fonts and InfoVis Graphs
Briefly, we describe two examples of the virtual eyes evaluating how easy it is to see different text font styles and different visual styles of graphs. Many fonts
Fig. 16. Four different ways to draw the same simple graph. Nodes and lines vary.
Fig. 17. Comparison of graphs, mean & std of PERS_va scores for graphs in Figure 16.
styles are often used in visual designs, but it is usually unclear whether one font style is easier to see than another. Especially interesting is whether a particular font style is best for one individual, while a different font style is best for another person. An equivalent challenge for visual analytics and information visualization is correctly designing the visual style of a line or graph node so that it is easy to see. Comparing Font Styles. Two different font styles are compared, at three different font sizes. Twenty virtual eyes looked at a paragraph of text drawn with the different fonts and at the different sizes. The first font style is Courier-Bold (CB), the second is Times-Roman (TR), and the three font sizes are 11pt, 16pt and 21pt. Shown in Figure 14 are the results of the twenty virtual eyes looking at the fonts. Based on the PERS_va scores, the virtual eyes predict that Courier-Bold at point size 21 (CB 21pt) is the easiest font to see, while the hardest font to see is Times-Roman 11pt (TR 11pt). In Figure 15 we can check how easily the same eye sees CB versus TR fonts. For example, in Figure 15 the lower left blue triangle indicates a specific virtual eye had a PERS_va score of approximately 0.815 for the CB 21pt font, while the same eye had a PERS_va score of 0.89 for the TR 21pt font. Particularly noteworthy is that these results are in agreement with the work of Mansfield et al. [11], who evaluated how legible human subjects find the CB versus TR fonts. Figure 19 shows how each individual eye sees each font; these individual differences are discussed in the Discussion section. Comparing Graph Styles. Shown in Figure 16 are the four different simple graphs that twenty virtual eyes looked at. The results shown in Figure 17 indicate that the upper left and upper right graphs are the easiest to see.
Interestingly in Figure 17 the upper left graph has a wider standard deviation, which indicates that due to the differences between the virtual eyes there is more
Fig. 18. Distribution of PERS_va for each virtual eye for each graph style
Fig. 19. Distribution of PERS_va score for each virtual eye for each font style & size
variability between people in how the upper left graph is seen when compared to the upper right graph. Figure 18 shows how each individual eye sees each graph, which is discussed below.
5 Discussion and Conclusions
In the Introduction we asked: Can we model the (users') physiological differences and use the models to adapt and personalize designs, user interfaces and artifacts? In this work we have addressed that question by showing how to model individual physiological differences in eyesight, and then demonstrating the use of the models (along with PERS_va) to evaluate a set of visual designs. After evaluation, the best-scoring visual design for a user's eyesight is identified and can be shown to that user, i.e. the visual design is personalized by picking from a set of competing visual designs. Also posed in the Introduction was the related question: How important are these physical differences between people? Framed another way: are there benefits to personalizing based on individual differences in users' physiology? The results in Figures 18 & 19 establish that the individual differences between people do matter, though the extent to which they matter depends on the visual design. For some visual designs the physical differences between people matter more; for other designs they matter less. For an example of when the physical differences do matter: in Figure 18 most of the virtual eyes find it easiest to see the Upper Right graph (green circle) in Figure 16, while some of the eyes find it easier to see the Upper Left graph (blue triangle). In Figure 19 we find that the physical differences between individual eyes matter less, as all the virtual eyes agree that Courier-Bold 21pt (red circle) is easiest to see. An interesting limitation of this work is that our eye model is of low-level, early-stage vision, predominantly concerned with modeling the optics of the human eyeball. There are other differences in human vision which may be worth modeling, e.g. light receptor density and distribution, perception of motion, optical flow, and visual crowding [17].
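The pick-the-best-design personalization step described above reduces to a simple selection over competing designs. In this sketch, `score_fn` is a stand-in for the PERS_va-based predictor, and the design names and score values are hypothetical (the 0.815/0.89 pair merely echoes the Figure 15 example):

```python
def personalize(designs, eye, score_fn):
    """Return the competing design that the given simulated eye finds
    easiest to see, i.e. the one with the lowest predictor score."""
    return min(designs, key=lambda design: score_fn(eye, design))

# Hypothetical usage: scores precomputed per (eye, design) pair.
scores = {("eye_a", "font_cb_21pt"): 0.815,
          ("eye_a", "font_tr_21pt"): 0.89}
best = personalize(["font_cb_21pt", "font_tr_21pt"], "eye_a",
                   lambda eye, design: scores[(eye, design)])
```

Running the same selection per user (or per cluster of similar eyes) yields individually personalized designs rather than a single one-size-fits-all choice.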
There are also many opportunities for creating Individual Observer Models for other modalities of human experience, e.g. taste, smell, touch. Creating Individual Observer Models is challenging, because it requires quantifying the relationship between a sensation and the perception of that sensation, creating physiologically valid models of human bodies, and modeling individual differences in the physical function of the human body. Based on previous user modeling research, there are various user models that quantify the perception of designs and artifacts, so there may be opportunities to tie existing models of perception to individual physiological models of users. Acknowledgements. This research was made possible with support from University College Dublin, SICSA - Scottish Informatics and Computer Science Alliance, Stanford Center for Image Systems Engineering, and Microsoft.
References
1. Bennett, M.: Designing For An Individual's Eyes: Human-Computer Interaction, Vision And Individual Differences. PhD thesis, College of Engineering, Mathematical And Physical Sciences, University College Dublin, Ireland (2009)
2. Campbell, F.W., Robson, J.G.: Application of Fourier analysis to the visibility of gratings. Journal of Physiology 197, 551–566 (1968)
3. Chaffin, D.B.: Digital human models for workspace design. Reviews of Human Factors and Ergonomics 4, 44–74 (2008)
4. Dong, L.M., Hawkins, B.S., Marsh, M.J.: Consistency between visual acuity scores obtained at different distances. Archives of Ophthalmology 120, 1523–1533 (2002)
5. Ferris, F., Kassoff, A., Bresnick, G., Bailey, I.: New visual acuity charts for clinical research. American Journal of Ophthalmology 94, 91–96 (1982)
6. Freyne, J., Berkovsky, S.: Recommending food: Reasoning on recipes and ingredients. In: De Bra, P., Kobsa, A., Chin, D. (eds.) UMAP 2010. LNCS, vol. 6075, pp. 381–386. Springer, Heidelberg (2010)
7. Fuller, H., Reed, M., Liu, Y.: Integrating physical and cognitive human models to represent driving behavior. In: Proceedings of the Human Factors and Ergonomics Society 54th Annual Meeting, pp. 1042–1046 (2010)
8. Ginsburg, A., Evans, D., Sekuler, R., Harp, S.: Contrast sensitivity predicts pilots' performance in aircraft simulators. American Journal of Optometry and Physiological Optics 59, 105–108 (1982)
9. Krueger, R.R., Applegate, R.A., MacRae, S.M.: Wavefront Customized Visual Correction: The Quest for Super Vision II (2004)
10. Liu, Y., Feyen, R., Tsimhoni, O.: Queueing network-model human processor (QN-MHP): A computational architecture for multitask performance in human-machine systems. ACM Transactions on Computer-Human Interaction 13(1), 37–70 (2006)
11. Mansfield, J., Legge, G., Bane, M.: Psychophysics of reading: XV. Font effects in normal and low vision. Journal of Investigative Ophthalmology and Visual Science 37(8), 1492–1501 (1996)
12. Marsella, S.: Modeling emotion and its expression in virtual humans. In: De Bra, P., Kobsa, A., Chin, D. (eds.) UMAP 2010. LNCS, vol. 6075, pp. 1–2. Springer, Heidelberg (2010)
13. Norton, T.T., Corliss, D.A., Bailey, J.E.: The Psychophysical Measurement of Visual Function (2002)
14. Palmer, S.E.: Vision Science: Photons to Phenomenology (1999)
15. Pelli, D.G., Robson, J.G., Wilkins, A.J.: The design of a new letter chart for measuring contrast sensitivity. Clinical Vision Sciences 2(3), 187–199 (1988)
16. Reed, M., Faraway, J., Chaffin, D.B., Martin, B.J.: The HUMOSIM ergonomics framework: A new approach to digital human simulation for ergonomic analysis. In: Digital Human Modeling for Design and Engineering Conference, SAE Technical Papers Series (2006)
17. Rosenholtz, R., Li, Y., Mansfield, J., Jin, Z.: Feature congestion: A measure of display clutter. In: Proceedings of SIGCHI, pp. 761–770 (2005)
18. Salmon, T.O., van de Pol, C.: Normal-eye Zernike coefficients and root-mean-square wavefront errors. Journal of Cataract & Refractive Surgery 32, 2064–2074 (2006)
19. Thibos, L.N., Applegate, R.A., Schwiegerling, J.T., Webb, R., Members, V.S.T.: Standards for reporting the optical aberrations of eyes. Vision Science and its Applications, 232–244 (February 2000)
20. Thibos, L.N., Bradley, A., Hong, X.: A statistical model of the aberration structure of normal, well-corrected eyes. Journal of Ophthalmic and Physiological Optics 22, 427–433 (2002)
21. West, S.K., Muñoz, B., Rubin, G.S., Schein, O.D., Bandeen-Roche, K., Zeger, S., German, P.S., Fried, L.P.: Function and visual impairment in a population-based study of older adults. Journal of Investigative Ophthalmology and Visual Science 38(1), 72–82 (1997)
Coping with Poor Advice from Peers in Peer-Based Intelligent Tutoring: The Case of Avoiding Bad Annotations of Learning Objects

John Champaign¹, Jie Zhang², and Robin Cohen¹

¹ 200 University Ave W, Waterloo, Ontario N2L 3G1, Canada
{jchampai,rcohen}@uwaterloo.ca
² School of Computer Engineering, Block N4 #02c-110, Nanyang Avenue, Singapore 639798
[email protected]

Abstract. In this paper, we examine a challenge that arises in the application of peer-based tutoring: coping with inappropriate advice from peers. We examine an environment where students are presented with those learning objects predicted to improve their learning (on the basis of the success of previous, like-minded students) but where peers can additionally inject annotations. To avoid presenting annotations that would detract from student learning (e.g. those found confusing by other students) we integrate trust modeling, to detect over time the reputation of the annotation (as voted by previous students) and the reputability of the annotator. We empirically demonstrate, through simulation, that even when the environment is populated with a large number of poor annotations, our algorithm for directing the learning of the students is effective, confirming the value of our proposed approach for student modeling. In addition, the research introduces a valuable integration of trust modeling into educational applications.
1 Introduction
In this paper we explore a challenge that arises when peers are involved, in the environment of intelligent tutoring systems: coping with advice that may detract from a student’s learning. Our approach is situated in a scenario where the learning objects1 presented to a student are, first of all, determined on the basis of the benefits in learning derived by similar students (involving a process of preand post-tests to perform assessments). In addition, however, we allow students to leave annotations of those learning objects. Our challenge then becomes to determine which annotations to present to each new student and in particular to be able to cope when there are a large number of annotations which are, in fact, best not to show, to ensure effective student learning. 1
¹ A learning object can be a video, a chapter from a book, a quiz or anything else a student could interact with on a computer and possibly learn from, as described in [1].
Joseph A. Konstan et al. (Eds.): UMAP 2011, LNCS 6787, pp. 38–49, 2011. © Springer-Verlag Berlin Heidelberg 2011
Coping with Poor Advice from Peers in Peer-Based Intelligent Tutoring
Our work is thus situated in the user modeling application area of intelligent e-learning and, in particular, in the context of peer-based intelligent tutoring. We seek to enhance student learning as the primary focus of the user modeling that we perform. Our user modeling in fact integrates a) a modeling of the learning achieved by the students, their current level of knowledge and their similarity to other students and b) a modeling of the trustworthiness of the students, as annotators. The decision of which annotations to ultimately show to each new student is derived, in part, on the basis of votes for and against, registered with each annotation, by previous students. In this respect our research relates as well to the general topic of recommender systems (in a style of collaborative filtering). In the Discussion section we reflect briefly on how our work compares to that specific user modeling subtopic. We ground the presentation of our research and our results very specifically in the context of coping with possible “bad” advice from peers. And we maintain a specific focus on the setting of annotated learning objects. From here, we reflect more generally on advice for the design of peer-based intelligent tutoring systems, in comparison with other researchers in the field, emphasizing the kind of student modeling that is valuable to be performing. We also conclude with a view towards future research. Included in our final discussion is also a reflection on the trust modeling that we perform for our particular application and suggestions for future adjustments. As such, we present as well a few observations on the value of trust modeling for peer-based educational applications.
2 Overview of Model Directing Student Learning
In this section, we present an overview of our current model for reasoning about which learning objects and which annotations to present to a new student, based on a) the previous learning of similar students, b) the votes for annotations offered by students with a similar rating behaviour and c) a modeling of the annotation's reputation, based, in part, on a modeling of the overall reputation of the annotator. The user modeling that is involved in this model is therefore a combination of student modeling (to enable effective student learning), similarity of peers (but grounded, in part, in their educational similarity) and trust modeling of students as annotators.

Step 1: Selecting a learning object

We begin with a repository of learning objects that have previously been assembled to deliver educational value to students. From here, we attach over time the experiences of peers in order to select the appropriate learning object for each new student. This process respects what McCalla has referred to as the "ecological approach" to e-learning [1]. The learning object selected for a student is the one with the highest predicted benefit, where each learning object l's benefit to active student s is calculated as [2]:
J. Champaign, J. Zhang, and R. Cohen
p[s, l] = κ · Σ_{j=1}^{n} w(s, j) · v(j, l)    (1)
where v is the value of l to any student j previously exposed to it (which we measure by mapping onto a scale from 0 to 1 the increases or decreases in letter-grade post-test assessment compared to pre-test assessment), w is the similarity between active student s and previous student j (measured by comparing current letter-grade assessments of achievement levels) and κ is a normalizing factor currently set to 1/n.

Step 2: Allow annotations of learning objects and votes on those annotations

As students are presented with learning objects, each is allowed to optionally attach an annotation which may be shown to a new student. Once annotations are shown to students, they register a thumbs-up or a thumbs-down rating.
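As an illustrative sketch (function and variable names are our own, not the paper's), the Step 1 selection via Eq. 1 can be implemented as follows:

```python
def predicted_benefit(scores, similarities):
    """Eq. 1: predicted benefit of learning object l for active student s.

    scores[j]       = v(j, l), benefit of l observed for previous student j (0..1)
    similarities[j] = w(s, j), similarity of s to previous student j
    kappa = 1/n normalizes the weighted sum."""
    n = len(scores)
    if n == 0:
        return 0.0  # no history for this object yet
    kappa = 1.0 / n
    return kappa * sum(w * v for w, v in zip(similarities, scores))

def select_learning_object(objects):
    """Pick the object with the highest predicted benefit.

    objects: dict mapping object id -> (scores, similarities) for the
    active student (a hypothetical container, for illustration only)."""
    return max(objects, key=lambda l: predicted_benefit(*objects[l]))
```

For example, an object liked by one very similar student would outrank an object that helped only dissimilar students.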
Fig. 1. Example of a low-quality annotation, adapted from [3]
The learning object presented in Figure 1 is for tutoring caregivers in the context of home healthcare (an application area into which we are currently projecting our research [4]). The specific topic featured in this example is the management of insulin for patients with diabetes. This annotation recommends therapeutic touch (a holistic treatment that has been scientifically debunked but remains popular with nurse practitioners). It would detract from learning if presented and should be shown to as few students as possible. Consider now that:

In an ITS devoted to training homecare and hospice nurses, one section of the material discusses diet and how it is important to maintain proper nutrition, even for terminal patients who often have cravings for foods that will do them harm. One nurse, Alex, posts an annotation saying how in his experience often compassion takes higher precedence than strictly prolonging every minute of the patient's life, and provides details about how he has discussed this with the patients, their families and his supervisor.
This annotation may receive many thumbs up ratings from caregivers who can relate to its advice. Since it is a real world example of how the material was applied, and it introduces higher reasoning beyond the standard instruction, that turns out to be a very worthwhile annotation to show to other students. Some annotations may be effective for certain students but not for others. Consider now: A section on techniques for use with patients recovering from eye surgery in a home healthcare environment has some specific, step-by-step techniques for tasks such as washing out the eye with disinfected water. A nurse, Riley, posts an advanced, detailed comment about the anatomy of the eye, the parts that are commonly damaged, a link to a medical textbook providing additional details and how this information is often of interest to recovering patients. The remedial students struggling with the basic materials find this annotation overwhelming and consistently give the annotation bad ratings, while advanced students find this an engaging comment that enhances the material for them and give it a good rating. Since our approach reasons about the similarity of students, over time, this annotation will be shown to advanced students, but not to students struggling with the material. Some annotations might appear to be undesirable but in fact do lead to educational benefit and should therefore be shown. We present an example below. An annotation is left in a basic science section of the material arguing against an assertion in the text about temperatures saying that in some conditions boiling water freezes faster than cooler water. This immediately prompts negative ratings and follow-up annotations denouncing the original annotator to be advocating pseudo-science. In fact, this is upheld in science (referred to as the Mpemba effect). 
A student adds an annotation urging others to follow a link to additional information, and follow-up annotations confirm the value of the original comment. While, at first glance, the original annotation appeared to be detracting, in fact it embodied and led to a deeper, more sophisticated understanding of the material. Our approach focuses on the value to learning derived from annotations and thus supports the presentation of this annotation.

Step 3: Determine which annotations to show a new student

Which annotations are shown to a student is decided in our model by a process incorporating trust modeling, inspired by the model of Zhang [5] which determines trustworthiness based on a combination of private and public knowledge (with the latter determined on the basis of peers). Our process integrates i) a restriction on the maximum number of annotations shown per learning object, ii) modeling the reputation of each annotation, iii) using a threshold to set how valuable any annotation must be before it is shown, iv) considering the similarity of the rating behaviour of students and v) showing the annotations with the highest predicted benefit. Let A represent the unbounded set of all annotations attached to the learning object in focus. Let r_j^a ∈ [−1, 1] represent the jth rating that was left on
annotation a (1 for thumbs up, −1 for thumbs down and 0 when not yet rated). The matrix R has R^a representing the set of all ratings on a particular annotation a, which corresponds to selecting a column from the matrix. To predict the benefit of an annotation for a student s we consider as Local information the set of ratings given by other students to the annotation. Let the similarity² between s and rater be S(s, rater). Global information contains all students' opinions about the author of the annotation. Given a set of annotations A_q = {a_1, a_2, ..., a_n} left by an annotator (author) q, we first calculate the average interest level of an annotation a_i provided by the author, given the set of ratings R^{a_i} on a_i, as follows:

V^{a_i} = ( Σ_{j=1}^{|R^{a_i}|} r_j^{a_i} ) / |R^{a_i}|    (2)

The reputation of the annotator q is then:

T_q = ( Σ_{i=1}^{|A_q|} V^{a_i} ) / |A_q|    (3)
which is used as the Global interest level of the annotation. A combination of Global and Local reputation leads to the predicted benefit of that annotation for the current student. To date, we have used a Cauchy CDF³ to integrate these two elements into a value from 0 to 1 (where higher values represent higher predicted benefit) as follows:

pred-ben[a, current] = (1/π) · arctan( ((v_F^a − v_A^a) + T_q) / γ ) + 1/2    (4)

where T_q is the initial reputation of the annotation (set to be the current reputation of the annotator q, whose reputation adjusts over time, as his annotations are liked or disliked by students); v_F is the number of thumbs-up ratings and v_A the number of thumbs-down ratings, with each vote scaled according to the similarity of the rater with the current student, according to Eq. 5; γ is a factor which, when set higher, makes the function less responsive to the v_F and v_A values.

v = v + (1 · S(current, rater))    (5)
Annotations with the highest predicted benefit (reflecting the annotation's overall reputation) are shown (up to the maximum number of annotations to show, where each must have at least the threshold value of reputation).

² The function that we used to determine the similarity of two students in their rating behaviour examined annotations that both students had rated and scored the similarity based on how many ratings were the same (both thumbs up or both thumbs down). The overall similarity score ranged from −1 to 1. Other similarity measures that could be explored are raised in the Discussion section.
³ This distribution has a number of attractive properties: a larger number of votes is given a greater weight than a smaller number (that is, 70 out of 100 votes has more impact than 7 out of 10 votes) and the probability approaches but never reaches 0 and 1 (i.e. there is always a chance an annotation may be shown).
There is real merit in exploring how best to set various parameters in order to enable students to achieve effective learning through exposure to appropriate annotations (and avoidance of annotations which may detract from their learning). In the following section, we present our experimental setting for validating the above framework, focusing on the challenge of “bad” annotations.
3 Experimental Setup
In order to verify the value of our proposed model, we design a simulation of student learning. This is achieved by modeling each student in terms of knowledge levels (their understanding of different concepts in the course of study) where each learning object has a target level of knowledge and an impact [2] that increases when the student’s knowledge level is closer to the target. We construct algorithms to deliver learning objects to students in order to maximize the mean average knowledge of the entire group of students (i.e. over all students, the highest average knowledge level of each student, considering the different kinds of knowledge that arise within the domain of application). As mentioned, one concern is to avoid annotations which may detract from student learning. As will be seen in Figure 2, in environments where many poor quality annotations may be left, if annotations are simply randomly selected, the knowledge levels achieved by students, overall, will decline. This is demonstrated in our experiments by comparing against a Greedy God approach which operates with perfect knowledge of student learning gains after an annotation is shown, to then step back to select appropriate annotations for a student. The y-axis in our graphs shows the mean, over all students, of the average knowledge level attained by a student (so, averaged over the different knowledges being modeled in the domain). As well as generating a random set of target levels for each learning object, we also generated a random length of completion (ranging from 30 to 480 minutes) so that we are sensitive to the total minutes required for instruction. The x-axis in each graph maps how student learning adjusts, over time. We used 20 students, 100 learning objects and 20 iterations, repeating the trials and averaging the results. 
For these experiments we ran what is referred to as the raw ecological approach [2] for selecting the appropriate learning object for each new student; this has each student matched with the learning object best predicted to benefit her knowledge, based on the past benefits in learning achieved by students at a similar level of knowledge, as in Step 1 of Section 2. Ratings left by students were simulated by having each student exposed to an annotation provide a score of −1 or 1; we simulated this on the basis of "perfect knowledge": when the annotation increased the student's learning, a rating of 1 was left⁴.

⁴ This perfect knowledge was obtained by running the simulated learning twice, once with the annotation and learning object, and once with just the learning object. A student gave a positive rating if they learned more with the annotation and a negative rating if they learned more without.
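The target-level impact model described at the start of this section, together with the footnote's "perfect knowledge" rating, could be sketched as follows. The specific impact formula is a hypothetical instantiation for illustration; the paper does not spell it out:

```python
def impact(student_knowledge, target_level, max_gain=0.1):
    """Hypothetical impact model: a learning object helps most when the
    student's knowledge level (0..1) is close to its target level."""
    closeness = 1.0 - abs(student_knowledge - target_level)
    return max_gain * max(closeness, 0.0)

def simulated_rating(gain_with_annotation, gain_without_annotation):
    """Perfect-knowledge rating (footnote 4): run the simulated lesson
    twice, with and without the annotation, and compare the gains.
    Returns 1 (thumbs up) if the annotation helped, -1 otherwise."""
    return 1 if gain_with_annotation > gain_without_annotation else -1
```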
The standard set-up for all the experiments described below used a maximum of 3 for the number of annotations attached to a learning object that might be shown to a student; a threshold of 0.4 for the minimum reputability of an annotation before it will be shown; a value of 0.5 as the initial reputation of each student; and a value of 20% for the probability that a student will elect to leave an annotation on a learning object. While learning objects are created by expert educators, annotations created by peers may serve to undermine student learning and thus need to be identified and avoided.

3.1 Quality of Annotations
We performed experiments where the quality of annotations from the group of simulated students varied. For each student we randomly assigned an "authorship" characteristic which provided a probability that they would leave a good annotation (defined as an annotation whose average impact was greater than 0). A student with an authorship of 10% would leave good annotations 10% of the time and bad annotations 90% of the time, while a student with an authorship of 75% would leave good annotations 3/4 of the time and bad annotations 1/4 of the time. In each condition, we defined a maximum authorship for the students, and authorships were randomly assigned, evenly distributed between 0.0 and the maximum authorship. Maximum authorships of 1.0 (the baseline), 0.75, and 0.25 were used. For this set of experiments, we elected to focus solely on Local information to predict the benefit of annotations, i.e. on the votes for and against the annotations presented by peers (but still adjusted according to rater similarity as in Eq. 5). The graphs in Figure 2 indicate that our approach for selecting annotations to show to students (referred to as the Cauchy) in general does well to approach the learning gains (mean average knowledge) attained by the Greedy God algorithm. The random selection of annotations is not as compromised when there is a greater chance for students to leave good annotations (100% authorship) but degrades as a greater proportion of bad annotations is introduced (and does quite poorly when left to operate in the 25% authorship scenario). This reinforces the need for methods such as ours.
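One plausible sketch of the authorship assignment in this condition (names and the seeding convention are illustrative):

```python
import random

def assign_authorships(num_students, max_authorship, seed=0):
    """Draw authorship probabilities evenly distributed in
    [0, max_authorship] (Section 3.1); e.g. max_authorship = 0.25
    for the hardest condition."""
    rng = random.Random(seed)
    return [rng.uniform(0.0, max_authorship) for _ in range(num_students)]

def annotation_quality(authorship, rng):
    """A student with authorship p leaves a good annotation with
    probability p, and a bad one otherwise."""
    return "good" if rng.random() < authorship else "bad"
```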
Fig. 2. Comparison of Various Distributions of Bad Annotations
3.2 Cutoff Threshold
One approach to removing annotations or annotators from the system is to define a minimum reputation level, below which the annotation is no longer shown to students (or new annotations by an annotator are no longer accepted). A trade-off exists: if the threshold is set too low, bad annotations can be shown to students; if the threshold is set too high, good annotations can be stigmatized. In order to determine an appropriate level in the context of a simulation, we examined cut-off thresholds for annotations first of 0.2 and then of 0.4. We considered the combination of Local and Global information in the determination of which annotations should be shown (as outlined in Step 3 of Section 2). In conjunction with this, we adjusted the initial reputation of all students to be 0.7. Students were randomly assigned an authorship quality (as described in Section 3.1) evenly distributed between 0.0 and 1.0. The results in Figure 3 indicate that our algorithm, both in the case of a 0.4 threshold and that of a 0.2 threshold (together with a generous initial reputation rating of 0.7 for annotator reputation), is still able to propose annotations that result in strong learning gains (avoiding the bad annotations that cause the random assignment to operate less favourably).

3.3 Explore vs. Exploit
Even for the worst annotators, there is a chance that they will leave an occasional good comment (which should be promoted), or improve the quality of their commentary (in which case they should have a chance to be redeemed). For this
Fig. 3. Comparison of Various Thresholds for Removing Annotations
Fig. 4. Explore vs Exploit
experiment, we considered allowing an occasional, random display of annotations to the students in order to give poorly rated annotations and annotators a second chance and to enhance the exploration element of our work. We continued with the experimental setting of Section 3.2, where both Local and Global reputations of annotations were considered. We used two baselines (random and Greedy God again) and considered 4 experimental approaches. The first used our approach as outlined above, the standard authorship of 100%, a cut-off threshold of 0.4 and a 5% chance of randomly assigning annotations. The second used an exploration value of 10%, which meant that we used our approach described above 90% of the time, and 10% of the time we randomly assigned up to 3 annotations from learning objects. We also considered conditions where annotations were randomly assigned 20% and 30% of the time. Allowing a phase of exploration to accept annotations from students who had previously been considered as poor annotators turns out to still enable effective student learning gains, in all cases. Our algorithms are able to tolerate some random selection of annotations, to allow the case where annotators who would have otherwise been cut off from consideration have their annotations shared and thus their reputation possibly increased beyond the threshold (if they offer an annotation of value), allowing future annotations from these students to also be presented.
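A minimal sketch of this exploration mix, under the assumption that exploitation ranks candidates by predicted benefit and applies the threshold as in Step 3 of Section 2 (the function names and defaults mirror the paper's setup but are illustrative):

```python
import random

def choose_annotations(candidates, predicted_benefit, threshold=0.4,
                       max_shown=3, explore_rate=0.10, rng=None):
    """With probability explore_rate, show up to max_shown random
    annotations (giving poorly rated annotators a second chance);
    otherwise show the top-ranked annotations whose predicted benefit
    clears the threshold."""
    rng = rng or random.Random()
    if rng.random() < explore_rate:
        pool = list(candidates)
        rng.shuffle(pool)
        return pool[:max_shown]
    ranked = sorted(candidates, key=predicted_benefit, reverse=True)
    return [a for a in ranked if predicted_benefit(a) >= threshold][:max_shown]
```

Setting explore_rate to 0.0 recovers the pure exploitation policy; 0.05–0.30 corresponds to the conditions explored above.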
4 Discussion
We first note that there is value of being aware of possible bad advice from peers and avoiding it – not just for our context but for peer-based intelligent tutoring
in general. In [6] the authors deal with the situation of providing incentives to encourage student participation in learning communities. They use a variable incentive model, based on classroom marks, to encourage behaviours helpful to the community of students. For example, if a student shares a small number of good resources, they will be given a greater incentive to contribute more. In the case of students who contribute a reasonable quantity of low-quality resources, the incentive to contribute is lowered, and the user is prompted with a personalized message to try to have them contribute less and to improve their quality. These incentives do not, however, eliminate scenarios where bad annotations may be left. Our work investigates this consideration. In addition, our approach does not focus on adjusting the contribution frequency of various students, but instead looks to preferentially recommend the more worthwhile contributions. We contrast with researchers in peer-based intelligent tutoring who are more focused on assembling social networks for ongoing real-time advice [7,6], as we are reasoning about the past experiences of peers. Some suggestions for how to bring similar students together for information sharing from [8] may be valuable to explore as an extension to our current work. Our research also serves to emphasize the potential value of trust modeling for educational applications (and not just for our particular environment of education based on the selection of learning objects that have brought benefit to similar peers, in the past). As discussed, we are motivated by the trust modeling approach of Zhang [5]. Future work would consider integrating additional variations of Zhang’s original model within our overall framework. 
For example, we could start to flexibly adjust the weight of Local and Global reputation incorporated in the reasoning about which annotation to show to a student, using methods which learn, over time, an appropriate weighting (as in [5]) based on when sufficient Local information is available and can be valued more highly. In addition, while trust modeling would typically have each user reasoning about the reliability of each other user in providing information, we could have each student maintain a local view of each other student’s skill in annotation (though this is somewhat more challenging for educational applications where a student might learn and then improve their skill over time and where students may leave good annotations at times, despite occasionally leaving poor ones as well). In general, studying the appropriate role of the Global reputation of annotations, especially in quite heterogeneous environments, presents interesting avenues for future research (since currently this value is not in fact personalized for different users). Collaborative filtering recommender systems [9,10,11] are also relevant related work. However, intelligent tutoring systems have an additional motivation when selecting appropriate peer advice, namely to enable student learning. Thus, in contrast to positioning a user within a cluster of similar users, we would like to ideally model a continually evolving community of peers where students at a lower level are removed and more advanced students are added as a student works through the curriculum. This is another direction for future research. Some research on collaborative filtering recommender systems that may be of value for us to explore in the future includes that of Herlocker et al. [11] which explores
what not to recommend (i.e. removing irrelevant items) and that of Labeke et al. [12] which is directly applied to educational applications and suggests a kind of string-based coding of the learning achieved by students, to pattern match with similar students in order to suggest appropriate avenues for educating these new students. Several directions for future work with the model and the simulation would also be valuable to explore. As mentioned previously, we simulated students as accurately rating (thumbs up or thumbs down) annotations based on whether the annotation had helped them learn. It would be interesting to provide for a richer student modeling where each student has a certain degree of “insight”, leading to a greater or lesser ability to rate annotations. If this were incorporated, each student might then elect to be modeling the rating ability of the other peers and this can then be an influence in deciding whether a particular annotation should be shown. It might also be useful to model additional student characteristics such as learning style, educational background, affect, motivation, language, etc. The similarity calculation would need to be updated for such enhancements; similarity should then ideally be modeled as a multi-dimensional measure where an appropriate weighting of factors would need to be considered. Similarity measures such as Pearson coefficients or cosine similarity may then be appropriate to examine. Other variations for our simulations are also being explored. Included here is the introduction of a variation of our algorithm for selecting the learning objects for each student based on simulated annealing (with a view to then continue this simulated annealing approach in the selection of annotations as well). 
This variation is set up so that during the first 1/2 of the trials there is an inverse chance, based on the progress of the trials, that each student would be randomly associated with a lesson; otherwise the raw ecological approach was applied. We expect this to pose greater challenges to student learning in the initial stages but to perhaps result in even greater educational gains at later stages of the simulation. We note as well that simulations of learning are not a replacement for experiments with human students; however, the techniques explored in this work are useful for early development where trials with human students may not be feasible and future work could look to integrate human subjects as well; we are currently in discussion with possible users in the home healthcare field. While our current use of simulations is to validate our model, we may gain additional insights from the work of researchers such as [13] where simulations help to predict how humans will perform. In conclusion, we offer an approach for coping with bad advice from peers in the context of peer-based intelligent tutoring, employing a repository of learning objects that have annotations attached by students. Our experimental results confirm that there is value to student learning when poor annotations are detected and avoided and we have demonstrated this value through a series of variations of our experimental conditions. Our general message is that there indeed is value to modeling peer trust in educational settings.
Acknowledgements. Financial support was received from NSERC’s Strategic Research Networks project hSITE.
References

1. McCalla, G.: The ecological approach to the design of e-learning environments: Purpose-based capture and use of information about learners. Journal of Interactive Media in Education 7, 1–23 (2004)
2. Champaign, J., Cohen, R.: A model for content sequencing in intelligent tutoring systems based on the ecological approach and its validation through simulated students. In: Proceedings of FLAIRS-23, Daytona Beach, Florida (2010)
3. Briggs, A.L., Cornell, S.: Self-monitoring Blood Glucose (SMBG): Now and the Future. Journal of Pharmacy Practice 17(1), 29–38 (2004)
4. Plant, D.: hSITE: healthcare support through information technology enhancements. NSERC Strategic Research Network Proposal (2008)
5. Zhang, J., Cohen, R.: Evaluating the trustworthiness of advice about seller agents in e-marketplaces: A personalized approach. Electronic Commerce Research and Applications 7(3), 330–340 (2008)
6. Cheng, R., Vassileva, J.: Design and evaluation of an adaptive incentive mechanism for sustained educational online communities. User Modeling and User-Adapted Interaction 16(3-4), 321–348 (2006)
7. Read, T., Barros, B., Bárcena, E., Pancorbo, J.: Coalescing individual and collaborative learning to model user linguistic competences. User Modeling and User-Adapted Interaction 16(3-4), 349–376 (2006)
8. Brooks, C.A., Panesar, R., Greer, J.E.: Awareness and collaboration in the iHelp courses content management system. In: EC-TEL, pp. 34–44 (2006)
9. Adomavicius, G., Tuzhilin, A.: Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering 17, 734–749 (2005)
10. Breese, J.S., Heckerman, D., Kadie, C.: Empirical analysis of predictive algorithms for collaborative filtering, pp. 43–52. Morgan Kaufmann, San Francisco (1998)
11. Herlocker, J.L., Konstan, J.A., Terveen, L.G., Riedl, J.T.: Evaluating collaborative filtering recommender systems. ACM Trans. Inf. Syst. 22(1), 5–53 (2004)
12. van Labeke, N., Poulovassilis, A., Magoulas, G.D.: Using similarity metrics for matching lifelong learners. In: Woolf, B.P., Aïmeur, E., Nkambou, R., Lajoie, S. (eds.) ITS 2008. LNCS, vol. 5091, pp. 142–151. Springer, Heidelberg (2008)
13. VanLehn, K., Ohlsson, S., Nason, R.: Applications of simulated students: An exploration. Journal of Artificial Intelligence in Education 5, 135–175 (1996)
Modeling Mental Workload Using EEG Features for Intelligent Systems

Maher Chaouachi, Imène Jraidi, and Claude Frasson

HERON Lab, Computer Science Department, University of Montreal, 2920 chemin de la tour, H3T 1N8, Canada
{chaouacm,jraidiim,frasson}@iro.umontreal.ca
Abstract. Endowing systems with abilities to assess a user's mental state in an operational environment could be useful to improve communication and interaction methods. In this work we seek to model user mental workload using spectral features extracted from electroencephalography (EEG) data. In particular, data were gathered from 17 participants who performed different cognitive tasks. We also explore the application of our model in a non-laboratory context by analyzing the behavior of our model in an educational context. Our findings have implications for intelligent tutoring systems seeking to continuously assess and adapt to a learner's state.

Keywords: cognitive workload, EEG, ITS.
1 Introduction

Modeling and developing systems able to assess and monitor users' cognitive states using physiological sensors has been an important research thrust over the past few decades [1-5]. These physio-cognitive systems aim to improve technology's adaptability and have shown a significant impact on enhancing users' overall performance, skill acquisition and productivity [6]. Various models tracking shifts in users' alertness, engagement and workload have been successfully used in closed-loop systems or simulation environments [7-9]. By assessing users' internal state, these systems were able to adapt to users' information-processing capacity and then to respond accurately to their needs. Mental workload is of primary interest as it has a direct impact on users' performance in executing tasks [10]. Even though there is no agreement upon its definition, mental workload can be seen in terms of resources or mental energy expended, including memory effort, decision making or alertness. It gives an indication of the amount of effort invested as well as users' involvement level. Hence, endowing systems with workload assessment models can provide intelligent assistance, efficient adaptation and more realistic social communication in the scope of reaching optimal interaction conditions. Nevertheless, scarce and scattered research has explored these faculties to refine the learning process within educational settings. Intelligent tutoring systems (ITS) are still mainly based on learners' performance in analyzing learners' skill acquisition process

Joseph A. Konstan et al. (Eds.): UMAP 2011, LNCS 6787, pp. 50–61, 2011. © Springer-Verlag Berlin Heidelberg 2011
Modeling Mental Workload Using EEG Features for Intelligent Systems
or evaluating their current engagement level and the quality of learning [11-14]. Even though the integration of affective models in ITS has added an empathic and social dimension to tutors’ behaviors [15-17], there is still a lack of methods helping tutors to drive the learning process and evaluate learners’ behavior according to their mental effort. The limited range of actions offered to learners (menu choices, help, or answers) restricts the ability to forecast learners’ memory capacity and to objectively assess their effort and level of involvement [18]. EEG techniques for workload assessment can therefore represent a real opportunity to address these issues. The growing progress in developing non-intrusive, convenient and low-cost EEG headsets and devices is very promising, enabling their use in operational educational environments. In this paper we model users’ workload in a learning environment by developing an EEG workload index, and we analyze its behavior in different phases across the learning process. In particular, we performed an experiment with two phases: (1) a cognitive phase, in which users performed different cognitive tasks, was used to derive the workload index; (2) a learning phase, during which the developed index was validated and analyzed. We performed data analysis using machine learning techniques and showed that there are identifiable trends in the behavior of the developed index. The rest of this paper is structured as follows. Section 2 presents previous work on EEG-based workload detection approaches. Section 3 presents our experimental methodology. Section 4 discusses our approach and its implications. We conclude in Section 5.
2 Previous Work

Developing EEG indexes for workload assessment is a well-developed field, especially in laboratory contexts. A variety of linear and non-linear classification and regression methods have been used to determine mental workload in different kinds of cognitive tasks such as memorization, language processing, and visual or auditory tasks. These methods mainly use EEG Power Spectral Density (PSD) bands or Event Related Potential (ERP) techniques to extract relevant EEG features [7-9]. Wilson [19] used an Artificial Neural Network (ANN) to classify operators’ workload level, taking EEG as well as other physiological features as input. Reported results showed up to 90% classification accuracy. Gevins and Smith [20] used spectral features to feed neural networks classifying the workload of users performing various memory tasks. In a car driving simulation environment, Kohlmorgen et al. [21] used Linear Discriminant Analysis (LDA) for workload assessment on EEG features extracted and optimized for each user. The authors showed that decreasing driver workload (induced by a secondary auditory task) improves reaction time. Support Vector Machines (SVM) and ANNs were also used to analyze task demand recorded in lecture and meeting scenarios as well as in other cognitive tasks (Flanker paradigm and switching paradigm) using EEG features. Results reached 92% accuracy in discriminating between high and low workload levels [22, 23]. Berka and colleagues [1, 24] developed a workload index using Discriminant Function Analysis (DFA) for monitoring alertness and cognitive workload in different environments. Several cognitive tasks such as a grid location task, arithmetic computing,
M. Chaouachi, I. Jraidi, and C. Frasson
and image memorization were analyzed to validate the proposed index. The same index was used in an educational context to analyze students’ behavior while acquiring skills in a problem solving context [18, 25, 26]. In this paper, we propose to model users’ workload from EEG features extracted during a cognitive task activity. The major contribution of this study is to validate the model within a learning activity, as opposed to the work of Berka and colleagues [1, 24], where the proposed index was validated only on purely cognitive tasks. Our study uses Gaussian Process Regression to train workload models in a first phase, especially designed to elicit different levels of workload, which are then applied in a second phase, within a learning environment, to detect different trends in learners’ mental workload behavior. We will now describe our experiment.
3 Methodology

The aim of this study was to model and evaluate the mental workload induced during human-computer interaction using features extracted from EEG signals. The experimental process was divided into two phases: a cognitive activity phase including three cognitive tasks designed with incrementally increasing levels of difficulty to elicit increasing levels of required mental workload, and a learning activity phase about trigonometry consisting of three main steps. Data gathered from the first phase were used to derive a workload index, whereas data from the second phase were used to validate the computed index in a “non-laboratory” context. Our experimental setup consisted of a 6-channel EEG sensor headset and two video feeds. All recorded sessions were replayed and analyzed to accurately synchronize the data using the necessary time markers. 17 participants were recruited for this research. All participants were briefed about the experimental process and objectives and signed a consent form. Participation was compensated with 20 dollars. Upon their arrival, participants were equipped with the EEG cap and familiarized with the materials and the environment. All subjects completed a five-minute eyes-open baseline followed by a five-minute eyes-closed baseline. During this period, participants were instructed to be neither active nor relaxed. This widely used technique enabled us to establish a neutral reference for the workload assessment. Then, participants completed the cognitive activity phase. This phase consisted of three successive tasks: (1) Forward Digit Span (FDS), (2) Backward Digit Span (BDS) and (3) Logical Tasks (LT). Each task had between three and six difficulty levels. All participants performed the tasks in the same order and were allowed to self-pace the time required to complete each task. Forward Digit Span (FDS). This test involves attention and working memory abilities.
In this task, a series of single digits was successively presented on the screen. Participants were asked to memorize the whole sequence and were then prompted to enter the digits in the presented order. This task included six difficulty levels, obtained by incrementally increasing the length of the sequence that participants had to retain: level one consisted of a series of 20 sets of 3 digits, level two of 12 sets of 4 digits, level three of 8 sets of 5 digits, level four of 6 sets of 6 digits, level five of 4 sets of 7 digits, and level six of 4 sets of 8 digits.
Backward Digit Span (BDS). The principle of this test is similar to the FDS task. Participants had to memorize a sequence of single digits presented on the screen; the difference was that they were instructed to enter the digits in the reverse of the presented order. Five levels of difficulty were considered by increasing the number of digits in the sequence: the first level consisted of a series of 12 sets of 3 digits, the second level of 12 sets of 4 digits, the third level of 8 sets of 5 digits, the fourth level of 6 sets of 6 digits and the fifth level of 4 sets of 8 digits. No time limit was imposed for the FDS and BDS tasks. Logical Tasks (LT). These tasks require inferential skills on number series and are typically used in brain training exercises or in tests of reasoning. Participants were instructed to deduce a missing number according to a logical rule that they had to infer from a series of numbers displayed on the screen, within a fixed time limit of 30 seconds. An example of such a series is “38 - 2 - 19 - 9 - 3 - 3 - 40 - 4 - ?”, where the missing element (“?”) is “10”: the logical rule to be inferred is that, for each group of three numbers, the last number is the result of dividing the first by the second. The logical tasks involved three difficulty levels. Each level consisted of a series of 5 questions, and difficulty was manipulated by increasing the complexity of the logical rule linking the numbers. After completing the cognitive activity phase, participants took a short break, and were then invited to perform the learning phase, during which a trigonometry course was given. This phase consists of three successive steps: (1) a pretest session, (2) a learning session and (3) a problem solving session. Before starting these tasks, participants were asked to report their self-estimated skill level in trigonometry (low, moderate or expert). Pretest.
This task involved 10 (yes/no/no-response) questions that covered basic aspects of trigonometry (for instance: “is the tangent of an angle equal to the ratio of the length of the opposite side over the length of the adjacent side?”). In this part, participants had to answer the questions without any interruption, help or time limit. Learning Session. In this task, participants were instructed to use a learning environment covering trigonometry, specially designed for the experiment. Two lessons were developed explaining several fundamental trigonometric properties and relationships. The environment provides basic definitions as well as their mathematical demonstrations. Schemas and examples are also given for each presented concept.[1] Problem Solving. The problems presented during this task are based on participants’ ability to apply, generalize and reason about the concepts seen during the learning session. No prerequisites beyond the lessons’ concepts were required to successfully solve the problems. However, a good level of reasoning and concentration is
[1] At the end of the experiment, all participants reported that they were satisfied with the quality of the environment as well as with the pedagogical strategy used for presenting the materials.
needed to solve the problems. A total of 6 problems with gradually increasing difficulty were selected and presented in the same order to all participants. Each problem is a multiple-choice question illustrated by a geometrical figure. A fixed time limit was imposed for each problem, varying according to its difficulty level. Each problem requires some intermediate steps to reach the final solution, and the difficulty level was increased by increasing the number of required intermediate steps. For example, to compute the sine of an angle, learners first had to compute the cosine. Then, they had to square the result and subtract it from 1. Finally, the third step consisted of computing the square root. The problem solving environment provided a limited number of hints for each problem as well as a calculator to perform the needed computations. All the problems were independent in terms of the concepts involved, except for problems 4 and 6, which shared the same geometrical rule (i.e., “the sum of the angles of a triangle is equal to 180 degrees”).

3.1 Subjective and Objective Measures of Workload

After completing each task level, participants were asked to evaluate their mental workload, both in the cognitive activity phase and in the learning activity phase, using the NASA Task Load Index (NASA-TLX) technique [27]. As with other subjective measures of workload, NASA-TLX relies on subjects’ consciously perceived experience with regard to the effort invested and the difficulty of the task. NASA-TLX has the advantage of being quick and simple to administer. In addition to the subjective ratings, other objective factors that may be used for assessing workload were considered in this study, such as performance (i.e., proportion of correct answers in the cognitive tasks, pretest and problem solving) and response time.
3.2 EEG Recording

EEG is a measurement of the brain’s electrical activity produced by synaptic excitations of neurons. In this experiment, EEG was recorded using a stretch electro-cap. EEG signals were received from sites P3, C3, Pz and Fz as defined by the international 10-20 electrode placement system (Jasper 1958). Each site was referenced to Cz and grounded at Fpz. Two additional active sites were used, namely A1 and A2, typically known as the left and right earlobe respectively. This setup, known as a “referential linked-ear montage”, is illustrated in Fig. 1. Roughly speaking, in this montage the EEG signal is equally amplified throughout both hemispheres. Moreover, the linked-ears setup yields a more precise and cleaner EEG signal by referencing each scalp signal to the average of the left and right earlobe sites (A1 and A2); for example, the referenced C3 signal is given by C3 - (A1 + A2)/2. Each scalp site was filled with a non-sticky proprietary gel from Electro-Cap, and impedance was maintained below 5 kOhm. Any impedance problems were corrected by rotating a blunted needle gently inside the electrode until an adequate signal was obtained. The sampling rate was 256 Hz. Data Preprocessing and Feature Extraction. Due to its weakness (on the order of microvolts, 10^-6 V), the EEG signal needs to be amplified and filtered. The brain
electrical activity signal is usually contaminated by external noise, such as environmental interference caused by surrounding devices. Such artifacts clearly degrade the quality of the signal, so a 60 Hz notch filter was applied during data acquisition to remove them. In addition, the acquired EEG signal easily suffers from noise caused by user body movements and frequent eye blinks or eye movements. Therefore, an artifact rejection heuristic was applied to the recorded data using a threshold on the signal power with regard to the eyes-open and eyes-closed baselines: if the amplitude of any epoch in any channel exceeded the threshold by more than 25%, the epoch was considered contaminated and was excluded from subsequent analysis.
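The rejection heuristic described above can be sketched as follows. The paper does not specify how the threshold is derived from the baselines, so taking the per-channel peak amplitude of the baseline recordings is an assumption here, as are all function and variable names:

```python
import numpy as np

def reject_artifacts(epochs, baseline_epochs, margin=0.25):
    """Flag epochs whose peak amplitude exceeds the baseline-derived
    threshold by more than `margin` (25% in the study) in any channel.

    epochs: array (n_epochs, n_channels, n_samples)
    baseline_epochs: array (n_baseline_epochs, n_channels, n_samples)
    Returns a boolean mask of epochs to KEEP.
    """
    # Per-channel threshold: peak absolute amplitude seen in the baseline
    # (assumed derivation; the paper only says "a threshold ... with
    # regards to the eyes open and eyes closed baseline").
    threshold = np.abs(baseline_epochs).max(axis=(0, 2))   # (n_channels,)
    peak = np.abs(epochs).max(axis=2)                      # (n_epochs, n_channels)
    # Contaminated if any channel exceeds threshold * (1 + margin).
    contaminated = (peak > threshold * (1.0 + margin)).any(axis=1)
    return ~contaminated
```

Epochs flagged by the mask would simply be dropped before feature extraction, mirroring the exclusion rule stated above.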
Fig. 1. EEG channel electrode placement
For each participant, the EEG data recorded from each channel were transformed into a power spectral density using a Fast Fourier Transform (FFT) applied to each 1-second epoch, with a 50% overlapping window multiplied by the Hamming function to reduce spectral leakage. Bin powers (the estimated power over 1 Hz) of each channel ranging from 4 Hz to 48 Hz were concatenated to constitute the feature vector. In total, 176 features (4 channels x 44 bins) were extracted from the signal each second. To reduce the input dimensionality and improve the workload model, a Principal Component Analysis (PCA) was applied to the EEG data of each participant. The number of features was reduced to 25 (a 78.63% reduction rate), explaining 85.98% to 94.71% of the variability (M = 90.42%, SD = 3.30%). PCA scores were then z-normalized and the resulting vector was used as input for the model.
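The extraction pipeline (1-second epochs, 50% overlap, Hamming window, 1 Hz bin powers concatenated over channels) might look roughly like this. The exact bin endpoints and the use of squared FFT magnitude as the power estimate are assumptions of the sketch:

```python
import numpy as np

FS = 256          # sampling rate used in the study (Hz)
EPOCH = FS        # 1-second epochs -> 1 Hz frequency resolution
STEP = FS // 2    # 50% overlapping windows

def spectral_features(signal):
    """Per-channel 1 Hz band-power features, one vector per epoch.

    signal: array (n_channels, n_samples).
    Returns array (n_epochs, n_channels * 44): the 1 Hz bin powers
    from 4 Hz upward, concatenated per channel.
    """
    window = np.hamming(EPOCH)                 # reduces spectral leakage
    n_epochs = (signal.shape[1] - EPOCH) // STEP + 1
    feats = []
    for e in range(n_epochs):
        seg = signal[:, e * STEP : e * STEP + EPOCH] * window
        spec = np.abs(np.fft.rfft(seg, axis=1)) ** 2   # power per 1 Hz bin
        feats.append(spec[:, 4:48].ravel())            # 44 bins per channel
    return np.asarray(feats)
```

For the 4-channel montage used here this yields 4 x 44 = 176 features per epoch; in the study these were further reduced to 25 PCA components and z-normalized before being fed to the model.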
4 Results and Discussion

The experimental results are presented in the following subsections. The first part is concerned with the cognitive phase data analysis. The second part describes mental workload modeling from the EEG. The third part deals with the validation and evaluation of the model within the learning activity.

4.1 Cognitive Activity Results

In order to evaluate the NASA-TLX subjective estimates of mental workload, a correlational analysis was performed with regard to the task design and the objective variables of performance and response time.
Fig. 2. Mean NASA-TLX workload estimate for each difficulty level of the forward digit span, backward digit span and logical tasks
A repeated-measures one-way ANOVA (N = 17) was performed for the FDS, BDS and LT cognitive activities across their associated difficulty levels. Degrees of freedom were corrected using Greenhouse-Geisser estimates of sphericity for FDS and BDS (epsilon = 0.35 and 0.54 respectively), as the assumption of sphericity had been violated (chi-square = 60.11 and 54.40 respectively, p …).

Correlation   filter                       r (eye vs. mouse)   N       constant   weight   sig.
vertical      baseline                     .250                89739   345.133    .228     .000*
              all; weighted by frequency   …                   59857   137.826    .723     .000*
              frequency > 25%              .746                36202   142.850    .725     .000*
              frequency > 50%              .751                21401   139.159    .701     .000*
              frequency > 75%              .777                21270   153.598    .696     .000*
horizontal    baseline                     .101                89739   577.499    .075     .000*
              all; weighted by frequency   .521                59857   252.650    .596     .000*
              frequency > 25%              .513                36202   265.610    .579     .000*
              frequency > 50%              .551                21401   202.246    .682     .000*
              frequency > 75%              .580                21270   170.052    .764     .000*
D. Hauger, A. Paramythis, and S. Weibelzahl
Table 3. Frequency with which the element hovered by the mouse matches the element currently being looked at, by frequency of mouse moves – overall and while the mouse is being moved

level of        overall                                filter: while mouse moved
mouse moves     match     Std. Dev.   N                match     Std. Dev.   N
0%-25%          21.00%    .404        24331            59.00%    .493        12555
25%-50%         51.00%    .500        9461             60.00%    .490        8512
50%-75%         70.00%    .458        3478             82.00%    .387        3314
75%-100%        72.00%    .451        13921            72.00%    .449        13844
Total           26.00%    .441        51191            65.00%    .477        38225
H1.c: Do gaze and mouse point at the same paragraph on the screen? In general, the element pointed at with the mouse coincides with the paragraph looked at in 26% of the cases. When limiting the analysis to cases where people use the mouse a lot, this rises to 72% (see Table 3). Again, more restrictive frequency filters increase the likelihood that the paragraphs are the same. H1.d: Are H1.c predictions better while the mouse is in motion? In line with the H1.b results, predicting which paragraph has been looked at is easier when the mouse is in motion. In particular, for users who do not use the mouse a lot (frequency level 0%-25%), prediction improves strongly (compare columns “overall” and “filter: while mouse moved” in Table 3). H1.e: If vertical predictions are better, should we select vertical moves rather than just frequent moves in any direction? While in all cases the predictions were better than the baseline and followed the same trends as the previous results (e.g., in motion better than not in motion), the frequency of vertical movements did not improve prediction over the levels observed for the general frequency of mouse movements (e.g., r = .397 for vertical moves vs. r = .528 for general moves). H2.a-c: When the mouse is actively used, users are likely to look at the region where the mouse is positioned. The mean distance between mouse and eye position drops to less than 50% when users are clicking, selecting text, or moving the mouse (see Tables 4 and 5). Again, the horizontal correlation is lower than the vertical. This is particularly true for text selection activities, where users seem to read left to right but keep the mouse at one end of the selected text. However, this improvement in prediction comes at the expense of very limited coverage: when mouse actions occur, predictions will be good, but clicks, selections and movements occur only for a fraction of the total observation time.
Table 4. Mean distances in pixels between mouse cursor and eye gaze for selected types of interactions

                  N       mean distance   Std. Error   F        Sig
click     no      86838   383.9           .746         796.5    .000*
          yes     2901    163.4           7.77
select    no      89706   382.0           .746         26.31    .000*
          yes     33      136.3           47.8
in move   no      29882   404.7           .768         7063.5   .000*
          yes     59857   222.1           2.033

H3, H4, H5: While we could not establish statistically significant support for these hypotheses, this may partly be due to the type of task we set. For
Using Browser Interaction Data to Determine Page Reading Behavior
Table 5. Regression models for user interactions

event        filter     r (eye vs. mouse)   N       constant   weight   sig.
vertical     baseline   .250                89739   345.133    .228     .000*
             click      .873                2901    83.245     .820     .000*
             select     .986                33      64.388     .826     .000*
             in move    .672                59857   161.966    .659     .000*
horizontal   baseline   .101                89739   577.499    .075     .000*
             click      .808                2901    98.057     .774     .000*
             select     .494                33      334.191    .579     .004*
             in move    .436                59857   330.684    .435     .000*
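The regression models reported in Table 5 (one constant and one weight per condition) appear to be simple linear fits of eye position against mouse position. A minimal least-squares sketch, with illustrative names:

```python
import numpy as np

def fit_gaze_regression(mouse_y, eye_y):
    """Least-squares fit of a simple linear model of the form used in
    Table 5: predicted eye position = constant + weight * mouse position.

    mouse_y, eye_y: 1-D arrays of coordinates (e.g., vertical pixels).
    Returns (constant, weight).
    """
    # Design matrix with an intercept column.
    A = np.vstack([np.ones_like(mouse_y), mouse_y]).T
    (constant, weight), *_ = np.linalg.lstsq(A, eye_y, rcond=None)
    return constant, weight
```

Fitting such a model separately for each filter condition (baseline, click, select, in move) would reproduce the per-condition constants and weights of the table.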
instance, we observed only a limited number of scrolling-up events (H4) and very few instances of small-increment scrolling (H5). The analysis of relative areas on the screen (e.g., top, middle, bottom) seems to be invalidated by the fact that almost everybody gazed at the middle part of the screen for the majority of the time (H3) (see Fig. 2); this finding (i.e., users tend to scroll down just a few lines at a time while reading, to keep the currently read item at the center of the screen), however, is in itself quite useful in establishing a prediction algorithm, as we will see later.
3 From Hypotheses to Algorithm

3.1 General Structure of the Algorithm
Based on the findings outlined in the previous section, an algorithm was developed to calculate the extent to which paragraphs (or more generally: text fragments) of a page have been read. The main premise of the algorithm is the “splitting” of the time spent reading between the items visible at that time. Therefore each page view is split into “scroll windows”, i.e. the time window where the visible items and their relative position on the screen remain constant (identified as the time spans between load, scroll or resize events). For each such scroll window, the algorithm first calculates the “estimated time spent reading” (TE ). This is based on the measured “available” duration of the
Fig. 2. Histogram of vertical eye position within the screen
scroll window (TA), but also takes into consideration interaction data that may provide additional information. For instance, if users usually exhibit considerable mouse activity, and then suddenly stop interacting, it is possible that they have not been continuously reading. The motivation behind the introduction of TE is the derivation of a time measure that potentially more accurately represents the real time that users spent reading during a scroll window.

To get the time spent reading for each visible fragment (TSR), TE is split among the visible page fragments by multiplying it with a vector defining the percentage of time that should be assigned to each fragment. This vector is a weighted average of a number of normalized distributions of time (TDN) created by different modifier functions (henceforth referred to as “modifiers”), each focusing on a different aspect, for instance, the number of words in the different paragraphs, the number of interactions, the relative position of the paragraph within the screen, etc. (see Section 3.2). Each modifier receives as input the interaction data of the scroll window, and provides the following output values:

– wINT: The internal weight of the modifier, which provides an indication of the modifier’s relative significance for a given scroll window. For instance, a modifier based on text selections would return a wINT of zero if no selections were made during a scroll window, as it cannot provide any predictions.

– TDN: The modifier’s normalized distribution of time over the text fragments (partially or entirely) visible during the scroll window. The result is a vector of weights for each such fragment.

– T%: The modifier’s estimated percentage of the total available time (TA) the user spent reading in a scroll window.

– wTIME: A weight to be used in association with T%. Similar to the internal weight for the time distribution, this value is the internal weight for the estimate of the percentage of time a user spent reading.

Further to the above, each modifier has an “external weight” (wEXT), which denotes the relative significance of a modifier over others. A modifier based on text selections, for instance, provides stronger indicators of reading behavior than one based on fragment visibility. Based on the above, TE is defined as follows:

\[ T_E = T_A \cdot \frac{\sum_{i=1}^{N_M} w_{\mathrm{EXT}_i} \cdot T_{\%_i} \cdot w_{\mathrm{TIME}_i}}{\sum_{i=1}^{N_M} w_{\mathrm{EXT}_i} \cdot w_{\mathrm{TIME}_i}} \]

where \(N_M\) is the total number of modifiers applied. The final algorithm can then be described as follows:

\[ \overrightarrow{TSR} = T_E \cdot \frac{\sum_{i=1}^{N_M} w_{\mathrm{EXT}_i} \cdot w_{\mathrm{INT}_i} \cdot \overrightarrow{TDN}_i}{\sum_{i=1}^{N_M} w_{\mathrm{EXT}_i} \cdot w_{\mathrm{INT}_i}} \]

where \(\overrightarrow{TSR}\) is the column vector containing the calculated time spent reading for each visible text fragment.
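The two formulas can be combined into a short sketch. The modifier record layout and all names are illustrative, not the paper's implementation:

```python
import numpy as np

def time_spent_reading(t_avail, modifiers):
    """Combine modifier outputs into per-fragment reading times (TSR).

    t_avail: measured duration of the scroll window (TA, seconds).
    modifiers: list of dicts with keys
        'w_ext'  - external weight (fixed per modifier type)
        'w_int'  - internal weight for this scroll window
        'tdn'    - normalized time distribution over visible fragments
        't_pct'  - estimated fraction of TA actually spent reading
        'w_time' - weight associated with t_pct
    Returns the TSR vector (time assigned to each visible fragment).
    """
    # TE: weighted mean of the modifiers' estimates of the fraction of
    # TA the user was actually reading.
    num = sum(m['w_ext'] * m['t_pct'] * m['w_time'] for m in modifiers)
    den = sum(m['w_ext'] * m['w_time'] for m in modifiers)
    t_est = t_avail * num / den

    # Split TE over the fragments: weighted mean of the normalized
    # time distributions, weighted by external * internal weights.
    dist_num = sum(m['w_ext'] * m['w_int'] * np.asarray(m['tdn'])
                   for m in modifiers)
    dist_den = sum(m['w_ext'] * m['w_int'] for m in modifiers)
    return t_est * dist_num / dist_den
```

Note that, as written, the sketch assumes at least one modifier contributes a non-zero weight in each sum; a production version would need to handle scroll windows where no modifier applies.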
The external weight of the modifiers is the only part of the algorithm that is not directly derived from user interaction. Our first experiments had already shown which interactions should get stronger weights (e.g., text selections). Combining these results with the more recent findings (specifically with the identified strength of the correlation for confirmed hypotheses) allowed us to arrive at the set of weights that was used to derive the results described in Section 4. Note that we do not consider these weights to be final or absolute; we expect that adjustments may be needed to cater for specific characteristics of the reading context. Nevertheless, two points merit attention: (a) the derived weights appear to have little sensitivity to the type of text being read; and (b) even in a “worst case” scenario with all weights set to 1 (equivalent to no knowledge of the expressiveness of different interaction patterns), the algorithm still classified 73.3% of the paragraphs correctly (92.9% with a maximum error of 1 level); please refer to Section 4 for a discussion of these percentages.

3.2 The Weight Modifiers
Currently there are six implemented modifiers, each focusing on a different aspect of the interaction data. Due to lack of space we provide here only a brief outline of each modifier, along with its base hypotheses and external weight:

MSelect: Based on text tracing, i.e., selecting portions of text while reading [16], which is a strong indicator of current reading. In all our experiments it was both the strongest and the least frequent type of interaction. (H2.c, wEXT = 150)

MClick: Based on mouse clicks, which, like text selections, are a strong indicator of current reading. If users click on fragments/paragraphs, this modifier splits the available time among them. (H2.b, wEXT = 70)

MMove: Based on the users’ tendency to move their mouse while reading. This modifier sets weights according to the time the mouse cursor has been moved over a fragment. The more users tend to move their mouse, the stronger the weight of this modifier. (H1.a-d and in particular H1.c-d, wEXT = 45)

MMousePositions: Even if the mouse is not moved, the position of the cursor may be used to identify the area of interest. This modifier considers the placement of the mouse over a fragment, as well as its placement at a position that falls within the vertical extent of the fragment (e.g., in the white-space area next to the text). (H1.e, wEXT = 45)

MScreenAreas: Even if there are only few interactions, we may still make assumptions about what has been read. Most people prefer to read in the center of the screen, so if the page is long enough that a user could scroll up or down (the first and last paragraphs of a page necessarily have to be read while at the top/bottom of the screen), this modifier puts its weight on the centered 80% of the page. A more fine-grained distribution over different parts of the screen, or additional knowledge of the user’s preferred reading area, might improve a future version of this modifier. (adjusted H3 as per Fig. 2, wEXT = 5)

MVisibility: The simplest modifier; it just splits the time among all visible paragraphs based on the number of words they contain. (wEXT = 1)
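As an illustration, the simplest of these could be written as a function producing the four modifier outputs described earlier (wINT, TDN, T%, wTIME). The dictionary layout and the choice of T% = 1 for a purely visibility-based modifier are assumptions of the sketch:

```python
def visibility_modifier(fragments):
    """Sketch of an MVisibility-style modifier: distribute the scroll
    window's time over all visible fragments in proportion to their
    word counts.

    fragments: dict mapping fragment ids to word counts (illustrative
    interface, not the paper's implementation).
    """
    total_words = sum(fragments.values())
    # Normalized time distribution (TDN): weights sum to 1.
    tdn = {fid: words / total_words for fid, words in fragments.items()}
    return {
        'w_ext': 1,      # external weight given in the paper
        'w_int': 1,      # always applicable: something is always visible
        'tdn': tdn,
        't_pct': 1.0,    # assumes the whole window was spent reading
        'w_time': 1,
    }
```

Stronger modifiers (selections, clicks, movement) would follow the same interface but return a zero internal weight whenever their triggering interaction is absent from the scroll window.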
Table 6. Classification distance of paragraphs in the Go course

Dist.   # Par.   %        Cumulative %
0       746      78.7%    78.7%
1       143      15.1%    93.8%
2       47       5.0%     98.7%
3       12       1.3%     100.0%
Total   948      100.0%

Table 7. Classification distance of paragraphs in the Questions page

Dist.   # Par.   %        Cumulative %
0       41       85.4%    85.4%
1       3        6.3%     91.7%
2       2        4.2%     95.8%
3       2        4.2%     100.0%
Total   48       100.0%
Table 8. Classification of paragraphs split by the actual reading level (L0-L3) – context: Go course

predicted   actual L0       actual L1      actual L2      actual L3
L0          596  (89.1%)    23  (26.7%)    7   (8.0%)     0   (.0%)
L1          46   (6.9%)     34  (39.5%)    15  (17.2%)    14  (13.2%)
L2          15   (2.2%)     18  (20.9%)    43  (49.4%)    19  (17.9%)
L3          12   (1.8%)     11  (12.8%)    22  (25.3%)    73  (68.9%)
Total       669  (100.0%)   86  (100.0%)   87  (100.0%)   106 (100.0%)
4 Results
In order to evaluate our algorithm we measured the reading speed of each user (rate of words per minute). We used that rate, along with the number of words in each paragraph, to estimate the time the user would require for reading it (T_p^req). We then used that in conjunction with the time the user spent on the paragraph, as per the algorithm’s predictions (T_p^pred), to define four “levels” of reading for paragraphs:

– level 0 (paragraph skipped): T_p^pred < 0.3 · T_p^req
– level 1 (paragraph glanced at): 0.3 · T_p^req ≤ T_p^pred < 0.7 · T_p^req
– level 2 (paragraph read): 0.7 · T_p^req ≤ T_p^pred < 1.3 · T_p^req
– level 3 (paragraph read thoroughly): 1.3 · T_p^req ≤ T_p^pred
The user’s fixations have been used to calculate the baseline reading level our algorithm should be compared against. Table 6 shows the absolute distances between the calculated reading level and the baseline from the eye tracking data. In 78.7% of all cases the algorithm was able to classify the paragraph correctly. However, not only the exact matches, but also the difference between the baseline category and the level selected is important. In 93.8% of all cases this distance is only 0 or 1. Table 8 shows in more detail how paragraphs of each level have been categorized by the algorithm. The highest precision was reached for paragraphs that have been skipped or read thoroughly. However, even for the intermediate levels the algorithm classified most paragraphs correctly. The focus of our experiment was to test the algorithm in the context of reading learning materials. Nevertheless, it is worth noting that the algorithm performs comparably well in the other contexts tested. For example, on pages where users answered questions (a task that inherently requires more interaction), the
algorithm performed even better than in the case of the Go course (see Table 7). However, we concentrate on the learning scenario where it is more difficult to get valid information due to reduced requirements for interaction.
5
Conclusions and Ongoing Work
This paper has demonstrated that it is possible to predict, with satisfactory precision, the users’ reading behavior on the basis of client-side interaction. In our experiments, users visited all pages of provided hypertext material. A traditional AHS might, thus, assume everything has been read. In contrast, using the proposed approach, we were able to determine that 70% of the paragraphs were not read, and users focused on certain paragraphs instead of reading entire pages. Our experiment has shown that the algorithm, using mouse and keyboard events, can correctly identify a paragraph’s “reading level” in 78.7% of all cases (and in 93.8% of the cases calculate the correct level ±1). The algorithm, in its current form, has weaknesses that need to be addressed. To start with, it is geared towards pages that contain one main column of text. While this may be typical for learning content, enhancements are required before the algorithm can satisfactorily handle multi-column page content. A related question is how well the algorithm might perform in mobile settings, with different screen factors (and, therefore, different amounts of text visible at a time) and potentially different interaction patterns (brought forth by the screen factor, or by alternative input techniques available). Another area that requires further work is the establishment of the effects of external modifier weights in different reading contexts (e.g., with less text visible at a time, the visible part of a page may be a stronger indicator on what is currently being read). Among the strengths of this algorithm is its extensibility. For example, additional input devices may be easily integrated through client-side “drivers” and the introduction of corresponding modifiers (e.g. a webcam, eye tracking, etc.). The same is true for interaction patterns that may be established as evidence for reading behavior in the future. 
Further to the above, and specifically in the domain of learning, we intend to test the effects of having access to predictions of reading behavior on learner models and their use in adaptive educational hypermedia systems. Our next experiment will use the presented algorithm to make predictions on which questions relating to course content a learner is likely to be able to answer, based on what that learner has (been predicted to have) read from that content. Finally, as soon as the algorithm has matured and been shown to be of general applicability, we intend to make the implementation (along with the accompanying JavaScript library for monitoring) publicly available. Acknowledgments. The work reported in this paper is partially funded by the “Adaptive Support for Collaborative E-Learning” (ASCOLLA) project, supported by the Austrian Science Fund (FWF; project number P20260-N15).
158
D. Hauger, A. Paramythis, and S. Weibelzahl
A User Interface for Semantic Competence Profiles
Martin Hochmeister and Johannes Daxböck
Electronic Commerce Group, Vienna University of Technology, Favoritenstraße 9-11/188-4, 1040 Vienna, Austria
[email protected],
[email protected] Abstract. Competence management systems are increasingly based on ontologies representing competences within a certain domain. Most of these systems represent a user's competence profile by means of an ontological structure. Such semantic competence profiles, often structured as a hierarchy of competences, are difficult to navigate for self-assessment purposes. The more competences a user profile holds, the more challenging the comprehensive presentation of profile data becomes. In this paper, we present an integrated user interface that supports users during competence self-assessment and facilitates a clear presentation of their semantic competence profiles. For evaluation, we conducted a usability study with 19 students at university. The results show that users were mostly satisfied with the usability of the interface, which also represents a promising approach to efficient competence self-assessment. Keywords: User Interface, User Profile, Semantic Competence Profile, Profile Editing, Ontology.
1 Introduction
Today, competence management systems (CMSs) play an important role in corporate efforts to ensure the achievement of strategic goals and thus gain sustainable competitive advantage. The major task of a CMS is the provision of information describing an individual's competences. This information is used to support tasks like expert finding or workforce planning [8]. A user's competence information is also used for personalization. For instance, in learning management, recommendations for future learning activities are personalized based on a user's competences. In recent years, CMSs have adopted competence ontologies for the representation of competences [4,18]. Such an ontology consists of competence concepts and the relations between them. Liao et al. [12] use competence ontologies to empower a knowledge-based system to effectively find individuals to accomplish a certain business task. Individuals are represented with competence profiles that contain sets of instances from the underlying competence ontology. Since competences are hierarchically structured, the representation with ontologies seems
Joseph A. Konstan et al. (Eds.): UMAP 2011, LNCS 6787, pp. 159–170, 2011. © Springer-Verlag Berlin Heidelberg 2011
160
M. Hochmeister and J. Daxböck
very suitable. Due to the relations between competences, it is possible to infer additional knowledge about competences. For instance, Sieg et al. [16] use additional knowledge gained from ontological reasoning to improve web search personalization. In the following, we refer to a competence profile based on a competence ontology as a semantic competence profile. To gain user acceptance for a CMS, it is necessary to leave the ultimate control of profiles to the users [13]. Even though competences may be derived implicitly, the users should be able to maintain them. A review of CMSs [8] reports that employees are increasingly supplied with self-service portals to self-assess competences. Besides competence management, Bull and Kay [3] describe a similar trend towards opening profiles to users in the field of intelligent tutoring systems. Giving learners greater control over their learner models may aid learning by supporting reflection on competences and the planning of future educational activities. There is a need for tools allowing individuals to maintain their competence profiles. Competence ontologies are mostly very large in both breadth and depth. Navigating such ontologies, as well as presenting semantic competence profiles, are major challenges in the design of user interfaces [5,1]. As for navigation, a conventional tree view of concepts is very cumbersome to handle. A user starts at the top of the tree and navigates to the bottom of it. If this navigation leads to a path the user is not interested in, the user must go back all the way to the starting point. Regarding the presentation of a competence profile, users may quickly lose their sense of the big picture as more concepts become available in their profile. In this paper, we address the following two questions in order to make competence self-assessment easier: 1. How can we support a user in navigating a competence ontology, selecting concepts and assigning values to these concepts? 2.
How can we achieve a useful competence profile presentation for the users? In answering these questions, we propose a novel user interface that consists of (1) a navigation and (2) a presentation component. The navigation component helps users to easily select concepts from a competence ontology, to assign them a competence score and finally to store the competence profile in the database. The presentation component aims to provide a comprehensive view of competences as well as several options to adapt this view to personal preferences. Regarding the research method, we adhered to a constructive approach and started out by developing a prototype through an iterative design process. We compiled various user interface elements from the literature and reviewed their usability. For evaluation, we set up a usability study with 19 master students enrolled in a computer science program at university. Within a tutorial on knowledge management, students were asked to assess their competences in the field of internet technologies using the proposed interface. We focused mainly on the level of user satisfaction by means of quantitative feedback. However,
there was room for qualitative feedback as well. We logged all user interactions in order to interpret user behavior and analyze problems that might occur during user testing. This research is part of a larger project that envisions a system which recommends courses to students based on their competence profiles. Students may also benefit from finding other students with the same interests for building learner groups. The integration of this system with the university's career platform may bring further value for both students and potential employers. The remainder of this paper is organized as follows. The next section discusses related work concerning ontology navigation and approaches to presenting profile data. Section 3 describes the components that build up the prototype system and introduces the structure of the competence ontology. In Sections 4 and 5 we propose the integrated user interface for competence self-assessment. The setup of the usability study as well as its results are presented in Section 6. We conclude in Section 7 and present ideas for future work.
2 Related Work
A survey regarding ontology visualization shows that ontologies are predominantly structured as hierarchies [11]. However, in many domains ontologies tend to be quite large and complex, which makes them difficult to explore and present [17]. The Visual Information Seeking Mantra tackles the problem of representing large data sets in three steps: overview first, then zoom and filter, and show details-on-demand [15]. When dealing with large unknown data, the concept of Information Scents [14] and its application in the form of scented widgets [19] improves traditional user interface elements. Information scents provide users with more context and help them accomplish tasks more efficiently. Crowder et al. [5] make use of content-dependent filtering, an autocompletion text box and partial segments using drop-down lists for ontology navigation. With regard to cognitive support for ontology navigation, d'Entremont and Storey [6] suggest principles to provide overview and context, reduce complexity, indicate points of interest and support incremental exploration. They further introduce a plugin for the ontology editor Protégé that applies these principles by providing Visual Orientation Cues for user-relevant content. The user interface Jambalaya [17], also based on Protégé, employs the concept of nested interchangeable views to allow a user to explore multiple perspectives of information at different levels of abstraction. Based on the reviewed principles, Bakalov et al. [1] present a rich-interaction interface enabling users to inspect and alter their user profiles. The interface provides an overview of terms representing user interests, allows for zooming/filtering and displays additional term information, such as a term's relationship with other terms. To the best of our knowledge, none of the reviewed approaches support an ontology navigation that allows users to reflect and compare values between
Fig. 1. System architecture and competence ontology. The client hosts the navigation component (dynamic drop-down lists, autocompletion text box, bullet graphs) and the presentation component (profile presentation); the server holds the competence profiles and the competence ontology, a snippet of which shows, e.g., Java is_a OOP, JDK has_synonym Java Development Kit, and a User who has_competence Java with an Expertise Level property.
several ontology concepts. They also do not facilitate a clear procedure to assign values to ontology concepts.
3 System Architecture
Figure 1 shows the architecture of our prototypical implementation, which is based on a three-tier model commonly used for web applications. We iteratively developed the user interface elements into more advanced ones for ontology navigation, competence self-assessment and competence profile presentation. For navigation, competence concepts are retrieved from the ontology on demand. The competence ontology may therefore grow without affecting the interface's performance. For this retrieval task, AJAX methods effectively decrease user waiting time and thus increase efficiency. Once users assign competences to their profiles, the whole profile is transferred to the server for data storage. Dorn and Hochmeister [7] introduce a competence ontology as the foundation for their competence mining approach. We use this ontology as a starting point for the ontology design. The ontology represents competence concepts within the domain of internet technologies, structured in a hierarchical order. The more general/specific a competence concept is, the higher/lower its place in the hierarchy. In order to support the assessment of competences, ontology concepts are enhanced with a property that holds a value reflecting the expertise level of a competence. The right side of Figure 1 depicts a snippet of the final competence ontology. An ontology instance describes a user who is competent in one or more topics, each with a certain level of expertise. Our ontology design adheres to the overlay model presented by Brusilovsky and Millán [2], where a user's knowledge is represented as a subset of a domain model. After modification, the competence ontology holds 422 competences and 224 synonyms. The synonyms are used for the autocompletion feature that supports ontology navigation, as described in the next section.
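The on-demand, synonym-aware concept retrieval could be sketched roughly as the following server-side lookup. The toy ontology, synonym table, and substring matching rule are illustrative assumptions, not the actual implementation:

```python
# Illustrative sketch of on-demand concept retrieval for autocompletion:
# substring ("wildcard") matching over concept names and synonyms, with
# each match expanded by its direct children, in the spirit of the
# approach described in Section 4.1. Data and matching rules are assumed.

# Tiny toy ontology: concept -> list of child concepts (assumed data)
CHILDREN = {
    "Programming": ["OOP"],
    "OOP": ["Java"],
    "Java": ["JDK"],
    "JDK": [],
}
SYNONYMS = {"Java Development Kit": "JDK"}  # synonym -> canonical concept

def autocomplete(query):
    """Return concepts whose name or synonym contains the query,
    expanded with each matching concept's direct children."""
    q = query.lower()
    matches = {c for c in CHILDREN if q in c.lower()}
    matches |= {c for syn, c in SYNONYMS.items() if q in syn.lower()}
    expanded = set(matches)
    for concept in matches:
        expanded.update(CHILDREN[concept])   # expand with children
    return sorted(expanded)
```

In the prototype such a lookup would sit behind an AJAX endpoint, so the client never needs the full ontology and the ontology can grow without affecting interface performance.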
4 Navigation Component
In this section, we assemble the elements that allow users (1) to navigate through the competence ontology in order to find a desired competence concept and (2) to assign a value to a selected competence representing the user's expertise level.
4.1 Versatile Ontology Navigation
Crowder et al. [5] present autocompletion text boxes and interconnected drop-down lists as means for ontology navigation. We take these user interface elements as a starting point for our work. Once a user enters a word in the autocompletion text box, the underlying ontology is queried for the concepts that best match the user's input. The query string is enhanced with wildcards, and the returned result set is further expanded with the matching concepts' children [5]. The resulting list is displayed directly below the text box. We extended the result list with the competence values assigned by the user. The left side of Figure 2 shows an example of the autocompletion text box. Eventually, the user selects the desired competence from the list and is able to modify its value. Conventional drop-down lists show all available concepts. This is cumbersome for navigation purposes, since they do not maintain an overview of the hierarchy. To solve this, interconnected drop-down lists limit the
Fig. 2. Two ways of competence selection leading to value assignment: (1a) ontology navigation using autocompletion; (1b) ontology navigation using drop-down lists (choose a concept from the top level, then from sub-levels, with a breadcrumb and values for competences already assessed or calculated); (2) value assignment.
number of elements to the number of hierarchy levels. When a user selects a competence from a list, another drop-down list pops up containing all sub-concepts of the competence selected above, as illustrated on the right side of Figure 2. Selecting the option choose... causes the lower drop-down lists to disappear. In order to offer the user a versatile ontology navigation, we combine the autocompletion text box with interconnected drop-down lists. According to Ernst et al. [9], a top-down approach especially helps users unfamiliar with the ontology. On the other hand, advanced users might directly dig into the ontology by selecting a particular concept they assume or even know to exist. Using the combined approach, users can choose their preferred way to explore the ontology. The system starts with a drop-down list of top-level ontology concepts, as shown in Figure 2. The autocompletion text box may be activated by choosing the item autocomplete... in the drop-down list, which is then replaced by an autocompletion text box. The drop-down list can be restored by double-clicking on the autocompletion text box. After selecting a competence in the autocompletion text box, the navigation component switches back to the interconnected drop-down lists, providing the competence's full path to the root of the hierarchy. The user interface presents the current value of the selected competence, as shown in the bottom part of Figure 2. For the presentation and modification of competence values, we utilize a graphical element called a bullet graph, introduced in the next section.
4.2 Assigning Values to Competences
During self-assessment, users assign numeric values between 0 and 100 points to selected competences. To illustrate this task, we introduce an interface element based on bullet graphs [10]. Originally, a bullet graph consists of a content box representing a qualitative scale, a quantitative scale, and a bar illustrating a value. Additionally, a cross bar may indicate a comparative value to qualify the value shown by the bar element. A bullet graph is usually not intended to be interactive. We therefore built an interactive bullet graph using widget elements that allows users to drag the value bar to a desired level of expertise. Furthermore, we added labels to describe the fields of the qualitative scale. The comparative value can be used for different purposes, for instance, to represent supervisors' opinions about the expertise level of their employees. Figure 3 shows the bullet graph including our modifications.
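The data behind such an interactive bullet graph can be sketched as follows. The qualitative range boundaries and labels below are hypothetical, chosen only for illustration:

```python
# Illustrative sketch of a bullet graph's data model: a 0-100
# self-assessed score, an optional comparative value (e.g., a
# supervisor's opinion), and a qualitative label derived from the
# score. Range boundaries and labels are assumptions, not the
# prototype's actual scale.

QUALITATIVE_RANGES = [           # (upper bound inclusive, label) - assumed
    (33, "beginner"),
    (66, "intermediate"),
    (100, "expert"),
]

def qualitative_label(score):
    """Map a quantitative score to a field on the qualitative scale."""
    if not 0 <= score <= 100:
        raise ValueError("expertise score must be between 0 and 100")
    for upper, label in QUALITATIVE_RANGES:
        if score <= upper:
            return label

def bullet_graph_model(score, comparative=None):
    """Bundle the values a bullet graph widget would render."""
    return {
        "value": score,                      # bar on the quantitative scale
        "label": qualitative_label(score),   # field on the qualitative scale
        "comparative": comparative,          # optional cross-bar value
    }
```

Dragging the value bar in the widget would simply update `value` (and hence `label`) before the profile is sent to the server.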
5 Presentation Component
To display a user's competence profile, we introduce a table that includes competences, their values and the relations amongst them. Since our ontology represents only hierarchical relations, we use a hierarchical approach for profile presentation, with an HTML table element as a base. Figure 4 illustrates the presentation of a user's competence profile. We proceed by incorporating the ideas
Fig. 3. Adapted bullet graph for competence self-assessment (qualitative scale with color-encoded ranges and text labels, quantitative scale, self-assessed value as a slider element, and a comparative competence value)
Fig. 4. A user's competence profile table (concept path shown on demand, color and indentation as visual cues, sorting by column, a table filter, self-assessed values, a competence's value history, and the time since the last update)
of the visual information seeking mantra by Shneiderman [15] as well as information-scented widgets [19]. We also consider the principles of cognitive support for ontology navigation by means of visual cues [6]. Using hierarchical visual cues, we adapt the intensity of the background color in each table row according to how deep the concept is located in the ontology. A tooltip at the left border of each row shows the path in the ontology leading to the concept, in reverse order. For the same purpose, we indent the competence labels. In order to distinguish two succeeding items on the same level but with different top levels, we separate the respective rows with a thicker grey line. Competence values are represented by circled numbers. When moving the mouse over a competence value, a graphical tooltip visualizes how the value changed over time by means of a filled line chart. The last column of the profile table shows the date of the last modification together with a bar chart representing the time passed since the last update. Users can personalize their competence profile table with filtering and sorting options. A filter text box allows users to filter competences by a string occurring in the full concept path represented by a row. The users can also sort each column according to their personal preferences. The components for navigation and presentation are displayed within the same view. This means the user can search for competences, assign competence values and refer to the competence profile at the same time. The functionalities of both components are linked together as well. A click on the profile table causes the navigation component to refresh and show the selected competence.
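The table logic described above (depth-based indentation and shading, full-path tooltips, and path filtering) might be sketched as follows; the toy ontology, row format, and shading scheme are assumptions for illustration only:

```python
# Illustrative sketch of the profile table: indentation and a background
# shade derived from a concept's depth in the ontology, plus filtering
# rows by a substring of the concept's full path. Not the prototype's
# actual HTML rendering.

PARENT = {"OOP": "Programming", "Java": "OOP", "JDK": "Java"}  # toy ontology

def path_to_root(concept):
    """Full concept path from the ontology root, e.g. Programming > OOP > Java."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return list(reversed(path))

def table_row(concept, value):
    depth = len(path_to_root(concept)) - 1
    return {
        "label": "  " * depth + concept,        # indentation as a visual cue
        "shade": min(depth * 20, 100),          # deeper concept, stronger shade
        "path": " > ".join(path_to_root(concept)),  # tooltip content
        "value": value,                         # circled number in the table
    }

def filter_rows(rows, needle):
    """Keep rows whose full concept path contains the filter string."""
    return [r for r in rows if needle.lower() in r["path"].lower()]
```

Sorting by any column then reduces to `sorted(rows, key=lambda r: r[column])` over the same row dictionaries.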
6 Evaluation
In order to evaluate the usefulness and usability of the proposed user interface, we conducted a usability study with students at university. When referring to usability, we measure user satisfaction and investigate how efficiently users can perform competence self-assessment using the interface. The study lasted 22 days and took place in a knowledge management course at university. Nineteen master students of computer science participated in the study. In order to ensure easy access to the user interface, we published the service on the web. The students were free to access the user interface as often and for as long as they wanted. We asked the students to self-assess their competences by navigating through the competence ontology, selecting desired competences and assigning values to these competences. We provided a short user guide describing the main features of the interface, but did not recommend strategies on how to use it. At the end of the study, the students had to fill out a questionnaire that primarily focused on the measurement of usefulness and user satisfaction. By means of the users' feedback we aimed to answer the following questions.
1. How satisfied are users with navigating the competence ontology and the selection of competences?
2. How useful is the presentation of self-assessments based on bullet graphs that show values on two different scales?
3. How useful is the presentation of a user's profile based on the profile table that displays competence values as well as the relations amongst competences?
4. How useful are the sorting and filter functions to adapt a user's competence profile?
Students were also asked to give their opinions about likes and dislikes of the user interface. The interpretation of this feedback may reveal further details on how the navigation and presentation of competences can be improved.
6.1 Results and Findings
During the study, users self-assessed 1267 competences. Figure 5 shows the results regarding the quantitative part of the questionnaire. A significant majority was mostly satisfied with the interface for ontology navigation and perceived the bullet graph as useful for specifying a competence's expertise level. As for the competence profile table, the users were predominantly convinced of its usefulness and used the sorting and filtering functions to adapt the profile table to their preferences. The user opinions mainly confirm the results shown in Figure 5. Some said that the visual navigation cues in the profile table were not clear to them. Others appreciated the extensive use of AJAX for navigation and profile presentation. Figure 6 illustrates the users' competence self-assessments on a timeline. We aggregated the data in time clusters to better show the total number of assessed
Fig. 5. Results regarding usability and usefulness (bar charts of questionnaire responses for "Ontology Navigation", "Competence Self-Assessment: Bullet Graph", "Competence Profile: Table" and "Competence Profile: Sort/Filter", on scales from Very Satisfied to Very Dissatisfied, Very Useful to Very Useless, and Yes/No)
Fig. 6. Analysing log data to measure efficiency: (a) number of competences per self-assessment value (0–100) over the period of self-assessment in days ([1-5], [6-10], [11-15], [16-20], [21,22]); (b) total number of competences per period
competences. The size of the dots in Figure 6a represents the number of competences related to a certain expertise level. Figure 6a shows that users did not use minimum or maximum values for self-assessment. We expected that users would not assign minimum values, since they were not asked to declare competences they do not possess. As for the maximum values, Figure 6a confirms the well-known phenomenon that experts seldom assign maximum expertise scores to themselves; it is assumed that experts know better than less competent people that there might be something they do not know. Figures 6a and 6b show that the number of assessed competences increased over the course of the study. Is this evidence strong enough to prove the interface to be an efficient support for self-assessment?
We could interpret the rise in competence assessments as an indication that the more competences users had already assessed, the faster further self-assessments were performed. This interpretation may be supported by the fact that only one task was given to the users at the beginning of the study. From this point on, the users were free to undertake self-assessment in the given time period and were not asked for further tasks. We can rule out the possible bias that students assessed more competences in the hope of getting better marks, since students were not required to finish the task with a profile holding a certain number of competences. There certainly is a shortcoming in our study design. Students might have been curious in the first place about how the interface is built and simply started to try it out. While attending other courses, students may stick to a plan for when to accomplish tasks for particular courses. This plan could lead to a larger workload at the end of some courses, assuming that all courses started at the same time. This might have biased our results. Another limitation of this study is that its participants were to some extent familiar with the domain and the notion of ontologies. We plan to conduct the next user study with students from different study programs to possibly gain more reliable results. Assuming that these biases did not markedly affect the results, we could conclude that the proposed user interface helps users maintain an overview of their competences, since this becomes more challenging the more competences are assessed. However, at the current stage of our research, it is not entirely clear whether the interface can be shown to support self-assessment efficiently. We will have to consider this issue in a future user study.
7 Conclusion and Future Work
Based on the problem that large competence ontologies are difficult to navigate for self-assessment, we propose an integrated user interface that allows users to easily find competences via various navigation options. In order to assign values to competences, we make use of bullet graphs, which offer both a quantitative and a qualitative scale to qualify competence scores. We further introduce a competence profile table to display assessed competences and their relations to adjacent competences. The proposed components for navigating and presenting competence profiles are functionally linked together, which allows users to approach competence self-assessment in various ways. We conducted a usability study with 19 master students enrolled in a computer science program. The results show that the users were mostly satisfied with navigating the competence ontology. They perceived the bullet graph for competence assessment as useful and were satisfied with the presentation of competences and the options to adapt it to personal preferences. We could not fully prove that the proposed user interface provides efficient competence self-assessment. Efficient means that the interface speeds up the process of self-assessment. Probable biases affecting the results may have been too
dominant for a solid claim. However, the results are promising under certain assumptions and motivate further investigation of proper evaluation methods to measure efficiency. Regarding future work, we will evaluate the use of a score propagation mechanism to recommend values for competences not yet scored, based on competences already assessed by the user. This may increase the efficiency of the self-assessment procedure. Introducing tool tips, little information chunks displayed on mouseover, may provide more detailed competence descriptions on demand. This was a desire voiced by some of the participants. Another idea considers the use of a query language in the autocompletion text box. For instance, a user could query ::recommended to retrieve the competences gained from the score propagation mentioned before. Acknowledgments. This research is part of the project SeCoMine, which calculates users' competence scores based on their contributions and social interactions in online communities. SeCoMine is fully funded by the Österreichische Forschungsförderungsgesellschaft mbH (FFG) under grant number 826459.
References
1. Bakalov, F., König-Ries, B., Nauerz, A., Welsch, M.: IntrospectiveViews: An interface for scrutinizing semantic user models. In: User Modeling, Adaptation, and Personalization, pp. 219–230 (2010)
2. Brusilovsky, P., Millán, E.: User models for adaptive hypermedia and adaptive educational systems. In: The Adaptive Web, pp. 3–53 (2007)
3. Bull, S., Kay, J.: Open learner models. In: Advances in Intelligent Tutoring Systems, pp. 301–322 (2010)
4. Colucci, S., Di Noia, T., Di Sciascio, E., Donini, F., Ragone, A.: Measuring core competencies in a clustered network of knowledge. In: Knowledge Management: Innovation, Technology and Cultures: Proceedings of the 2007 International Conference on Knowledge Management, Vienna, Austria, August 27-28, p. 279. World Scientific Pub. Co. Inc., Singapore (2007)
5. Crowder, R., Wilson, M.L., Fowler, D., Shadbolt, N., Wills, G., Wong, S.: Navigation over a large ontology for industrial web applications. In: International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, DETC2009-86544 (2009), http://eprints.ecs.soton.ac.uk/17918/
6. d'Entremont, T., Storey, M.A.: Using a degree of interest model to facilitate ontology navigation. In: IEEE Symposium on Visual Languages and Human-Centric Computing, VL/HCC 2009, pp. 127–131 (2009)
7. Dorn, J., Hochmeister, M.: TechScreen: Mining competencies in social software. In: The 3rd International Conference on Knowledge Generation, Communication and Management, Orlando, FLA, pp. 115–126 (2009)
8. Draganidis, F., Mentzas, G.: Competency based management: a review of systems and approaches. Information Management & Computer Security 14(1), 51–64 (2006)
9. Ernst, N., Storey, M., Allen, P.: Cognitive support for ontology modeling. International Journal of Human-Computer Studies 62(5), 553–577 (2005)
170
M. Hochmeister and J. Daxböck
Open Social Student Modeling: Visualizing Student Models with Parallel IntrospectiveViews

I-Han Hsiao1, Fedor Bakalov2, Peter Brusilovsky1, and Birgitta König-Ries2

1 School of Information Sciences, University of Pittsburgh, 135 N. Bellefield Ave., Pittsburgh, PA 15260, USA
2 Institute for Computer Science, University of Jena, Ernst-Abbe-Platz 2, 07743 Jena, Germany
{ihh4,peterb}@pitt.edu; {fedor.bakalov,birgitta.koenig-ries}@uni-jena.de
Abstract. This paper explores a social extension of open student modeling that we call open social student modeling. We present a specific implementation of this approach that uses parallel IntrospectiveViews to visualize models representing student progress with QuizJET parameterized self-assessment questions for Java programming. The interface visualizes not only the student's own model, but also parallel views on the models of their peers and the cumulative model of the entire class or group. The system was evaluated in a semester-long classroom study. Although the use of the system was non-mandatory, the parallel IntrospectiveViews interface increased all usage parameters in comparison to regular portal-based access, which allowed students to achieve a higher success rate in answering the questions. The collected data offer some evidence that a combination of traditional personalized guidance with social guidance was more effective than personalized guidance alone.

Keywords: Open User Model, Visualization, Parameterized Self-Assessment, Open Student Model.
Joseph A. Konstan et al. (Eds.): UMAP 2011, LNCS 6787, pp. 171–182, 2011. © Springer-Verlag Berlin Heidelberg 2011

1 Introduction

Engaging students with social learning technologies has become an important trend in modern e-learning. One of the biggest challenges is to provide support in the context of social learning while at the same time allowing students to feel in control. One popular solution to the issue of control is so-called open student modeling, an approach that permits students to observe and reflect on their progress. In particular, visual approaches to open student modeling have been explored to provide students with an easy-to-grasp, holistic view of their progress [1-3]. However, most open student modeling research focuses on the representation of an individual student, ignoring the social aspect of learning. In contrast, several social visualization approaches explored in an e-learning context [4] focus mainly on student communication and collaboration rather than on the student's progress. Our work attempts to explore the potential of open student modeling and
student progress visualization in the context of modern social e-learning. The goal is to extend the benefits of visualizing student models from the cognitive aspects to the social aspects of learning. We investigate using an open social student modeling approach, which offers parallel views of multiple student models, to guide students to the most appropriate learning content. In this paper, we explore a specific implementation of the open social student modeling approach based on the IntrospectiveViews [12] visualization. We do so in the context of a semester-long classroom study. In the next section, we provide a short review of related work on open user modeling and social learning. The system and study design are presented in Section 3. Then we report the evaluation results. Finally, we summarize this work and discuss future research plans.
2 Related Work

There are two main streams of work on open student models. One stream focuses on visualizing the model to support students' self-reflection and planning; the other encourages students to participate in the modeling process, for example by engaging them in negotiation or collaboration on the construction of the model [2]. Representations of the student model vary from high-level summaries (such as skill meters) to complex concept maps or Bayesian networks. A range of benefits of opening student models to learners has been reported, such as increased awareness of their developing knowledge, difficulties, and learning process, as well as greater engagement, motivation, and knowledge reflection [1-3]. Dimitrova et al. [5] explored interactive open learner modeling by engaging learners in negotiation with the system during the modeling process. Chen et al. [6] investigated active open learner models in order to motivate learners to improve their academic performance. Both individual and group open learner models were studied; both demonstrated an increase in reflection and helpful interactions among teammates. Bull & Kay [7] described a framework for applying open user models in adaptive learning environments and provided many in-depth examples. In our own work on the QuizGuide system [11], we embedded open learning models into adaptive link annotations and demonstrated that this arrangement can markedly increase student motivation to work with non-mandatory educational content. To support social learning, it is common to use the average values of a group to represent a particular aspect in the model. Open group modeling enables students to compare and understand their own states of learning. Such group models have been used to support collaboration between learners within the same group, and to foster competition in a group of learners [8].
Vassileva and Sun [8] investigated community visualization in online communities. They concluded that social visualization allows peer recognition and provides students with the opportunity to build trust in others and in the group. Bull & Britland [9] used OLMlets to study the problem of facilitating group collaboration and competition. The results revealed that optionally releasing the models to peers increased discussion among students and encouraged them to start working sooner. CourseVis [10] is one of the few systems providing graphical visualizations of multiple groups of users to teachers and learners. It helps instructors to identify problems early on, and to
prevent some of the common problems in distance learning. These findings motivate us to further investigate the effectiveness of social visualization techniques in open student model systems.
3 QuizJET Meets IntrospectiveViews

To explore the value of open social student modeling, we extended the educational system QuizJET with an open social student modeling interface based on a modified version of the IntrospectiveViews visualization tool. QuizJET is a system for authoring and delivering parameterized questions on the Java programming language. It generates parameterized questions for assessment and self-assessment of students' knowledge across a broad range of Java topics. The implementation and functionality of QuizJET are described in detail in [11]. The IntrospectiveViews visualization approach was first proposed for scrutinizing semantically enriched user interest models in [12]. The interface visualizes user interests as a set of keywords displayed on a circular surface gradually painted in shades between red and blue, where the gradient colors denote different degrees of interest. It also allows grouping the items into circular sectors by type, i.e., the semantic class they belong to (e.g., person, company, country).
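As a rough illustration of this idea, the sketch below groups interest keywords into equal circular sectors by semantic type and maps an interest degree in [0, 1] onto a blue-to-red shade. The function names, the equal-span layout, and the concrete RGB mapping are our assumptions for illustration, not IntrospectiveViews' actual implementation.

```python
# Sketch: group interest keywords into circular sectors by semantic type
# and map an interest degree in [0, 1] onto a blue-to-red shade.
# Illustrative only; not the actual IntrospectiveViews API.

def interest_color(degree):
    """0.0 -> pure blue (low interest), 1.0 -> pure red (high interest)."""
    r = round(255 * degree)
    return (r, 0, 255 - r)

def sector_layout(interests):
    """Assign each semantic type an equal circular sector (angles in degrees).

    interests: iterable of (keyword, semantic_type, degree) triples.
    """
    types = sorted({t for _, t, _ in interests})
    span = 360.0 / len(types)
    return {t: (i * span, (i + 1) * span) for i, t in enumerate(types)}

interests = [("Java", "language", 0.9), ("IBM", "company", 0.4),
             ("Germany", "country", 0.2)]
print(sector_layout(interests))
print(interest_color(0.9))  # mostly red
```

In a real interface the sector spans would more likely reflect the number of keywords per type, but the equal-span version keeps the geometry obvious.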
Fig. 1. Parallel IntrospectiveViews. Left pane – visualization of the student's own progress; right pane – visualization of a peer's progress. The circular sectors represent the lectures and the annular sectors represent the topics of individual lectures. The shades of the sectors indicate whether a topic has been covered and, for the covered topics, denote the progress the student has made. Color screenshots available at: http://www.minerva-portals.de/research/introspective-views/.
In [14] we presented an adapted version of IntrospectiveViews, modified to fit the context of social learning. This version visualizes learner progress rather than user interests and offers parallel views of two student models, so that users can see not only their own model, but also the models of their peers and of the class on average. Below, we briefly describe the application of parallel IntrospectiveViews to visualizing student progress on QuizJET questions. For a more detailed description, refer to [14]. Figure 1 shows parallel IntrospectiveViews for a student in a class on Object-Oriented Programming (OOP). The visualization consists of two panes: the left pane displays the student's own progress and the right one displays the progress of someone else. Each pane visualizes the respective student's progress as a pie chart. The pie chart representation was chosen for its capability to visually convey the chronological order of items and their size. The pie chart consists of several circular sectors, each representing a class lecture. The lectures are displayed in clockwise order denoting their prerequisite sequence, i.e., the order in which they are taught in class. Lectures may consist of one or several topics, which are represented as annular sectors placed within the circular sector of the corresponding lecture. The radius (width) of an annular sector denotes the amount of readings, quizzes, and exercises assigned to the topic. In a similar way, the span of a circular sector indicates the amount of learning content assigned to the corresponding lecture. Such a representation allows the student to easily estimate the amount of work required for each individual topic or lecture. The shade of each annular sector denotes whether the topic has been covered and, for the covered ones, indicates the progress the student has made on the topic.
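The clockwise, content-proportional layout described above can be sketched as follows. The lecture names and content counts are invented for illustration; the paper does not publish the layout algorithm.

```python
# Sketch: lay out lectures clockwise as circular sectors whose angular
# span is proportional to the amount of learning content assigned to them.
# Lecture names and content counts here are invented for illustration.

def lecture_sectors(content_counts):
    """content_counts: ordered (lecture, amount) pairs in prerequisite order.

    Returns (lecture, start_angle, end_angle) tuples, clockwise from 0 degrees.
    """
    total = sum(amount for _, amount in content_counts)
    sectors, angle = [], 0.0
    for lecture, amount in content_counts:
        span = 360.0 * amount / total
        sectors.append((lecture, angle, angle + span))
        angle += span
    return sectors

lectures = [("Objects and Classes", 12), ("Inheritance", 6), ("Exceptions", 6)]
for name, start, end in lecture_sectors(lectures):
    print(f"{name}: {start:.0f}-{end:.0f} degrees")
```

Topics within a lecture could be nested the same way, subdividing the lecture's angular span by per-topic content amounts.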
The sectors painted grey represent topics that have not been covered yet, whereas sectors painted in a shade from the red-to-green range represent topics that have already been covered. For the covered topics, the interface displays the student's progress. In the current implementation, progress is the ratio of successfully completed quizzes to the total quiz count in the topic. If the ratio equals 0, i.e., no quiz has been successfully completed, the sector is painted red. If it equals 1, i.e., all quizzes have been completed, the sector appears green. Shades in the range between red and green denote partial completion of the quizzes.
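This shading rule can be sketched as a linear red-to-green interpolation of the completion ratio, with grey for uncovered topics. The concrete RGB values are our assumption; the paper does not specify the actual palette.

```python
# Sketch of the topic shading rule: grey if the topic is not covered yet,
# otherwise a linear red-to-green interpolation of the completion ratio.
# The concrete RGB values are assumptions, not taken from the system.

GREY = (128, 128, 128)

def topic_color(completed, total, covered=True):
    """Map quiz completion (completed out of total) to an RGB shade."""
    if not covered:
        return GREY
    ratio = completed / total if total else 0.0
    red = round(255 * (1.0 - ratio))
    green = round(255 * ratio)
    return (red, green, 0)

print(topic_color(0, 4))   # no quiz completed -> red
print(topic_color(4, 4))   # all quizzes completed -> green
print(topic_color(2, 4))   # half completed -> intermediate shade
```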
Fig. 2. Quizzes of the selected topic
The interface allows the user to see the contents of a topic by clicking on the corresponding sector. In the current implementation, the list of questions on the topic is presented when the sector is selected (Figure 2). For each question, the interface provides a visual cue indicating the student's progress and displays the total number of attempts the student has made on the quiz and the number of successful attempts. Clicking on a quiz label displays the quiz in a new window. Our hypothesis is that such a visualization can help students plan their class work by providing an overview of their progress in the class, showing the topics they have already completed as well as those yet to be worked on. In addition, we believe that the ability to view someone else's progress can help students quickly find peers who can help with a difficult topic or quiz. The classroom study described in the next section examines whether this hypothesis holds.
4 The Classroom Study and the Results

To assess the impact of our technology, we conducted a thorough evaluation in a semester-long classroom study. The study was performed in an undergraduate Object-Oriented Programming course offered by the School of Information Sciences, University of Pittsburgh, in the Fall semester of 2010. All students received access to self-assessment quizzes through the IntrospectiveViews (IV) interface. The system was introduced to the class at the beginning of the course and served as a non-mandatory course tool over the entire semester. Of the 32 students enrolled in the course, 18 actively used the system. All student activity with the system was recorded. For every student attempt to answer a question, the system stored a timestamp, the user's name, the question, quiz, and session ids, and the result (right or wrong). We also recorded the frequency and timing of student model access and comparisons. Pre- and post-tests were administered at the beginning and the end of the semester in order to measure the gain in students' learning. At the end of the semester, the students were asked to provide subjective feedback about the system and its features by completing an evaluation questionnaire.

4.1 Effects on System Usage

On average, each student attempted 113 different questions and achieved a success rate of 71.35% in answering them. On average, students tried 9 out of 17 distinct topics and 36.5 out of 103 distinct questions. The data are summarized in Table 1. Following our prior experience with open student modeling in JavaGuide [11], we expected that the ability to view knowledge progress would encourage the students to work more with the system.
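From per-attempt records of the kind described above, the per-student usage parameters reported in this section (attempts, success rate, distinct questions, sessions) can be derived roughly as follows. The field names and records are our own illustration, not the system's actual logging schema.

```python
# Sketch: derive per-student usage statistics from attempt logs of the
# form (timestamp, user, question_id, quiz_id, session_id, correct).
# Field names and records are illustrative, not the system's real schema.
from collections import namedtuple

Attempt = namedtuple("Attempt", "ts user question quiz session correct")

def user_stats(log, user):
    """Summarize one student's activity from the raw attempt log."""
    rows = [a for a in log if a.user == user]
    attempts = len(rows)
    correct = sum(a.correct for a in rows)
    return {
        "attempts": attempts,
        "success_rate": correct / attempts if attempts else 0.0,
        "distinct_questions": len({a.question for a in rows}),
        "distinct_quizzes": len({a.quiz for a in rows}),
        "sessions": len({a.session for a in rows}),
    }

log = [
    Attempt(1, "s1", "q1", "t1", "sess1", True),
    Attempt(2, "s1", "q1", "t1", "sess1", False),
    Attempt(3, "s1", "q2", "t2", "sess2", True),
]
print(user_stats(log, "s1"))
```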
To assess this, we compared student usage of self-assessment quizzes through IV (Column 1 in Table 1) with data from a comparable class that accessed the quizzes through a traditional course portal with no progress visualization (Column 2 in Table 1) and another class that accessed the quizzes through the adaptive hypermedia system JavaGuide (Column 3 in Table 1). We found that the social visualization of student models with IntrospectiveViews resulted in a 39% increase in average attempts compared to the traditional course portal. The students also explored more topics, tried more distinct questions, and accessed the
system more frequently. In brief, we observed an increase in all usage parameters similar to what was observed in the very different JavaGuide interface. At the same time, the increase in usage was not as large as in the case of JavaGuide. As a result, no significant difference in usage was found between IV and the portal, or between IV and JavaGuide.

Table 1. Summary of Basic Statistics of System Usage
Parameters                                   (1) QuizJET w/ IV   (2) QuizJET w/ Portal   (3) JavaGuide
                                             n = 18              n = 16                  n = 22
Average User Statistics
  Attempts                                   113.05 ± 15.17      80.81 ± 22.06           125.50 ± 20.04
  Success Rate                               71.35% ± 3.39%      42.63% ± 1.99%          58.31% ± 7.92%
  Distinct Topics                            9.06 ± 1.39         7.81 ± 1.64             11.77 ± 1.19
  Distinct Questions                         36.5 ± 5.69         33.37 ± 6.50            46.18 ± 5.15
Average User Session Statistics
  Attempts                                   27.51               21.55                   30.34
  Distinct Topics                            2.20                2.31                    2.85
  Distinct Questions                         8.88                8.9                     11.16
Average Sessions                             4.11 ± 0.70         3.75 ± 0.53             4.14 ± 0.75
Pre-test score (M ± SE)                      6.38 ± 1.12         9.56 ± 1.29             4.97 ± 0.85
Post-test score (M ± SE)                     13.71 ± 1.00        17.12 ± 0.86            –
Normalized Knowledge Gain                    0.43 ± 0.07         0.36 ± 0.05             –
IntrospectiveViews Average Comparison Mode
  Class on Average                           3.33 ± 0.71         –                       –
  Peers                                      6.83 ± 2.25         –                       –
  Topics                                     4.00 ± 0.79         –                       –
  Questions                                  4.67 ± 1.36         –                       –
Since the student's own knowledge visualization was relatively similar in IV and JavaGuide, the smaller increase in student activity in IV could be attributed to the social side of open social student modeling. While access to social data can encourage less active users to do more work, it can also discourage very active users from jumping too far ahead of the class. As a result, the difference between the most active and least active users becomes smaller. Evidence that this is really happening is the observed 25% decrease in the standard deviation of the number of attempts. In turn, the class as a whole became a bit less adventurous than in the non-social JavaGuide, exploring fewer questions and topics (the variety of topics comes to some extent from more active users who run ahead of the class). This effect can also be observed in IV, especially at the session level: while the amount of work per session increases for IV, question and topic coverage stays the same. In sum, the social guidance provided by access to class progress moderates the motivating effect of progress visualization, making the whole class a bit less adventurous and more conservative than it would be without social guidance tools. An interesting question is whether a more conservative increase in the amount of work
and variety of explored content is a good or a bad thing. Our evidence shows that it might actually be a good thing. As Table 1 shows, students using social visualization in IV achieved the highest success rate (the ratio of correct solutions to total attempts) among all conditions. This is significantly higher than for the portal case, F(1,32) = 11.303, p