Key Topics in Surgical Research and Methodology
Thanos Athanasiou Haile Debas Ara Darzi (Eds.)
Thanos Athanasiou MD, PhD, FETCS Imperial College London St. Mary’s Hospital London Dept. Biosurgery & Surgical Technology 10th floor QEQM Bldg. Praed Street W2 1NY, London United Kingdom
Ara Darzi KBE, PC, FMedSci, HonFREng Imperial College London St. Mary’s Hospital London Dept. Biosurgery & Surgical Technology 10th floor QEQM Bldg. Praed Street W2 1NY, London United Kingdom
Haile T. Debas, MD UCSF Global Health Sciences 3333 California Street, Suite 285 San Francisco, CA 94143-0443 USA
ISBN: 978-3-540-71914-4
e-ISBN: 978-3-540-71915-1
DOI: 10.1007/978-3-540-71915-1
Springer Heidelberg Dordrecht London New York
Library of Congress Control Number: 2009933270
© Springer-Verlag Berlin Heidelberg 2010

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Product liability: The publishers cannot guarantee the accuracy of any information about dosage and application contained in this book. In every individual case the user must check such information by consulting the relevant literature.

Cover design: eStudio Calamar, Figueres/Berlin
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Foreword
This is the first book to distil the tried experiences and reflective thoughts of three world leaders in academic surgery, and to present comprehensively the essence of building a department of surgery of stature. It is an ambitious undertaking. The contents cover every area of activity in which a modern and successful Professor of Surgery and Chief of an academic department must be fully engaged. These include the traditional areas of teaching, research, clinical service and administration, in each of which the demands have increased enormously in recent years. In addition, however, there are other important aspects particularly relevant to today's surgical practice, such as healthcare delivery and leadership. The vast amount of new knowledge now available in these areas of responsibility is impossible to assimilate in a timely manner, even for the most conscientious. This book covers all these and other subjects with sufficient information to arm the enquirer very adequately. The many divergent threads that represent the required core skills have been excellently woven into a complex interlocking fabric. The chapters are contributed by world leaders and embody the definitive current record. It is a text for anyone who aspires to pursue a career in academic surgery; it is also essential reading for those who wish to engage in the critical and rigorous intellectual exercises of the thoughtful surgeon. In the title and preface the authors place emphasis on surgical research and its methods. This is to be interpreted in the broadest sense, as many chapters of this book focus on the clinical care of patients. As such, it is also for those who wish to provide top-quality service to their patients, whether in the university teaching-hospital environment or in the rural setting. The authors combine the best of cross-Atlantic thought on developing a surgical department of excellence.
It is not a cookbook that will ensure success, but it will shorten the learning curve and help to minimise mistakes along the way. Consulting this work gives the contextual background that aids learning on the job and contributes to world-class research. It is an academic manual of the highest quality, communicating the most refined skills of leading academic units and surgeons.
The concept of this book is to go beyond the restrictive nature of traditional surgical texts and to prepare the future academic leaders in surgery. With the changing landscape in so many fields, it is predictable that a book such as this will need to be updated regularly.

Professor John Wong
Chair of Surgery & Head, Department of Surgery
University of Hong Kong Medical Centre
Queen Mary Hospital, Hong Kong
Preface
Academic surgery has gained considerable importance over the last century, and it continues to benefit from significant advances in science and technology. Its role in the continually evolving world of modern healthcare is becoming increasingly influential. Many recent innovations in surgical practice, such as minimally invasive surgery, telerobotic surgery and metabolic surgery, have been spearheaded by academic surgeons. This has only been possible through significant efforts in the implementation of cutting-edge research and the adoption of evidence-based practice. Much of it has been realised through judicious surgical leadership and academic departmental organisations that foster an environment in which the best candidates can be selected. Central to this approach are surgeons who are not only technically proficient but also academically productive. A dynamic exchange between research and clinical expertise has not always existed in the surgical profession: numerous operative practices have few standards, or rest on paradigms that are not wholly based on the best available science or evidence. The solution is self-evident and requires the adoption of educational excellence, technical proficiency and continual innovative research. Academic surgeons are key to implementing many of these strategic goals and will require an understanding of many disciplines, ranging from basic laboratory research to statistical awareness of complex analytical methods. These proficiencies need to be accompanied by academic leadership, expertise in communication and non-technical skills. The aim of this book is to equip surgeons across all disciplines and specialities to enhance their academic know-how, so that they can work successfully within a surgical academic unit and maximise their academic potential.
The goals are to convey the fundamental scientific tenets of surgical science, and also to increase awareness of the equally important areas of departmental collaboration, business acumen, engineering knowledge and industrial know-how. The book addresses topics ranging from incorporating the best surgical evidence, applying for grants, performing a research study and applying ethics to research, to setting up a surgical education programme and running an academic department. It also communicates many of the technological highlights considered important in modern surgical practice and presents some of the most significant biomolecular concepts of the present and future. Surgical research has improved in quality over the past few decades, and we present this book to advocate further the use of high-quality research in the form of clinical research trials. We strongly emphasise the importance of randomised studies with clearly defined, clinically relevant endpoints. Many of the chapters also focus on the increasing role of biomedical technology in modern surgical practice.
They clarify the increasing need to understand and adopt these developments to augment surgical practice and patient outcomes. The role of evidence-based surgery is also given particular focus. Although reading, interpreting and applying the best knowledge from the literature is one aspect of this field, it does not represent "all the evidence" available. This book considers a broader concept of evidence and, by doing so, specifies the central role of patients themselves within evidence-based practice. This is best understood as an equilibrium between the surgeon, the patient and the healthcare institution. The concepts presented will require application within the context of healthcare organisations and institutions worldwide. Many of these are already large or are undergoing significant growth, requiring visionary leadership strategies. One example is the Academic Health Science Centre model, in which collaboration, research networking and global cooperation are imperative. The scope of this book has been targeted to allow academic surgeons to exploit their local advantages whilst bridging the gap between surgical practice, patient safety and laboratory research. It gives an overview of the importance of surgical research both locally and internationally. Many of the topics covered also highlight the importance of surgical research to governmental departments and policy makers. The book will enable surgeons to clarify and prioritise the continuous influx of knowledge within the international literature. It strives to define the characteristics of talented individuals whilst also specifying the importance of market forces and administrative management. As such, we present it as a dedicated guide to modern academic surgery. The future of the surgical profession lies in the development of our knowledge, our treatment resources and our most prized asset, surgeons themselves.
We must not only enhance our current strengths but also ensure the continual advancement of the next generation of trainees. A roadmap for the development of our future surgeons can be achieved through academic curricula, and we therefore envisage this book as a foundation guide for the training of academic surgeons. This project would not have been possible without the significant knowledge contributed by the chapter authors, many of whom are world leaders in their field. We thank our many colleagues and friends who helped us in this endeavour. The units where we work, namely the Department of Biosurgery and Surgical Technology at Imperial College London and the School of Medicine at the University of California, San Francisco, are sites of great inspiration and rewarding academic crosstalk that motivated us to write and prepare this book.

Thanos Athanasiou, London, UK
Haile T. Debas, San Francisco, USA
Ara Darzi, London, UK
Acknowledgements
The editors wish to express their particular appreciation to a number of individuals without whom this book would not have been possible. Beth Janz tirelessly managed the book from its inception, devoting long hours to communication with contributors and editors to bring it to completion. Specific thanks also go to Hutan Ashrafian, who worked with energy and skill to co-ordinate many of the authors and keep this project on track. We also recognise Christopher Rao for his dedicated graphical support on many of the chapter figures, and Erik Mayer for his continued assistance with this endeavour.
About the Editors
Mr. Thanos Athanasiou, MD, PhD, FETCS Reader in Cardiac Surgery and Consultant Cardiac Surgeon
Mr. Thanos Athanasiou is a consultant cardiothoracic surgeon at St. Mary's Hospital, Imperial College Healthcare NHS Trust, and a Reader in Cardiac Surgery in the Department of Biosurgery and Surgical Technology at Imperial College London. He specialises in complex aortic surgery, coronary artery bypass grafting (CABG), minimally invasive cardiac surgery and robotic-assisted cardiothoracic surgery. His institutional responsibility is to lead academic cardiac surgery and complex aortic surgery. He is currently supervising eight MD/PhD students and has published more than 200 peer-reviewed journal papers. He has given several invited lectures in national and international forums in the fields of technology in cardiac surgery, healthcare delivery and quality in surgery. His specialty research interests include bio-inspired robotic systems and their application in cardiothoracic surgery, outcomes research in cardiac surgery, metabolic surgery and regenerative cardiovascular strategies.
His general research interests include quality metrics in healthcare and evidence synthesis, including meta-analysis, decision analysis and economic analysis. His statistical interests include longitudinal outcomes from cardiac surgical interventions, and he has recently developed and published a novel methodology for analysing longitudinal and psychometric data. Personal web page: http://www.thanosathanasiou.co.uk
Professor Haile T. Debas, MD Executive Director, UCSF Global Health Sciences, Maurice Galante Distinguished Professor of Surgery & Dean Emeritus, School of Medicine
Haile T. Debas, MD, the Executive Director of UCSF Global Health Sciences, is recognized internationally for his contributions to academic medicine and is widely consulted on issues associated with global health. At UCSF, he served as Dean (Medicine), Vice Chancellor (Medical Affairs), and Chancellor. Dr. Debas is also the Maurice Galante Distinguished Professor of Surgery and chaired the UCSF Department of Surgery. A native of Eritrea, he received his MD from McGill University and completed his surgical training at the University of British Columbia. Under Dr. Debas's stewardship, the UCSF School of Medicine became a national model for medical education, an achievement for which he was recognized with the 2004 Abraham Flexner Award of the AAMC. His prescient grasp of the implications of fundamental changes in science led him to create several interdisciplinary research centres that have been instrumental in reorganising the scientific community at UCSF. He played a key role in developing UCSF's new campus at Mission Bay. He has held leadership positions with numerous membership organisations and professional associations, including serving as President of the American Surgical Association and Chair of the Council of Deans of the AAMC. He served for two terms as a member of the Committee on Science, Engineering, and Public Policy of the National Academy of Sciences. He is a member of the Institute of Medicine and has served as Chair of its Membership Committee. He is a fellow of the American Academy of Arts and Sciences. He currently serves on the United Nations' Commission on HIV/AIDS and Governance in Africa, and is a member of the Board of Regents of the Uniformed Services University of the Health Sciences.
Professor the Lord Darzi of Denham, KBE, PC, FMedSci, HonFREng Paul Hamlyn Chair of Surgery at Imperial College London. Honorary Consultant Surgeon at The Royal Marsden NHS Foundation Trust and Chairman of the Section of Surgery at The Institute of Cancer Research
Professor Lord Darzi holds the Paul Hamlyn Chair of Surgery at Imperial College London, where he is Head of the Department of Biosurgery and Surgical Technology. He is an honorary consultant surgeon at Imperial College Healthcare NHS Trust and the Royal Marsden Hospital. He also holds the Chair of Surgery at the Institute of Cancer Research. Professor Lord Darzi and his team are internationally respected for their innovative work in the advancement of minimally invasive surgery, robotics and allied technologies. His research is directed towards achieving best surgical practice through both innovation in surgery and enhancing the safety and quality of healthcare. This includes the evaluation of new technologies, studies of the safety and quality of care, the development of methods for enhancing healthcare delivery and new approaches to education and training. His contribution within these research fields has been outstanding, with over 500 peer-reviewed research papers published to date. In recognition of his outstanding achievements in research and development of surgical technologies, Professor Lord Darzi was elected an Honorary Fellow of the Royal Academy of Engineering and a Fellow of the Academy of Medical Sciences. Following a knighthood in 2002 for his services to medicine and surgery, he was introduced to the House of Lords in 2007 and appointed Parliamentary Under Secretary of State at the Department of Health (2007–2009). At the Prime Minister's request, he led a review of the United Kingdom's National Health Service, with the aim of achieving high-quality care for all national healthcare patients. He was awarded the Queen's approval of membership of Her Majesty's Most Honourable Privy Council in 2009. Professor Lord Darzi is currently appointed Global Ambassador for Health and Life Sciences, and Chair of NHS Global for the Cabinet Office.
Contents

1 The Role of Surgical Research . . . . . . . . . . . . . . . . 1
  Omer Aziz and John G. Hunter
2 Evidence-Based Surgery . . . . . . . . . . . . . . . . 9
  Hutan Ashrafian, Nick Sevdalis, and Thanos Athanasiou
3 The Role of the Academic Surgeon in the Evaluation of Healthcare Assessment . . . . . . . . . . . . . . . . 27
  Roger M. Greenhalgh
4 Study Design, Statistical Inference and Literature Search in Surgical Research . . . . . . . . . . . . . . . . 33
  Petros Skapinakis and Thanos Athanasiou
5 Randomised Controlled Trials: What the Surgeon Needs to Know . . . . . . . . . . . . . . . . 55
  Marcus Flather, Belinda Lees, and John Pepper
6 Monitoring Trial Effects . . . . . . . . . . . . . . . . 67
  Hutan Ashrafian, Erik Mayer, and Thanos Athanasiou
7 How to Recruit Patients in Surgical Studies . . . . . . . . . . . . . . . . 75
  Hutan Ashrafian, Simon Rowland, and Thanos Athanasiou
8 Diagnostic Tests and Diagnostic Accuracy in Surgery . . . . . . . . . . . . . . . . 83
  Catherine M. Jones, Lord Ara Darzi, and Thanos Athanasiou
9 Research in Surgical Education: A Primer . . . . . . . . . . . . . . . . 99
  Adam Dubrowski, Heather Carnahan, and Richard Reznick
10 Measurement of Surgical Performance for Delivery of a Competency-Based Training Curriculum . . . . . . . . . . . . . . . . 115
  Raj Aggarwal and Lord Ara Darzi
11 Health-Related Quality of Life and its Measurement in Surgery – Concepts and Methods . . . . . . . . . . . . . . . . 129
  Jane M. Blazeby
12 Surgical Performance Under Stress: Conceptual and Methodological Issues . . . . . . . . . . . . . . . . 141
  Sonal Arora and Nick Sevdalis
13 How Can We Assess Quality of Care in Surgery? . . . . . . . . . . . . . . . . 151
  Erik Mayer, Andre Chow, Lord Ara Darzi, and Thanos Athanasiou
14 Patient Satisfaction in Surgery . . . . . . . . . . . . . . . . 165
  Andre Chow, Erik Mayer, Lord Ara Darzi, and Thanos Athanasiou
15 How to Measure Inequality in Health Care Delivery . . . . . . . . . . . . . . . . 175
  Erik Mayer and Julian Flowers
16 The Role of Volume–Outcome Relationship in Surgery . . . . . . . . . . . . . . . . 195
  Erik Mayer, Lord Ara Darzi, and Thanos Athanasiou
17 An Introduction to Animal Research . . . . . . . . . . . . . . . . 207
  James Kinross and Lord Ara Darzi
18 The Ethics of Animal Research . . . . . . . . . . . . . . . . 229
  Hutan Ashrafian, Kamran Ahmed, and Thanos Athanasiou
19 Ethical Issues in Surgical Research . . . . . . . . . . . . . . . . 237
  Amy G. Lehman and Peter Angelos
20 Principles and Methods in Qualitative Research . . . . . . . . . . . . . . . . 243
  Roger Kneebone and Heather Fry
21 Safety in Surgery . . . . . . . . . . . . . . . . 255
  Charles Vincent and Krishna Moorthy
22 Safety and Hazards in Surgical Research . . . . . . . . . . . . . . . . 271
  Shirish Prabhudesai and Gretta Roberts
23 Fraud in Surgical Research – A Framework of Action Is Required . . . . . . . . . . . . . . . . 283
  Conor J. Shields, Desmond C. Winter, and Patrick Broe
24 A Framework Is Required to Reduce Publication Bias: The Academic Surgeon's View . . . . . . . . . . . . . . . . 293
  Ronnie Tung-Ping Poon and John Wong
25 Data Collection, Database Development and Quality Control: Guidance for Clinical Research Studies . . . . . . . . . . . . . . . . 305
  Daniel R. Leff, Richard E. Lovegrove, Lord Ara Darzi, and Thanos Athanasiou
26 The Role of Computers and the Type of Computing Skills Required in Surgery . . . . . . . . . . . . . . . . 321
  Julian J. H. Leong
27 Computational and Statistical Methodologies for Data Mining in Bioinformatics . . . . . . . . . . . . . . . . 337
  Lee Lancashire and Graham Ball
28 The Use of Bayesian Networks in Decision-Making . . . . . . . . . . . . . . . . 351
  Zhifang Ni, Lawrence D. Phillips, and George B. Hanna
29 A Bayesian Framework for Assessing New Surgical Health Technologies . . . . . . . . . . . . . . . . 361
  Elisabeth Fenwick
30 Systematic Reviews and Meta-Analyses in Surgery . . . . . . . . . . . . . . . . 375
  Sukhmeet S. Panesar, Weiming Siow, and Thanos Athanasiou
31 Decision Analysis . . . . . . . . . . . . . . . . 399
  Christopher Rao and Thanos Athanasiou
32 Cost-Effectiveness Analysis . . . . . . . . . . . . . . . . 411
  Christopher Rao and Thanos Athanasiou
33 Value of Information Analysis . . . . . . . . . . . . . . . . 421
  Christopher Rao and Thanos Athanasiou
34 Methodological Framework for Evaluation and Prevention of Publication Bias in Surgical Studies . . . . . . . . . . . . . . . . 429
  Danny Yakoub, Sukhmeet S. Panesar, and Thanos Athanasiou
35 Graphs in Statistical Analysis . . . . . . . . . . . . . . . . 441
  Akram R.G. Hanna, Christopher Rao, and Thanos Athanasiou
36 Questionnaires, Surveys, Scales in Surgical Research: Concepts and Methodology . . . . . . . . . . . . . . . . 477
  Mohammed Shamim Rahman, Sana Usman, Oliver Warren, and Thanos Athanasiou
37 How to Perform Analysis of Survival Data in Surgery . . . . . . . . . . . . . . . . 495
  Fotios Sianis
38 Risk Stratification and Prediction Modelling in Surgery . . . . . . . . . . . . . . . . 507
  Vassilis G. Hadjianastassiou, Thanos Athanasiou, and Linda J. Hands
39 The Principles and Role of Medical Imaging in Surgery . . . . . . . . . . . . . . . . 529
  Daniel Elson and Guang-Zhong Yang
40 How to Read a Paper . . . . . . . . . . . . . . . . 545
  Hutan Ashrafian and Thanos Athanasiou
41 How to Evaluate the Quality of the Published Literature . . . . . . . . . . . . . . . . 557
  Andre Chow, Sanjay Purkayastha, and Thanos Athanasiou
42 How to Write a Surgical Paper . . . . . . . . . . . . . . . . 569
  Sanjay Purkayastha
43 A Primer for Grant Applications . . . . . . . . . . . . . . . . 579
  Hutan Ashrafian, Alison Mortlock, and Thanos Athanasiou
44 Key Aspects of Grant Applications: The Surgical Viewpoint . . . . . . . . . . . . . . . . 587
  Bari Murtuza and Thanos Athanasiou
45 How to Organise an Educational Research Programme Within an Academic Surgical Unit . . . . . . . . . . . . . . . . 597
  Kamran Ahmed, Hutan Ashrafian, and Paraskevas Paraskeva
46 How to Structure an Academic Lecture . . . . . . . . . . . . . . . . 605
  Bari Murtuza and Thanos Athanasiou
47 How to Write a Book Proposal . . . . . . . . . . . . . . . . 611
  Christopher Rao and Thanos Athanasiou
48 How to Organise a Surgical Meeting: National and International . . . . . . . . . . . . . . . . 615
  Bari Murtuza and Thanos Athanasiou
49 Presentation Skills in Surgery . . . . . . . . . . . . . . . . 625
  Sanjay Purkayastha
50 Internet Research Resources for Surgeons . . . . . . . . . . . . . . . . 629
  Santhini Jeyarajah and Sanjay Purkayastha
51 Clinical Practice Guidelines in Surgery . . . . . . . . . . . . . . . . 637
  Shawn Forbes, Cagla Eskicioglu, and Robin McLeod
52 From Idea to Bedside: The Process of Surgical Invention and Innovation . . . . . . . . . . . . . . . . 647
  James Wall, Geoffrey C. Gurtner, and Michael T. Longaker
53 Research Governance and Research Funding in the USA: What the Academic Surgeon Needs to Know . . . . . . . . . . . . . . . . 657
  Michael W. Mulholland and James A. Bell
54 Research Governance in the UK: What the Academic Surgeon Needs to Know . . . . . . . . . . . . . . . . 669
  Gary C. Roper
55 Research Funding, Applying for Grants and Research Budgeting in the UK: What the Academic Surgeon Needs to Know . . . . . . . . . . . . . . . . 677
  Karen M. Sergiou
56 How to Enhance Development and Collaboration in Surgical Research . . . . . . . . . . . . . . . . 695
  Peter Ellis
57 Mentoring in Academic Surgery . . . . . . . . . . . . . . . . 715
  Oliver Warren and Penny Humphris
58 Leadership in Academic Surgery . . . . . . . . . . . . . . . . 727
  Oliver Warren and Penny Humphris
59 Using Skills from Art in Surgical Practice and Research – Surgery and Art . . . . . . . . . . . . . . . . 741
  Donna Winderbank-Scott
60 Administration of the Academic Department of Surgery . . . . . . . . . . . . . . . . 753
  Carlos A. Pellegrini, Avalon R. Lance, and Haile T. Debas
61 Information Transfer and Communication in Surgery: A Need for Improvement . . . . . . . . . . . . . . . . 771
  Kamal Nagpal and Krishna Moorthy
62 General Surgery: Current Trends and Recent Innovations . . . . . . . . . . . . . . . . 781
  John P. Cullen and Mark A. Talamini
63 Upper Gastrointestinal Surgery: Current Trends and Recent Innovations . . . . . . . . . . . . . . . . 793
  Danny Yakoub, Oliver Priest, Akram R. George, and George B. Hanna
64 Colorectal Cancer Surgery: Current Trends and Recent Innovations . . . . . . . . . . . . . . . . 815
  Oliver Priest, Paul Ziprin, and Peter W. Marcello
65 Urology: Current Trends and Recent Innovations . . . . . . . . . . . . . . . . 833
  Erik Mayer and Justin Vale
66 Cardiothoracic Surgery: Current Trends and Recent Innovations . . . . . . . . . . . . . . . . 849
  Joanna Chikwe, Thanos Athanasiou, and Adanna Akujuo
67 Vascular Surgery: Current Trends and Recent Innovations . . . . . . . . . . . . . . . . 875
  Mark A. Farber, William A. Marston, and Nicholas Cheshire
68 Breast Surgery: Current Trends and Recent Innovations . . . . . . . . . . . . . . . . 895
  Dimitri J. Hadjiminas
69 Thyroid Surgery: Current Trends and Recent Innovations . . . . . . . . . . . . . . . . 905
  Charlie Huins and Neil Samuel Tolley
70 Orthopaedic Surgery: Current Trends and Recent Innovations . . . . . . . . . . . . . . . . 913
  Andrew Carr and Stephen Gwilym
71 Plastic, Reconstructive and Aesthetic Surgery: Current Trends and Recent Innovations . . . . . . . . . . . . . . . . 923
  Marios Nicolaou, Matthew D. Gardiner, and Jagdeep Nanchahal
72 Neurosurgery: Current Trends and Recent Innovations . . . . . . . . . . . . . . . . 941
  David G.T. Thomas and Laurence Watkins
73 Molecular Techniques in Surgical Research . . . . . . . . . . . . . . . . 951
  Athanassios Kotsinas, Michalis Liontos, Ioannis S. Pateras, and Vassilis G. Gorgoulis
74 Molecular Carcinogenesis . . . . . . . . . . . . . . . . 975
  Michael Zachariadis, Konstantinos Evangelou, Nikolaos G. Kastrinakis, Panagiota Papanagnou, and Vassilis G. Gorgoulis

Subject Index . . . . . . . . . . . . . . . . 1005
Contributors
Raj Aggarwal Department of Biosurgery and Surgical Technology, Imperial College London, 10th Floor, QEQM Building, St. Mary’s Hospital, Praed Street, London W2 1NY, UK
[email protected] Kamran Ahmed, MBBS, MRCS The Department of Biosurgery and Surgical Technology, Imperial College London, 10th Floor, Queen Elizabeth the Queen Mother (QEQM) Building, Imperial College Healthcare NHS Trust, at St Mary’s Hospital Campus, Praed Street, London W2 1NY, UK
[email protected] Adanna Akujuo, MD Department of Cardiothoracic Surgery, Mount Sinai Medical Centre, 1190 Fifth Avenue, New York, NY 10029, USA
[email protected] Peter Angelos, MD, PhD, FACS Department of Surgery,The University of Chicago, University of Chicago Medical Center, 5841 South Maryland Avenue, MC 5031, Chicago, IL 60637, USA
[email protected] Sonal Arora, BSc, MBBS, MRCS Department of Biosurgery and Surgical Technology, Imperial College London, 10th floor, QEQM, St. Mary’s Hospital, South Wharf Road, London W2 1NY, UK
[email protected] Hutan Ashrafian, MBBS, BSc(Hons), MRCS The Department of Biosurgery and Surgical Technology, Imperial College London, 10th Floor, Queen Elizabeth the Queen Mother (QEQM) Building, Imperial College Healthcare NHS Trust, St Mary’s Hospital Campus, Praed Street, London W2 1NY, UK
[email protected] Thanos Athanasiou, MD, PhD, FETCS The Department of Biosurgery and Surgical Technology, Imperial College London, 10th Floor, Queen Elizabeth the Queen Mother (QEQM) Building, Imperial College Healthcare NHS Trust, St Mary’s Hospital Campus, Praed Street, London W2 1NY, UK
[email protected] xxi
xxii
Omer Aziz, MBBS, BSc, MRCS Department of Biosurgery and Surgical Technology, Imperial College London, 10th Floor QEQM Building, St. Mary’s Hospital, Praed Street, London W2 1NY, UK
[email protected] Graham Ball, BSc, PhD The John Van Geest Cancer Research Centre, School of Science and Technology, Nottingham Trent University, Clifton Lane, Nottingham NG11 8NS, UK
[email protected] James A. Bell, CPA, JD University of Michigan Health Systems, 2101 Taubman Center/SPC 5346, 1500 East Medical Center Drive, Ann Arbor, MI 48109, USA
[email protected] Jane M. Blazeby, MSc, MD, FRCS University Hospitals Bristol, NHS Foundation Trust, Level 7, Bristol Royal Infirmary, Marlborough Street, Bristol BS2 8HW, UK
[email protected] Patrick Broe, MCh, FRCSI Royal College of Surgeons in Ireland, Beaumont Hospital, Dublin, Ireland
[email protected] Heather Carnahan, PhD Department of Occupational Science and Occupational Therapy, University of Toronto, The Wilson Centre, 200 Elizabeth Street, Toronto, ON, M5G 2C4, Canada
[email protected] Andrew Carr, FRCS Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, University of Oxford, Nuffield Orthopaedic Centre, Windmill Road, Oxford OX3 7LD, UK
[email protected] Nicholas Cheshire, MD, FRCS Department of Biosurgery and Surgical Technology, Imperial College London, 10th Floor, QEQM Wing, St Mary’s Hospital, Praed Street, London W2 1NY, UK
[email protected] Joanna Chikwe, MD, FRCS Department of Cardiothoracic Surgery, Mount Sinai Medical Centre, 1190 Fifth Avenue, New York, NY 10029, USA
[email protected] Andre Chow, BSc (Hons), MBBS, MRCS Department of Biosurgery and Surgical Technology, Imperial College London, QEQM Building, St Mary’s Hospital Campus, 10th Floor, Praed Street, London, W2 1NY, UK
[email protected] John P. Cullen, MD Department of Surgery, University of California at San Diego, 200 West Arbor Drive, 8400, San Diego, CA, USA
Contributors
Lord Ara Darzi, MD, FRCS, KBE The Department of Biosurgery and Surgical Technology, Imperial College London, 10th Floor, Queen Elizabeth the Queen Mother (QEQM) Building, Imperial College Healthcare NHS Trust, St Mary’s Hospital Campus, Praed Street, London W2 1NY, UK
[email protected] Haile T. Debas University of California, 3333 California Street, Suite 285, San Francisco, CA 94143-0443, USA
[email protected] Adam Dubrowski, PhD Centre for Nursing Education Research, University of Toronto, 155 College Street, Toronto, ON, M5T 1P8, Canada
[email protected] Peter Ellis People in Health, Ability House, 7 Portland Place, London W1B 1PP, UK
[email protected] Daniel Elson, PhD Department of Biosurgery and Surgical Technology, Institute of Biomedical Engineering, Imperial College London, London SW7 2AZ, UK
[email protected] Cagla Eskicioglu, MD Department of Health Policy, Management and Evaluation, University of Toronto, Toronto, ON, and Zane Cohen Digestive Diseases Clinical Research Centre, Samuel Lunenfeld Research Institute, Mount Sinai Hospital, Toronto, ON, Canada Konstantinos Evangelou, BSc, MD, PhD Department of Histology & Embryology, Molecular Carcinogenesis Group, Medical School, University of Athens, 75 Mikras Asias Street, Goudi, Athens 11527, Greece
[email protected] Mark A. Farber, MD, FACS University of North Carolina, 3025 Burnett Womack, Chapel Hill, NC 27599, USA
[email protected] Elisabeth Fenwick Community Based Sciences, University of Glasgow, 1 Lilybank Gardens, Glasgow G12 8RZ, UK
[email protected] Marcus Flather, FRCP Clinical Trials and Evaluation Unit, Royal Brompton Hospital and Imperial College, London SW3 6NP, UK
[email protected] Julian Flowers Eastern Region Public Health Observatory, Institute of Public Health, Robinson Way, Cambridge CB2 0SR, UK
[email protected]
Shawn Forbes, BSc, MD, FRCSC Department of Surgery, University of Toronto, Toronto, ON, and Zane Cohen Digestive Diseases Clinical Research Centre, Samuel Lunenfeld Research Institute, Mount Sinai Hospital, Toronto, ON, Canada Heather Fry, BA, Dip Ed, Dip ARM, MPhil Higher Education Funding Council for England, Northavon House, Coldharbour Lane, Bristol BS16 1QD, UK
[email protected] Matthew D. Gardiner, MA, BM, BCh, MRCS Kennedy Institute of Rheumatology Division, Imperial College London, 65 Aspenlea Road, London W6 8LH, UK
[email protected] Akram R.H. George, MBBS, MRCS The Department of Biosurgery and Surgical Technology, Imperial College London, 10th Floor, Queen Elizabeth the Queen Mother (QEQM) Building, Imperial College Healthcare NHS Trust, St Mary’s Hospital Campus, Praed Street, London W2 1NY, UK
[email protected] Roger M. Greenhalgh, MA, MD, MChir, FRCS Division of Surgery, Oncology, Reproductive Biology & Anaesthetics, Imperial College, Charing Cross Hospital, Fulham Palace Road, London W6 8RF, UK
[email protected] Vassilis G. Gorgoulis, MD, PhD Department of Histology & Embryology, Molecular Carcinogenesis Group, Medical School, University of Athens, 75 Mikras Asias Street, Goudi, Athens 11527, Greece
[email protected] Geoffrey C. Gurtner, MD The Division of Plastic and Reconstructive Surgery, Department of Surgery, Stanford University School of Medicine, Stanford University, 257 Campus Drive, Stanford, CA 94305-5148, USA
[email protected] Stephen Gwilym, MRCS Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, University of Oxford, Nuffield Orthopaedic Centre, Windmill Road, Oxford OX3 7LD, UK
[email protected] Vassilis G. Hadjianastassiou, DM (Oxon), FEBVS, FRCS (Gen), BMBCh (Oxon), BSc Department of Transplantation, Directorate of Nephrology Transplantation and Urology, Guy’s & St. Thomas’ NHS Foundation Trust, Guy’s Hospital, St. Thomas’ Street, London SE1 9RT, UK
[email protected] Dimitri J. Hadjiminas, MD, FRCS Department of Breast and Endocrine Surgery, St Mary’s Hospital NHS Trust, Praed Street, London W2 1NY, UK
[email protected]
George B. Hanna, PhD, FRCS The Department of Biosurgery and Surgical Technology, Imperial College London, 10th Floor, Queen Elizabeth the Queen Mother (QEQM) Building, Imperial College Healthcare NHS Trust, St Mary’s Hospital Campus, Praed Street, London W2 1NY, UK
[email protected] Linda J. Hands, MA, BSc, MBBS, FRCS, MS Nuffield Department of Surgery, University of Oxford, 6th Floor, John Radcliffe Hospital, Headley Way, Headington, Oxford OX3 9DU, UK
[email protected] Charlie Huins, BSc (Hons), MRCS, DOHNS, MSc Department of Ear, Nose and Throat Surgery, St Mary’s Hospital Trust, Praed Street, London W2 1NY, UK
[email protected] Penny Humphris, MSc (Econ), CBE The Department of BioSurgery and Surgical Technology, Imperial College London, 10th Floor, Queen Elizabeth the Queen Mother (QEQM) Building, Imperial College Healthcare NHS Trust, St Mary’s Hospital Campus, Praed Street, London W2 1NY, UK John G. Hunter, MD, FACS Department of Surgery, Oregon Health & Science University, 3181 SW Sam Jackson Park Road – L223, Portland, OR 97239-3098, USA
[email protected] Santhini Jeyarajah Department of General Surgery, Royal London Hospital, Whitechapel, London E1 1BB, UK
[email protected] Catherine M. Jones, BSc, FRCR The Department of Biosurgery and Surgical Technology, Imperial College London, 10th Floor, Queen Elizabeth the Queen Mother (QEQM) Building, Imperial College Healthcare NHS Trust, St Mary’s Hospital Campus, Praed Street, London W2 1NY, UK
[email protected] Nikolaos G. Kastrinakis, BSc, MSc, PhD Department of Histology & Embryology, Molecular Carcinogenesis Group, Medical School, University of Athens, 75 Mikras Asias Street, Goudi, Athens 11527, Greece
[email protected] Roger Kneebone, PhD, FRCS, FRCGP Department of Biosurgery and Surgical Technology, Chancellor’s Teaching Centre, 2nd Floor QEQM Wing, Imperial College London, St Mary’s Hospital, Praed Street, London W2 1NY, UK
[email protected] James Kinross, MBBS, BSc, MRCS Department of Biosurgery and Surgical Technology, Imperial College, 10th floor, QEQM, St. Mary’s Hospital, Praed Street, London, W2 1NY, UK
[email protected]
Athanassios Kotsinas, BSc, PhD Department of Histology–Embryology, Molecular Carcinogenesis Group, Medical School, University of Athens, 75 Mikras Asias Street, Goudi, Athens 11527, Greece
[email protected] Lee Lancashire, BSc, MSc, PhD Paterson Institute for Cancer Research, University of Manchester, Manchester M20 4BX, UK
[email protected] Avalon R. Lance, BSN, MHA Department of Surgery, University of Washington, Box 356410, Seattle, WA 98195-6410, USA
[email protected] Belinda Lees, PhD Clinical Trials and Evaluation Unit, Royal Brompton and Harefield NHS Trust, Sydney Street, London, and National Heart and Lung Institute, Imperial College London, London, UK Daniel R. Leff, MBBS, MRCS Department of Biosurgery and Surgical Technology, Imperial College London, 10th Floor, QEQM Building, St Mary’s Hospital Campus, Praed Street, London, W2 1NY, UK
[email protected] Amy G. Lehman, MD, MBA Department of Surgery, The University of Chicago, University of Chicago Medical Center, 5841 South Maryland Avenue, MC 5031, Chicago, IL 60637, USA Julian J. H. Leong, MA, MBBS, MRCS The Department of Biosurgery and Surgical Technology, Imperial College London, 10th Floor, Queen Elizabeth the Queen Mother (QEQM) Building, Imperial College Healthcare NHS Trust, St Mary’s Hospital Campus, Praed Street, London W2 1NY, UK
[email protected] Michalis Liontos, MD Department of Histology–Embryology, Molecular Carcinogenesis Group, Medical School, University of Athens, 75 Mikras Asias Street, Goudi, Athens 11527, Greece
[email protected] Michael T. Longaker, MD, MBA The Division of Plastic and Reconstructive Surgery, Department of Surgery, Stanford University School of Medicine, Stanford University, 257 Campus Drive, Stanford, CA 94305-5148, USA
[email protected] Richard E. Lovegrove, MBBS, MRCS Department of Biosurgery and Surgical Technology, Imperial College London, 10th Floor, QEQM Building, St Mary’s Hospital Campus, Praed Street, London W2 1NY, UK
[email protected] Peter W. Marcello, FACS Department of Colon & Rectal Surgery, Lahey Clinic, 41 Mall Road, Burlington, MA 01805, USA
[email protected]
William A. Marston, FACS University of North Carolina, 3025 Burnett Womack, Chapel Hill, NC 27599, USA Erik Mayer, BSc (Hons), MBBS, MRCS Department of Biosurgery and Surgical Technology, Imperial College London, 10th Floor, QEQM Building, St Mary’s Hospital Campus, Praed Street, London, W2 1NY, UK
[email protected] Robin McLeod, MD, FRCSC, FACS Department of Surgery, University of Toronto, Toronto, ON, Canada
[email protected] Krishna Moorthy, MS, MD, FRCS The Department of Biosurgery and Surgical Technology, Imperial College London, 10th Floor, Queen Elizabeth the Queen Mother (QEQM) Building, Imperial College Healthcare NHS Trust, St Mary’s Hospital Campus, Praed Street, London W2 1NY, UK
[email protected] Alison Mortlock, BSc, PhD The Department of Biosurgery and Surgical Technology, Imperial College London, 10th Floor, Queen Elizabeth the Queen Mother (QEQM) Building, Imperial College London, St Mary’s Hospital Campus, Praed Street, London W2 1NY, UK
[email protected] Michael W. Mulholland, MD, PhD University of Michigan Health Systems, 2101 Taubman Center/SPC 5346, 1500 East Medical Center Drive, Ann Arbor, MI 48109, USA
[email protected] Bari Murtuza, MA, PhD, FRCS (Eng) The Department of Biosurgery and Surgical Technology, Imperial College London, 10th Floor, Queen Elizabeth the Queen Mother (QEQM) Building, Imperial College Healthcare NHS Trust, St Mary’s Hospital Campus, Praed Street, London W2 1NY, UK
[email protected] Kamal Nagpal, MBBS, MS, MRCS Department of Biosurgery and Surgical Technology, Imperial College London, 10th Floor QEQM Building, St. Mary’s Hospital, Praed Street, London W2 1NY, UK
[email protected] Jagdeep Nanchahal, BSc, PhD, MBBS, FRCS (Plast), FRACS Kennedy Institute of Rheumatology Division, Imperial College London, 1 Aspenlea Road, London W6 8RF, UK
[email protected] Zhifang Ni, MSc The Department of Biosurgery and Surgical Technology, Imperial College London, 10th Floor, Queen Elizabeth the Queen Mother (QEQM) Building, Imperial College Healthcare NHS Trust at St Mary’s Hospital Campus, Praed Street, London W2 1NY, UK
[email protected]
Marios Nicolaou, BMedSci, BM, BS, MRCS, PhD Imperial College London, Queen Elizabeth the Queen Mother (QEQM) Building, Imperial College Healthcare NHS Trust, St Mary’s Hospital Campus, Praed Street, London W2 1NY, UK
[email protected] Sukhmeet S. Panesar, MBBS, BSc (Hons), AICSM National Patient Safety Agency, 4–8 Maple Street, London W1T 5HD, UK
[email protected] Panagiota Papanagnou, BSc Department of Histology & Embryology, Molecular Carcinogenesis Group, Medical School, University of Athens, 75 Mikras Asias Street, Goudi, Athens 11527, Greece
[email protected] Paraskevas Paraskeva, PhD, FRCS The Department of Biosurgery and Surgical Technology, Imperial College London, 10th Floor, Queen Elizabeth the Queen Mother (QEQM) Building, Imperial College Healthcare NHS Trust, St Mary’s Hospital Campus, Praed Street, London W2 1NY, UK
[email protected] Ioannis S. Pateras, MD Department of Histology-Embryology, Molecular Carcinogenesis Group, Medical School, University of Athens, 75 Mikras Asias Street, Goudi, Athens 11527, Greece Carlos A. Pellegrini, MD, FACS Department of Surgery, University of Washington, Box 356410, Seattle, WA 98195-6410, USA
[email protected] John Pepper, FRCS National Heart and Lung Institute, Imperial College, London and Clinical Trials and Evaluation Unit, Royal Brompton and Harefield NHS Trust, Sydney Street, London, UK Lawrence D. Phillips, PhD The Department of Management, London School of Economics and Political Science, Houghton Street, London WC2A 2AE, UK
[email protected] Ronnie Tung-Ping Poon, MBBS, MS, PhD, FRCS (Edin), FACS Department of Surgery, Queen Mary Hospital, 102 Pokfulam Road, Hong Kong, China
[email protected] Shirish Prabhudesai, MS, MRCS Bart’s and the London Hospital NHS Trust, The Royal London Hospital, Whitechapel, London E1 1BB, UK
[email protected] Oliver Priest, MBChB, MRCS Department of Biosurgery and Surgical Technology, Imperial College London, 10th Floor, QEQM Building, St Mary’s Hospital Campus, Praed Street, London W2 1NY, UK
[email protected]
Sanjay Purkayastha, MD, MRCS Department of Biosurgery and Surgical Technology, Imperial College London, QEQM Building, St. Mary’s Hospital, 10th Floor, Praed Street, London W2 1NY, UK
[email protected] Mohammed Shamim Rahman, MBBS, MRCP The Department of Biosurgery and Surgical Technology, Imperial College London, 10th Floor, Queen Elizabeth the Queen Mother (QEQM) Building, Imperial College Healthcare NHS Trust at St Mary’s Hospital Campus, Praed Street, London W2 1NY, UK
[email protected] Christopher Rao, MBBS, BSc (Hons) Department of Biosurgery and Surgical Technology, Imperial College London, 10th Floor, Queen Elizabeth the Queen Mother (QEQM) Building, Imperial College Healthcare NHS Trust at St Mary’s Hospital Campus, Praed Street, London W2 1NY, UK
[email protected] Richard Reznick, MD Department of Surgery, University of Toronto, 100 College Street, 311, Toronto, ON, M5G 1L5, Canada
[email protected] Gretta Roberts, BSc, PhD The Department of Biosurgery and Surgical Technology, Imperial College London, 10th Floor, Queen Elizabeth the Queen Mother (QEQM) Building, St Mary’s Hospital, Praed Street, London W2 1NY, UK
[email protected] Gary C. Roper Imperial College London, Imperial College Healthcare NHS Trust, AHSC Joint Research Office, G02 Sir Alexander Fleming Building, Exhibition Road, London SW7 2AZ, UK
[email protected] Simon Rowland The Departments of Biosurgery and Surgical Technology, Imperial College London, 10th Floor, Queen Elizabeth the Queen Mother (QEQM) Building, Imperial College London, St Mary’s Hospital Campus, Praed Street, London W2 1NY, UK
[email protected] Karen M. Sergiou Research Office, Imperial College London, Exhibition Road, London SW7 2AZ, UK k.sergiou@imperial.ac.uk Nick Sevdalis, PhD National Institute for Health Research, Imperial Centre for Patient Safety and Service Quality, Imperial College London, London, and Clinical Safety Research Unit, The Department of Biosurgery and Surgical Technology, Imperial College London, London, UK
[email protected] Conor J. Shields, BSc, MD, FRCSI Department of Surgery, Mater Misericordiae University Hospital, Eccles Street, Dublin, Ireland
[email protected]
Fotios Sianis, PhD Department of Mathematics, University of Athens, Panepistemiopolis, Athens 15784, Greece
[email protected] Weiming Siow, MBBS, BSc (Hons) North Middlesex University NHS Hospital, Sterling Way, London N18 1QX, UK
[email protected] Petros Skapinakis, MD, MPH, PhD University of Ioannina, School of Medicine, Ioannina 45110, Greece p.skapinakis@gmail.com Mark A. Talamini, MD, FACS Department of Surgery, University of California at San Diego, 200 West Arbor Drive, 8400, San Diego, CA 92103, USA
[email protected] David G.T. Thomas, FRCS The National Hospital for Neurology and Neurosurgery, Institute of Neurology, Queen Square, London WC1N 3BG, UK
[email protected] Neil Samuel Tolley, MD, FRCS, DLO Department of Ear, Nose and Throat Surgery, St Mary’s Hospital, Imperial College Healthcare NHS Trust, Praed Street, London W2 1NY, UK
[email protected] Sana Usman, BSc, MBBS The Department of Biosurgery and Surgical Technology, Imperial College London, 10th Floor, Queen Elizabeth the Queen Mother (QEQM) Building, Imperial College Healthcare NHS Trust, St Mary’s Hospital Campus, Praed Street, London W2 1NY, UK
[email protected] Justin Vale, MS, FRCS (Urol) Imperial College Healthcare NHS Trust, St Mary’s Hospital, Praed Street, London W2 1NY, UK
[email protected] Charles Vincent, BA, MPhil, PhD The Department of Biosurgery and Surgical Technology, Imperial College London, 10th Floor, Queen Elizabeth the Queen Mother (QEQM) Building, Imperial College Healthcare NHS Trust, St Mary’s Hospital Campus, Praed Street, London W2 1NY, UK James Wall, MD The Department of Surgery, Stanford University School of Medicine, Stanford University, 257 Campus Drive, Stanford, CA 94305-5148, USA Oliver Warren, BSc (Hons), MRCS (Eng) The Department of BioSurgery and Surgical Technology, Imperial College London, 10th Floor, Queen Elizabeth the Queen Mother (QEQM) Building, Imperial College Healthcare NHS Trust, St Mary’s Hospital Campus, Praed Street, London W2 1NY, UK
[email protected]
Laurence Watkins, FRCS Victor Horsley Department of Neurosurgery, The National Hospital for Neurology and Neurosurgery, Queen Square, London WC1N 3BG, UK
[email protected] Donna Winderbank-Scott, MBBS, BSc, AICSM The Department of Biosurgery and Surgical Technology, Imperial College London, 10th Floor, Queen Elizabeth the Queen Mother (QEQM) Building, Imperial College Healthcare NHS Trust, St Mary’s Hospital Campus, Praed Street, London W2 1NY, UK
[email protected] Desmond C. Winter, MD, FRCSI Department of Surgery, St. Vincent’s University Hospital, Dublin, Ireland
[email protected] John Wong, MBBS, PhD, FRACS, FACS Department of Surgery, Queen Mary Hospital, 102 Pokfulam Road, Hong Kong, China
[email protected] Danny Yakoub, MBBCh, MSc, MRCSEd Department of Surgery, Staten Island University Hospital, 475 Seaview Avenue, Staten Island, New York, NY 10305, USA
[email protected] Guang-Zhong Yang Institute of Biomedical Engineering, Imperial College London, London, and Royal Society/Wolfson MIC Laboratory, 305/306 Huxley Building, Department of Computing, 180 Queens Gate, Imperial College of Science, Technology, and Medicine, London SW7 2BZ, UK
[email protected] Michael Zachariadis, BSc, PhD Department of Histology & Embryology, Molecular Carcinogenesis Group, Medical School, University of Athens, 75 Mikras Asias Street, Goudi, Athens 11527, and Department of Anatomy, Medical School, University of Athens, 75 Mikras Asias Street, Goudi, Athens 11527, Greece
[email protected] Paul Ziprin, MBBS, MD, FRCS The Department of Biosurgery and Surgical Technology, Imperial College London, 10th Floor, Queen Elizabeth the Queen Mother (QEQM) Building, Imperial College Healthcare NHS Trust, St Mary’s Hospital Campus, Praed Street, London W2 1NY, UK
[email protected]
The Role of Surgical Research Omer Aziz and John G. Hunter
Contents
1.1 Introduction ............................................................ 1
1.2 The Aims of Surgical Research ............................. 2
1.3 Translating Surgical Research into Practice ........ 3
1.4 Challenges Faced by the Twenty-First Century Academic Surgeon .................................. 5
1.5 The Role of the Academic Surgeon in Teaching .............................................................. 7
1.6 The Future of Surgical Research .......................... 7
References ........................................................................... 7
O. Aziz ()
Department of Biosurgery and Surgical Technology, Imperial College London, 10th Floor QEQM Building, St. Mary’s Hospital, Praed Street, London W2 1NY, UK
e-mail: [email protected]

Abstract This chapter outlines the role of surgical research in advancing clinical knowledge, achieving better clinical outcomes and ultimately improving the quality of patient care. It reviews the origins of surgical research and the challenges that need to be overcome if it is to survive, describing the importance of translating research into clinical practice through better trial design, information dissemination and teaching. Finally, this chapter looks to the future of academic surgery and the shape that this may take.
1.1 Introduction

Historically, research has played a crucial role in the advancement of medicine, our understanding of disease processes and the way that we study them. Clinicians and health care professionals across specialties and disciplines now use research in almost every aspect of their working lives in order to guide an evidence-based practice, evaluate the effectiveness of new therapies or demonstrate the efficacy of new health care technologies. The ultimate aim of clinical research is to improve the management that patients receive in order to achieve the best possible outcome for them. Financial support through government-funded grants, charities and the commercial sector has been a key driver for this and has led to the establishment of institutional clinical research units that employ academic clinicians across a range of disciplines. These academics are judged by both the quality and originality of the research their units produce, and are sustained by their fund-raising ability. While this has certainly raised the standard of research through improved trial design, execution and reporting, there remain areas within medicine where both the nature of the disease processes involved and the ethical dilemmas associated
T. Athanasiou (eds.), Key Topics in Surgical Research and Methodology, DOI:10.1007/978-3-540-71915-1_1, © Springer-Verlag Berlin Heidelberg 2010
with studying them require a special appreciation of clinical investigative tools. The study of surgical disease is one such challenging area and has historically led to research that has been largely observational, reflective and retrospective in nature. In modern times, the surgical research community has responded to this by developing novel and effective investigative tools to become, perhaps, one of the most rapidly growing and exciting areas of investigative clinical activity. This chapter highlights the role of surgical research in the advancement of clinical science to date, and outlines the challenges that lie ahead. It sets the scene for the rest of this text, which aims to cover many of the crucial advances in surgical research methodology, thereby providing a platform for the latest breed of academic surgeons to launch their investigative career.
1.2 The Aims of Surgical Research

When considering the challenges faced by academic surgeons and their aims to overcome these, it is important to appreciate the origins of surgical research. Historically, the vast majority of “great surgeons” have been individuals with an undeterred drive to be leaders in their field. They have been stereotyped as inquisitive, challenging and bold individuals with the confidence to succeed as well as a passion for perfectionism through knowledge. The high profile that these individuals have enjoyed maintains itself through the disease processes, procedures and instruments named after them, making them the forefathers of surgery. John Hunter’s (1728–1793) work on human anatomical research through the application of rigorous scientific experimentation [6], Theodor Billroth’s (1829–1894) landmark gastro-intestinal surgical procedures [5] and Alfred Blalock’s (1899–1964) study of shock and the Tetralogy of Fallot [2] serve as examples of the immortality of these surgeons’ contributions. It is through this inquisitive environment that “academic surgery” has seen its birth, with observations stimulating investigative study, publication of practical findings and resulting clinical applications. The nature of investigative research itself has evolved dramatically, with much of what was acceptable before now becoming ethically unjustifiable. A great example of this is the French surgeon Ambroise Paré’s (1510–1590) experiment on Bezoars [9]. It was
at the time a commonly held belief that the antidote to any poison was the Bezoar stone. Sceptical of this hypothesis, Paré designed an opportunistic experiment on a servant on his staff who had been caught stealing and was to be hanged. He convinced the servant to be poisoned instead, with the understanding that immediately after taking the poison, he would also ingest a piece of Bezoar. If he lived, he would be free. The servant died from the poison, and Paré’s observation disproved the hypothesis in a simple yet dramatic fashion. In modern surgical research, such studies can thankfully no longer be ethically justified, yet the underlying inquisitiveness and problem-solving approach of the academic surgeon remain key ingredients of investigative surgical research to this day. The diagnostic, study-design and statistical tools now available to the academic surgeon have improved dramatically. Studies are now designed to appreciate and minimise bias, have clear outcome measures, and are of adequate sample size and power to demonstrate the significance of their findings. The stereotype that modern surgical research remains largely reflective and observational is therefore unjustified and should rapidly be dispelled. Academic surgeons in the twenty-first century carry a strong moral and ethical responsibility that tempers the rate at which their advances can be tested, yet the research they produce reaches a wider audience than ever before through electronic websites and research forums on the world wide web. This has an enabling effect not only within the profession, but also on patients, who are better informed and more sophisticated in the choices they make than at any other point in history. Our challenge now is to treat a medically educated patient population that demands a more “personalised” health care service.
The ultimate aim of research in surgery is to improve the health care of patients through the advancement of knowledge about surgical conditions, interventions, training and new technologies. Academic clinical institutions, research infrastructure, faculty and patients have allied aims (Fig. 1.1). Surgical journals, societies and websites fulfil a much wider role of providing the forum for surgeons to collaborate, learn from each other and disseminate their knowledge and experience. The ability of the academic surgeon to balance the activities of teaching, investigative research and patient care is key to a successful career in surgical research.
Fig. 1.1 Close links between surgical research and clinical practice. [Figure: diagram linking Faculty (investigators and senior investigators, trainees, associates), Infrastructure (universities, hospitals, clinical research networks, clinical research facilities and centres), Research (research projects and programmes, research units and schools) and Systems (research governance systems, research information systems), centred on patients and the public]
1.3 Translating Surgical Research into Practice

The practice of evidence-based medicine promotes the translation of high-quality research into clinical practice. As a result, a very important responsibility is now placed on all types of research, with surgery being no exception. The ease with which research output can now be accessed by health care professionals and patients over the internet means that the influence research has on clinical practice is greater than at any other point in time. The importance that the surgical community itself places on research is well demonstrated by the fact that research is now actively encouraged at a very early stage in surgical training, and is a key part of most residency programmes, their equivalents and fellowships across the world. Examples of this can be seen in the Western World over the last two decades through the intercalation of surgical residency programmes with full-time research degrees. The United States was the first country to adopt the MD/PhD dual degree, with a large number of medical schools across America now offering such combined programmes. The majority of these programmes, however, have historically focussed on medical as opposed to surgical research, largely because the length of surgical training deters many candidates from a research degree, and also because the quality of medical research is deemed to be higher [8]. Despite this, there is now a
very important place in academic surgery for the dual degree in establishing research credibility and launching an investigative surgical career. In the UK, we have seen the emergence of the PhD as the most credible research degree to be undertaken in surgical training. While this has been less integrated when compared with the American dual-degree programmes, a large number of motivated candidates have undertaken this full-time research degree outside of their standard training. Other academic degrees pursued by surgical trainees include the MD and MSc, which vary in duration and intensity. With time, it is likely that the UK will adopt a more integrated academic and clinical surgical training programme through the proposed changes to the British system brought about by the “Modernising Medical Careers” (MMC) initiative [3]. Ultimately, all of this translates into more resources being available to undertake research and more encouragement to take part in it than in previous times, with the days numbered when the academic surgeon practised little clinically, and the clinical surgeon practised little academically. It is encouraging that the number of investigative surgeons looks likely to increase, yet it is still important to see how their work is likely to translate into clinical practice. A key aspect of surgical research has always been the publication of individual experience and the use of new techniques in the form of technical operative notes as well as case series. While the validity of the surgical case report or observational case series and the
true value this adds to clinical practice have been questioned by some [7], the nature of surgery as an extremely practical specialty means that there is clearly a role for this type of research in modern times. Historical examples of case series, such as those published by Joseph Lister showing that carbolic acid solution swabbed on wounds markedly reduced the incidence of gangrene, are remembered for their importance in advancing antiseptic principles [13], but modern times have seen important developments in surgical technology. This is perhaps best illustrated by how laparoscopy originated and was taken up. Initially reported in 1901 by the German surgeon Georg Kelling, who used a cystoscope to perform “koelioscopie” by peering into the abdomen of a dog after insufflating it with air, this was soon followed by Jacobaeus, who published his first report of what he called “Laparothorakoskopie” [14]. From this point, a number of reports of laparoscopic procedures began to emerge, but mainly for diagnostic purposes. The advent of rod-lens optical systems and cold-light fibre-glass illumination meant that laparoscopy could become more widespread, mainly in diagnostic gynaecological procedures. In 1987, the French gynaecologist Mouret performed the first acknowledged laparoscopic cholecystectomy by means of four trocars. The rest is history, and over the last decade, laparoscopy has advanced tremendously to the point where, in the Western World, laparoscopic cholecystectomy has almost eradicated its open counterpart from clinical practice [10]. An enormous amount of randomised and non-randomised comparative research comparing laparoscopy to open surgery for a range of procedures has since been published, with meta-analyses of these trials showing the vast improvement in postoperative recovery that laparoscopy provides the patient.
Research on laparoscopy and its benefits is perhaps the best example in modern times of how surgical technology and techniques can be identified through publication of individual experience, and then trialled in the context of high-quality randomised clinical research, with meta-analysis finally being used to determine the pooled effect of a number of clinical trials. Looking to the future, this trend is set to continue, and we are seeing new technologies such as robotics [1], surgical microdevices [4] and natural orifice totally endoscopic surgery [16] emerging with the potential to transform surgery in a similar way. Randomised controlled trials have long been the pillar of clinical research, representing the highest quality of study in medical practice. The aim of these trials is to randomise patients in an unbiased manner in an attempt to assess the effects of an intervention. In
O. Aziz and J. G. Hunter
surgical research, however, only a minority of studies can achieve a valid randomisation scheme, because of not only the nature of surgical interventions, but also ethical dilemmas. Randomised controlled trials in surgical patients have as a result been much more difficult to perform than in the rest of medicine, especially pharmacological interventions, where the placebo has enabled much of the required blinding to take place. There are, however, other reasons why randomised controlled trials are difficult to undertake in surgery. First, surgical disease often presents in patients who are themselves a very heterogeneous group and often older. For example, assessing the effect of a new drug on generally healthy young adults with essential hypertension is much more straightforward with regard to patient matching than evaluating a surgical intervention such as renal transplantation in older patients with end-stage dialysis-dependent renal failure. Second, the surgical intervention can itself be heterogeneous, varying not only with the experience of the surgeon, but also with the experience of the institution. As a result, surgical multi-centre trials are often unable to account for differences in the skill levels of different surgeons, either between centres or across a country, which limits the applicability of randomised controlled trials to many surgical interventions [15]. These difficulties have had an important impact on the funding that surgical research receives from funding agencies: it is often easier to see how a trial addressing a basic science or pharmacological question may arrive at a solution than it is for a surgical trial.
Ultimately, however, it is the responsibility of the research community to face and try to resolve the uncertainty of clinical surgical research by understanding the nature of disease and using new statistical tools to overcome the challenges of trial design. The use of meta-analytic techniques is perhaps one of the best examples of how these limitations can be overcome. Finally, to the practising surgeon, the way in which all this knowledge is disseminated is of great importance. Surgical journals, societies and websites are excellent examples of academic collaboration, acting as places where people are able to exchange their ideas and compare outcomes. The impact of this type of activity can be almost immediate, resulting in a change in a surgeon's daily practice. The use of surgical websites to disseminate information and experience is a newer phenomenon that continues to be developed, and for the latest generation of surgeons it is the most direct method of learning when and where they need it. Examples of existing resources include anatomy
1 The Role of Surgical Research
websites, operative guides, reference texts and formularies, which are covered later on in this text.
1.4 Challenges Faced by the Twenty-First Century Academic Surgeon

The first consideration for the surgical research community in the early twenty-first century is deciding what type of research surgeons should be undertaking and where. Surgical research has traditionally been divided into clinical and non-clinical, with the latter predominated by the study of basic science. Clinical surgical research can be thought of as patient-orientated, epidemiological, behavioural, outcomes-related and health services research. Patient-orientated research itself can be divided into research on mechanisms of human disease, therapeutic interventions, clinical trials and the development of new technologies. Whilst this classification aims to be as broad as possible, the emergence of new fields and advances in the biological sciences are blurring the boundaries between differing research types. Recent times have seen the emergence of "biomedical science", in which a multi-disciplinary approach to research is adopted, combining both clinical and non-clinical themes. A prime example is biomedical engineering, which is a fusion of medical science, engineering, computer science and biochemistry. When combined, these specialties have led to a more free-thinking approach to research, with transparency in the way that surgical ideas are shared. There is also an important synergy in the opportunities for
competitive grant funding, which are significantly higher with a combined approach. This trend has seen academic institutions across the world setting up biomedical institutes where medical and non-medical researchers work together. At Imperial College London, for example, a newly established Institute of Biomedical Engineering promotes this inter-disciplinary potential in biomedical research, aiming to be an international centre of excellence in biomedical engineering research. It encourages collaboration between engineers, scientists, surgeons and medical researchers to tackle major challenges in modern health care, using enabling technologies such as bionics, tissue engineering, image analysis and bio-nanotechnology. The institution is organised into "Research Themes" designed to attract major funding; these themes are managed by Technology Networks, each headed by a committee drawn from the key researchers in the field from across the university. This trend is rapidly growing across the world, with similar institutes emerging at, for example, Oxford University, the University of Zurich and the University of Minnesota, to name but a few. The establishment of a multidisciplinary approach to research has increased not only the depth and quality of surgical research, but also its appeal to a wider audience. Biomedical research is able to access a significantly larger amount of funding than clinical research alone. What is clear is that research tools are widely disseminated among the surgical community, and the research itself is probably best performed in focused institutions such as Academic Health Science Centres with proven academic pedigree and worldwide credentials (Fig. 1.2).
Fig. 1.2 Characteristics of Academic Health Science Centres: world-class centres deliver against the three missions (patient care, teaching and research) through strength in core capabilities (talent, external partnerships, financial strength, brand, infrastructure and operating processes), founded on an aligned academic–hospital partnership
The second important consideration for the surgical research community is how it funds itself. When compared to other forms of medical research, surgery has to date not been able to attract similar quantities of funds, partly owing to a lack of influence exerted by academic surgical leaders in making the case for funding surgical over medical research. It has also been the case that randomised trials receive funding preferentially compared to non-randomised research, with the former being more challenging in surgery than in medicine [15]. The lack of surgeons as research leaders on selection and grant-awarding committees is a problem that needs to be addressed. It is known that the National Institutes of Health (NIH), the major source of biomedical funding in the United States, conveys a much less welcoming attitude towards surgical research than towards other types of clinical or basic science research. There is evidence to suggest that since 1975, surgeons have been disproportionately less successful than researchers in other clinical disciplines in obtaining funding [11]. At the NIH, the principal decisions for peer review of research and selection for grant funding are made by groups of about 10–20 individuals with expertise in their given field. To date, there have been few study sections devoted to surgically oriented clinical research, with only two out of a hundred study sections in which surgeons form even a reasonable minority of the committee members. It is not surprising, therefore, that in comparison with medical research departments, grant proposals from surgical departments are less likely to be funded, and if they do receive funding, it is likely to be a smaller amount. The most likely cause of this problem is that the surgical profession has failed to address the challenge of developing and sustaining an adequate research workforce [15].
The third consideration for the surgical research community is training surgeons to take on the skills required to undertake and interpret their research. While we have discussed the role of formal research programmes in surgical training, on a more general note, there is a need to provide a structure, organisation and oversight to the research training that all surgeons receive. It is important to instil the scientific disciplines that form the cornerstone of all basic, translational or clinical surgical research into all surgical residents, registrars, fellows and consultants.
All surgical residency programmes should offer research education to their trainees in the form of biostatistics, bioethics, experimental design and the principles of clinical research, including clinical trials and database management. Education of the surgical workforce as a whole is key to reaching a situation in which all involved in clinical surgical practice can read, listen and think critically with scientific discipline. A number of possible organisational routes have been described for training surgeon scientists:
1. A period of full-time clinical experience (usually 2 years of a residency or its equivalent), followed by 1–3 years of research, followed by another period of full-time clinical experience to complete the residency. This has historically been the most common way to undertake a full research degree.
2. A period of full-time clinical experience, followed by integrated research and clinical training to complete the residency.
3. A full clinical residency followed by several years of full-time research.
In reality, the more integrated a research programme the better, as the aspiring surgeon scientist may then have opportunities to obtain a faculty position in the research department in which they are based, and initiate a research programme without a potentially disruptive 2–3 year lapse. Through this process, it is also desirable that the aspiring surgeon scientist be supported during their time of vulnerability to intrusions by clinical practice, teaching and administrative responsibilities. Finally, the academic surgeon faces an interesting dilemma: in devoting their time to academic research, academic surgeons sacrifice time they would otherwise spend in active clinical practice, which is important in gaining experience and credibility amongst the surgical community.
This is especially challenging because, at the same time, the academic surgeon is also judged as a researcher by the research institution that employs them, namely through their research output and the grants they generate. It is simply a case of "publish or perish", requiring a fine balancing act to maintain excellence in both trades [12]. The best-respected academic surgeons are often known to be both excellent surgeons and successful academics.
1.5 The Role of the Academic Surgeon in Teaching

Surgical education, at both undergraduate and postgraduate levels, has traditionally been left to the academic surgeon to deliver, usually through their employing university. At all levels, delivering surgical education is an extremely important part of academic surgery, and one that is extremely poorly rewarded at present. Academic surgeons are often not remunerated proportionately for the time they spend teaching, and the time they are allocated for this purpose often clashes with both their clinical and academic commitments. As a result, the standard of education delivered by academics varies dramatically from one individual to the next. In addition, most academics are judged by their research output (high-impact journal papers, conference presentations) and the amount of grant funding they are able to attract. Teaching is very poorly rewarded, and as a result, academic surgeons want to spend as little time as possible delivering it. In the UK, all universities undergo a research assessment exercise (RAE) that rates the quality of each institution's research against international standards. This information is then used to determine the research grant that the institution will receive from one of four government funding agencies. The process is a very arduous and thorough one, but it does not reward teaching in the same way that it rewards research, contributing to the above-mentioned effect.
1.6 The Future of Surgical Research

At the dawn of the twenty-first century, academic surgery finds itself at a very important crossroads. This is a time when academic surgeons are expected to be clinicians, surgeons, researchers, statisticians, educators and entrepreneurs, generating the funds required to undertake their research with excellence. It is also a technological age in which new advances are being made at a frighteningly rapid pace, with new surgical instrumentation and tools arriving almost daily. The "technology push" that we face from the medical device sector is immense and largely financially driven, yet it should be seen as a real opportunity. Surgeons of today have the chance to become inventors in a fashion that only the great forefathers of surgery were able to, and should play an active role in technology development, patenting and commercialisation. It is an environment where innovation is set to thrive, but it will only do so if it is encouraged. The future of academic surgery ultimately lies in the hands of its leaders, namely the professors of surgery, heads of academic surgical departments and surgical department chairpersons, who have a responsibility to protect and support their young investigators so that they may set up productive programmes funded with peer-reviewed grant support. The most vulnerable time for these investigators is within the first 5 years of completing their training, during which they require guidance, high-quality training, discipline and critical scientific rigour. Integrating an academic surgical career with a clinical one is also key, because one cannot exist without the other, and academic surgeons should be protected as they learn to juggle the responsibilities of a joint clinical and academic career. Finally, academic surgical programmes need substantial expansion into clinical trials through the development of fellowships in clinical research, leading clinical trial programmes, outcome studies, database research and quality improvement programmes. The future of academic surgery looks bright, but lies firmly in our own hands.

References
1. Berryhill R, Jhaveri JJ, Yadav R et al (2008) Robotic prostatectomy: a review of outcomes compared with laparoscopic and open approaches. Urology 72:15–23
2. Blalock A, Taussig HB (1984) Landmark article May 19, 1945: the surgical treatment of malformations of the heart in which there is pulmonary stenosis or pulmonary atresia. By Alfred Blalock and Helen B. Taussig. JAMA 251:2123–2138
3. Brennan PA, McCaul JA (2007) The future of academic surgery – a consensus conference held at the Royal College of Surgeons of England, 2 September 2005. Br J Oral Maxillofac Surg 45:488–489
4. Chang WC, Sretavan DW (2007) Microtechnology in medicine: the emergence of surgical microdevices. Clin Neurosurg 54:137–147
5. Ellis H (2008) The first successful gastrectomy. J Perioper Pract 18:34
6. Evans CH (2007) John Hunter and the origins of modern orthopaedic research. J Orthop Res 25:556–560
7. Horton R (1996) Surgical research or comic opera: questions, but few answers. Lancet 347:984–985
8. Jones RS, Debas HT (2004) Research: a vital component of optimal patient care in the United States. Ann Surg 240:573–577
9. Nau JY (2007) A great humanitarian and surgeon: Ambroise Paré [article in French]. Rev Med Suisse 3:2923
10. Polychronidis A, Laftsidis P, Bounovas A et al (2008) Twenty years of laparoscopic cholecystectomy: Philippe Mouret – March 17, 1987. JSLS 12:109–111
11. Rangel SJ, Efron B, Moss RL (2002) Recent trends in National Institutes of Health funding of surgical research. Ann Surg 236:277–286; discussion 286–287
12. Souba WW, Wilmore DW (2000) Judging surgical research: how should we evaluate performance and measure value? Ann Surg 232:32–41
13. Tan SY, Tasaki A (2007) Joseph Lister (1827–1912): father of antisepsis. Singapore Med J 48:605–606
14. Vecchio R, MacFayden BV, Palazzo F (2000) History of laparoscopic surgery. Panminerva Med 42:87–90
15. Weil RJ (2004) The future of surgical research. PLoS Med 1:e13
16. Zehetner J, Wayand WU (2008) NOTES – a new era? Hepatogastroenterology 55:8–12
2
Evidence-Based Surgery Hutan Ashrafian, Nick Sevdalis, and Thanos Athanasiou
Contents

2.1 Introduction 9
2.2 What Is Evidence? 10
2.3 Hierarchy of Evidence 10
2.4 Definition and Values 11
2.5 Benefits of Evidence-Based Surgery 11
2.6 History and the So-Called "Discord" Between Surgery and Evidence-Based Practice 12
2.7 Principles of Identifying the Evidence 13
2.8 Sources of Evidence 14
2.9 Managing the Increasing Volume of Evidence 15
2.10 Practising and Delivering the Evidence 15
2.11 Surgical Modelling and Treatment Networks 16
2.11.1 Surgical Thinking and Evidence Synthesis 17
2.11.2 Bayesian Techniques 17
2.11.3 Qualitative Comparative Analysis (QCA) 18
2.12 Surgical Decision-Making and Clinical Judgement Analysis 18
2.13 Cost Effectiveness in Evidence-Based Surgery 20
2.14 Surgical Training in Evidence-Based Techniques 22
2.15 Ethics 22
2.16 Conclusion 23
References 25
H. Ashrafian ()
The Department of Biosurgery and Surgical Technology, Imperial College London, 10th Floor, Queen Elizabeth the Queen Mother (QEQM) Building, Imperial College Healthcare NHS Trust, St Mary's Hospital Campus, Praed Street, London W2 1NY, UK
e-mail: [email protected]

Abstract Evidence-based surgery (EBS) involves the integration of the best clinical and scientific evidence to treat patients. Best evidence is derived from the research literature and can be categorised into a hierarchy of levels. Application of the knowledge derived from "best evidence" results in enhanced care for patients and also improved standards for surgeons and health care institutions. The sources of surgical evidence are discussed, and we review surgical training in evidence-based practice. Techniques for answering a surgical question using an evidence-based method and for practising surgery in an evidence-based environment are described. Furthermore, we examine the role of treatment networks, cost-effectiveness, evidence synthesis, surgical decision making and the ethics of EBS. For the current and future surgeon, evidence-based practice is now an inevitable and fundamental component of the surgical profession. Its universal adoption will play an important role in the advancement of patient care and surgery worldwide.
2.1 Introduction

Surgery has traditionally been considered a craft wherein individuals adopted techniques didactically from their teachers and performed each operation in a particular way because "that is how it was taught" to them. Throughout history, surgical practice has depended on learning through one's own mistakes or those of others. Although a handful of exceptions did exist, such as the testing of medical efficacy by the eleventh-century physician Avicenna [1], it was not until the late twentieth century that the concept of evidence-based medical practice came to fruition [2]. The concept of evidence-based practice involves the integration of the best available evidence to treat patients.
T. Athanasiou (eds.), Key Topics in Surgical Research and Methodology, DOI:10.1007/978-3-540-71915-1_2, © Springer-Verlag Berlin Heidelberg 2010
H. Ashrafian et al.
It has become an inevitability in surgery and is now a requirement of all modern health care institutions, surgeons and patients alike [3]. This chapter aims to clarify the role of evidence-based surgery (EBS) and to contextualise its part in current and future surgical practice. As Birch and colleagues [4] stipulated, “It is no longer acceptable for a surgeon to be estranged from the current literature – the demands of colleagues, licensing bodies, and patients necessitate satisfactory knowledge of the best available evidence for surgical care”.
2.2 What Is Evidence?

The Oxford English Dictionary defines evidence as "information or signs indicating whether a belief or proposition is true or valid" [3]. As surgeons, we already apply this definition to our daily practice: when we assess patients clinically, we all rely on our ability to draw clinical evidence from clinical signs and investigations, in much the same manner as Hippocrates did two and a half thousand years ago.
The word "evidence" in the context of the evidence base has a more specialised designation: EBS involves "the systematic, scientific and explicit use of current best evidence in making clinical decisions" [5].
2.3 Hierarchy of Evidence

The evidence used in EBS is derived from published scientific research. To allow for a comparative evaluation of research data, hierarchies of evidence have been developed, which rank evidence according to its validity. Initially, randomised controlled trials (RCTs) were considered to provide the highest level of evidence, as they were deemed more valid by decreasing research bias and analysis error when compared to single case reports and retrospective cohort reviews. More recently, however, the Centre for Evidence-Based Medicine at Oxford produced a hierarchy of evidence types (Fig. 2.1), which classifies non-experimental information at the lowest levels of evidence and randomised controlled trials at a much higher level. The highest level of evidence corresponds
Fig. 2.1 Hierarchy of evidence pyramid based on the levels advocated by the Oxford Centre for Evidence-Based Medicine (May 2001) [6]:
I. Systematic reviews of multiple RCTs
II. Properly designed RCT of appropriate size
III. Well-designed trials such as non-randomised trials, cohort studies, time series or matched case-controlled studies
IV. Well-designed non-experimental studies from more than one centre or research group
V. Opinions of respected authorities, based on clinical evidence, descriptive studies or reports of expert committees
to evidence sources wherein data from multiple randomised trials are integrated and appraised in the form of meta-analyses and systematic reviews.
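The ordering idea behind such a hierarchy can be sketched in code. The following is a purely illustrative sketch (the enum names and the example study list are invented for illustration and do not come from the text): lower-numbered levels represent stronger evidence, so selecting the "best evidence" among available studies reduces to taking the minimum level.

```python
from enum import IntEnum

class EvidenceLevel(IntEnum):
    """Levels loosely following the hierarchy above; lower number = stronger evidence."""
    SYSTEMATIC_REVIEW = 1   # I: systematic reviews of multiple RCTs
    RCT = 2                 # II: properly designed RCT of appropriate size
    NON_RANDOMISED = 3      # III: non-randomised trials, cohort, case-control
    NON_EXPERIMENTAL = 4    # IV: multi-centre non-experimental studies
    EXPERT_OPINION = 5      # V: opinions of respected authorities

def best_available(studies):
    """Return the (description, level) pair with the strongest evidence level."""
    return min(studies, key=lambda s: s[1])

studies = [
    ("Single-centre expert opinion", EvidenceLevel.EXPERT_OPINION),
    ("Multi-centre RCT", EvidenceLevel.RCT),
    ("Meta-analysis of 12 RCTs", EvidenceLevel.SYSTEMATIC_REVIEW),
]
print(best_available(studies)[0])  # → Meta-analysis of 12 RCTs
```

Because `IntEnum` members compare as integers, evidence levels can be ranked directly, which mirrors how a reviewer would prioritise a meta-analysis over a case series.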
2.4 Definition and Values

The broad definition of EBS has been stipulated as "the integration of:
• Best research evidence (clinically relevant research, basic science, relating to diagnosis, treatment and prognosis)
with
• Clinical expertise (skills and experience adapted to a particular patient) and patient values (patient preference and attitudes to clinical entity and its overall management)" [7].

This broad definition can then be divided into two subcategories [8]:
1. Evidence-based surgical decision making – in which best evidence is applied to an individual or a finite group of surgical patients.
2. Evidence-based surgical guidelines – in which best evidence is applied at an institutional or national/international group of surgical patients.

In order to apply best evidence (Fig. 2.2), raw data need to be processed by a number of different types of knowledge. These include the following:
• Knowledge from research (refined data or evidence)
• Knowledge of measurement (statistical methodology)
• Knowledge from experience (judgements and decisions)
• Knowledge of practice (leadership and management)
Thus, in order to carry out an evidence-based action, one needs to process raw clinical information through all four knowledge types. An example would be a patient presenting with right iliac fossa pain and leucocytosis.
As a clinician, one would use knowledge from research and experience to perform further investigations and make a provisional diagnosis. These data would be contextualised with data from the published literature as to the accuracy of the diagnosis and the best course of action based on the knowledge of measurement. If a surgical indication such as the need for appendicectomy was deduced, the knowledge of leadership and management would need to be applied to ensure the progression to operative management.
2.5 Benefits of Evidence-Based Surgery

It has been predicted that "applying what we know already will have a bigger impact on health and disease than any drug or technology likely to be introduced in the next decade" [9]. The correct application of this knowledge is, therefore, critical to the development of future health care strategies and treatments. This has led to the concept that "knowledge is the enemy of disease", a metaphor that is both intuitive and increasingly applied by the National Knowledge Service of the United Kingdom's National Health Service [10]. Adopting a similar aim, the American College of Surgeons has introduced the continuous quality improvement (CQI) committee [11] (initially known as the Office of Evidence-Based Surgery) [12], to promote the highest standards of surgical care through the evaluation of surgical outcomes in clinical practice. There is, however, a large discrepancy between what we know and what we currently apply [9], and EBS is a method by which this discrepancy can not only be bridged, but also built upon to fundamentally improve surgical health care outcomes. Many so-called surgical standards and customs are based on little if any evidence. For example, the use of post-operative wound drains [8, 13] and nasogastric tubes [14] is largely determined by habit and surgical apprenticeship as opposed to best evidence.
Fig. 2.2 Application of knowledge in the process of transforming raw data into evidence-based action
Furthermore, part of the reason why such a gulf exists between knowledge and surgical application lies in the traditional philosophy of local and national health care policy. Here, the traditional modus operandi was defined by a practice that aimed to "minimise costs", rather than address why we practise surgery in the first place, namely to benefit the patient. The future of health care, therefore, lies in a system built on the fundamental concept of value to patients (the so-called "value-based" system). This model of health care can improve patient outcomes, quality of life, satisfaction and also financial cost [15]. At the heart of this value-based system is the practice of evidence-based surgery, which will enable care to be targeted at surgical diseases and patients, as opposed to the "old-fashioned" concept of treating patients by the "speciality" of their surgeons. Here, there will be a focus on risk-adjusted outcomes to improve patient care and satisfaction (Fig. 2.3). Furthermore, such a system will allow the enhancement of surgical training (particularly in an era of decreased training hours), improve surgical satisfaction and empower surgeons through improved leadership, management and decision making (Fig. 2.3).
2.6 History and the So-Called "Discord" Between Surgery and Evidence-Based Practice

The historical perception that surgeons were unable to successfully apply evidence-based practice is not completely true. Many of the developments that led to its introduction in the twentieth century were spearheaded by surgeons. Building on the ideas of the British physician
Sir Thomas Percival (1740–1804), the American surgeon Ernest Codman (1869–1940) established the first tumour registry in the United States in order to follow up patient outcomes, identifying the most successful aspects of tumour treatment. He was a notable health care reformer and is credited with introducing outcomes management in patient care through an "end results system" of following up patients and their outcomes, essentially creating the first patient health care database. He introduced the first mortality and morbidity meetings and contributed to the founding of the American College of Surgeons and its Hospital Standardization Program [16]. James Alison Glover (in the 1930s in the UK) and later Jack Wennberg (in the 1970s in the USA) revealed that the mass adoption of tonsillectomies was unrelated to the occurrence of tonsillar disease, which led to a change in surgical practice in keeping with actual disease rates [17]. The famed cardiovascular surgeon Michael DeBakey reported on the overall surgical experience in World War II, having worked for the Army Surgeon General's Office; his work described injury incidence, surgical management and analysis of outcomes [18]. Although our modern concept of evidence-based medicine was first described by the Scottish physician Archie Cochrane in 1972 [19], the formal adoption of "best evidence" occurred approximately 20 years later, through Gordon Guyatt and David Sackett in the 1990s [2]. In the interim, however, some surgeons attempted to introduce evidence-based practice by publishing and acting on surgical outcomes through the Health Care Financing Administration (HCFA) [20]. Despite these significant contributions, surgeons have been criticised with statements such as "a large proportion of the surgical literature is of questionable value", and that surgeons perform in a "comic opera" in
Fig. 2.3 Evidence-based surgery at the centre of a value-based health care system (benefits for the patient, the surgeon, the institution and the wider health care system)
which "they suppose are statistics of operations" [21]. These indictments of the surgical fraternity have come about because surgical research has historically relied on publishing data deemed to be of the "weakest evidence", based mainly on case series as opposed to randomised trials and meta-analyses. The root causes have been assessed in the literature [22]:
• Historical reasons (many operations progressed by small steps without RCTs).
• Patients do not want their operations to be selected by randomisation (patient's equipoise).
• Operative variation makes standardisation of a surgical arm of an RCT difficult.
• Urgent and emergency surgery makes inclusion into an RCT difficult.
• The learning-curve effect creates difficulty in analysing RCTs.
• Many surgeons do not have an adequate grounding in medical statistics.
• Surgical RCTs have traditionally been poorly funded, whereas drug and technology companies have been more forthcoming in funding trials of their own devices or procedures.
• Surgeons have traditionally been poor at adopting RCTs as they have surgical equipoise and self-assuredness – "my operation is better".
The role of EBS is not to encourage a wider use of RCTs alone, but more importantly to arrive at surgical treatments that are considered "best evidence". This does not always require an RCT and can include a wide corpus of other data that can be interpreted to select the best treatment for each patient.
2.7 Principles of Identifying the Evidence

In order to carry out EBS, four components need to be achieved, the so-called “Four Steps of EBS” [7]:
1. Formulate a question based on a clinical situation encountered in daily practice.
2. Do a focused search of the relevant literature.
3. Critically appraise the literature obtained to find the best evidence.
4. Integrate the information and act in accordance with the best available evidence.
Finding the best evidence is an essential skill for all modern surgeons. One well-applied method of performing a search is the “PICO” technique [5, 23], developed at McMaster University, where evidence-based methods have been taught for almost 30 years. Empirical studies demonstrate that PICO use improves the specificity and conceptual clarity of clinical problems, allows for more complex search strategies and results in more precise search results [5, 12]. A PICO question comprises:
P: Patient population – the group for which you need evidence
I: Intervention – the operation or treatment whose effect you need to study
C: Comparison – what is the evidence that the proposed intervention produces better or worse results than no intervention or a different type of intervention?
O: Outcomes – what are the effects and end-points of the intervention?
When asking a PICO question, it is important to:
• Be specific in your question(s)
• Prioritise your questions
• Ask answerable questions
A poor question would be: “Is coronary artery bypass grafting better than angioplasty?” A better question is: “In diabetic Asian women aged over 75 with chronic stable angina and three-vessel coronary artery disease, does off-pump coronary artery bypass grafting compare more favourably with percutaneous coronary intervention in terms of long-term mortality and physical quality of life?” This latter question can be derived from our PICO table:

P – patient: Asian women aged over 75 years with diabetes
I – intervention: off-pump coronary artery bypass
C – comparison: percutaneous coronary intervention
O – outcomes: long-term mortality and quality of life

The next step is to work through the finding-evidence sequence (Fig. 2.4).
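The PICO components map naturally onto a Boolean literature search. As a hedged sketch (the helper function, field groupings and search terms below are invented for illustration and do not represent a validated PubMed strategy), the components can be assembled into a single query string:

```python
# Hypothetical sketch: assembling a Boolean search string from PICO
# components. Terms and structure are illustrative only.

def build_pico_query(population, intervention, comparison, outcomes):
    """Combine PICO components into a single Boolean search string."""
    groups = []
    for terms in (population, intervention, comparison, outcomes):
        if terms:  # skip any empty component
            groups.append("(" + " OR ".join(terms) + ")")
    return " AND ".join(groups)

query = build_pico_query(
    population=["diabetes", "Asian women"],
    intervention=["off-pump coronary artery bypass"],
    comparison=["percutaneous coronary intervention"],
    outcomes=["mortality", "quality of life"],
)
print(query)
```

Each PICO component is OR-combined internally (synonyms broaden recall) and the components are AND-combined with each other (intersection sharpens precision), which is why PICO-structured questions translate into more precise searches.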
H. Ashrafian et al.

[Fig. 2.4 flowchart: formulate a PICO question → do guidelines already exist (own hospital, local, national, international)? → do systematic reviews exist in relevant databases (Cochrane Library, Health Technology Reviews, PubMed)? → do studies exist that answer the PICO question? Examples: treatment comparison – randomised controlled trial(s); disease diagnosis – prospective study; treatment prognosis – cohort study; treatment harm – case series or case report; cost benefit – economic study; quality of life – questionnaire or qualitative study.]

Fig. 2.4 Sequence of finding evidence in EBS based on McCulloch and Badenoch [24]
2.8 Sources of Evidence

Sources of evidence can be selected from any of the categories of the “hierarchy of evidence” mentioned earlier. While traditionally these would have been discerned only from printed journal papers, many sources are now invariably available online through local intranets and the wider internet. These sources (Fig. 2.5) can be categorised as primary sources (journal articles, which exist in large numbers) or secondary sources (which exist in a large variety). Primary sources provide direct evidence or data concerning a topic under investigation, whereas secondary sources summarise or add commentary to primary sources. As depicted in the figure, there is a hierarchy of secondary sources to match the hierarchy of evidence (Fig. 2.1). Thus, for example, textbooks are on the first rung of the secondary-sources pyramid, whereas the Cochrane Database of Systematic Reviews is at the peak, as it contains information deemed to come from the highest level of evidence. By far the most frequently used database among surgeons is PubMed [26], a free internet-based search engine for accessing the MEDLINE database of citations and abstracts of biomedical research articles. The vast majority of citations are from 1966 onwards, although they reach back as far as 1951. PubMed is not the only search engine used, and many surgeons may start a search using Google Inc. search engines such as standard Google [27] or Google Scholar [28]. Alternatively, they may go straight to a government guideline website or even a systematic review database.
[Fig. 2.5 pyramid of sources. Secondary sources, from peak down: systematic reviews (Cochrane Library, DARE); critically appraised topics (CATs: CAT Crawler, Clinical Evidence Database, institution-specific CATs, national/international guidelines); critically appraised articles (ACP Journal Club, Trip Database); textbooks. Primary sources: journal articles (Medline, EMBASE via PubMed, Ovid, Silver Platter, Dialog/Datastar).]

Fig. 2.5 Hierarchy of the sources of evidence in EBS based on McCulloch and Badenoch [24] and the University of Virginia, Health Sciences Library [25]
When reading the evidence, it is important not to lose sight of scientific rationality, so as to acquire best evidence for best patient outcomes. The concepts of information validity need to be rigorously assessed. These include:
• Bias
• Sample and study sizes
• Methodology
• Relevance
2.9 Managing the Increasing Volume of Evidence

In 2000, the non-profit medical research organisation OpenClinical produced a white paper entitled “The medical knowledge crisis and its solution through knowledge management” [29]. It recognised that “few busy doctors have the time to do the reading that is necessary and, even if they are assiduous in their reading, the imperfect human capacity to remember and apply knowledge in the right way at the right time and under all circumstances is clearly a critical limitation”. As a result, it specified a number of fundamental points:
• It is now humanly impossible for unaided health care professionals to possess all the knowledge needed to deliver medical care with the efficacy and safety made possible by current scientific knowledge.
• This can only get worse in the post-genomic era.
• A potential solution to the knowledge crisis is the adoption of rigorous methods and technologies for knowledge management.
In order to combat this “information overload”, a number of governmental bodies and medical groups propose knowledge management methods to enable clinicians to cope with the large volume of data [10, 22, 27]:
• Clinicians may need to become more super-specialised, with an intimate knowledge of their own field.
• Multi-disciplinary teams, conference attendance and postgraduate teaching can help spread the most “important” evidence.
• Local, national and international guidelines can be an easy-to-reach source of best evidence.
• Increased use of free-to-use governmental and institutional internet “web-portals” will make best evidence easily accessible.
• Encouraging communication with interactive medical media and medical library facilities will facilitate evidence-based searching and practice.
To achieve universal application of these evidence-based knowledge management principles, there is a general requirement for the whole surgical fraternity to adopt an “evidence-based culture”. This requires the support and acceptance of evidence-based teaching and practice at all levels of surgery, whether in medical schools, national hospitals, private practice or academic surgical units. As a result, individual clinicians can be kept constantly informed with the most up-to-date knowledge, whilst also ensuring that future generations of surgeons are satisfactorily educated to achieve successful outcomes using the best evidence.
2.10 Practising and Delivering the Evidence

The traditional practice of EBS centred on individual clinicians taking responsibility for their own evidence-based practice, whether through incentives to complete optional medical education accreditation for professional appraisals, or through awareness of local or national guidelines. This optional adherence to evidence-based practice has led to wide variability in its application [5]. It has recently been proposed that “apart from health care professionals, the health care system itself and its influence on the delivery of care need to be considered” [5]. Measuring performance intrinsically allows the setting of a standard, which can then be compared against another standard, permitting a so-called system of “benchmarking” to be introduced. If the benchmarking measures are widely accepted, improved techniques in evaluating volume-output relationships and health inequality data can reveal important clinical outcome factors that can be improved upon. Implementing these changes requires the involvement of senior leadership, health care management and hospital boards to advance and promote the evidence-based culture. This will facilitate innovative organisational design and structure whilst benefiting from information management and technology [30].
2.11 Surgical Modelling and Treatment Networks
When studying the surgical evidence, most systematic reviews and meta-analyses concentrate on one procedure in isolation or on comparing one treatment with another. This methodology yields an incomplete view of the treatments available for each condition, and does not deliver a holistic comparison of all the treatments available for each specific surgical disease [31]. The corpus of different treatments available for each condition can be considered to comprise a “treatment network”, in which each treatment can be represented in a common frame of reference against all the others. Recently, mathematical techniques such as “multiple treatment” comparisons and “mixed treatment meta-analysis” have been introduced to compare the data for medical treatments and interventions within such networks [16, 29]. Applying these techniques to study networks has resulted in two conjectures:
1. Any representation of the evidence network needs to account for the fact that all the treatments have not been studied equally.
2. Any representation of the evidence network needs to account for the fact that there are varying amounts of published data for different treatment comparisons.
In order to create a visually representative model of comparing treatments that accounts for these two conjectures, Salanti et al. [32] have devised a “geometry of the network”. Here, they specify that the “network’s geometry may reflect the wider clinical context of the evidence and may be shaped by rational choices for treatment comparators or by specific biases”. In order to create a visual representation of all studies for a specific disease, two factors need to be considered:
Diversity – the number of treatments for a specific condition.
Co-occurrence – the relative frequency with which two specific treatments have been compared against each other.
Applying these factors, it is possible to represent all of the evidence for a specific condition (Fig. 2.6). Here, a line between two dots represents a comparison of two treatments. The thicker the line, the larger the number of studies comparing the treatments (co-occurrence). The larger the number of lines in a diagram, the greater the diversity. An example is the pharmacological treatment of LUTS – lower urinary tract symptoms (Fig. 2.6a). Here the centre of the image or “star” is placebo (h), to which all the treatments a–g are compared. The lines for a (α-adrenoceptor antagonists) and b (5-α-reductase inhibitors) are thicker than those for f (antimuscarinics) and g (PDE5 inhibitors), as the former two are the subject of more studies, being older drugs for this indication. Accordingly, lines f and g are in turn thicker than c, d and e, as these latter
Fig. 2.6 (a) Treatment network geometry for the pharmacological treatment of LUTS – lower urinary tract symptoms. a: α-adrenoceptor antagonists, b: 5-α-reductase inhibitors, c: luteinizing hormone releasing hormone antagonists, d: β-3-adrenoceptor agonists, e: vitamin D3 agonists, f: antimuscarinics, g: PDE5 inhibitors, h: placebo. (b) Treatment network geometry for the laparoscopic surgical treatment of morbid obesity. a: vertical banded gastroplasty, b: gastric banding, c: sleeve gastrectomy, d: roux-en-y gastric bypass, e: duodeno-jejunal bypass, f: no surgery
three are much newer drugs with only a few studies considering their use. Another example is the laparoscopic surgical treatment of morbid obesity (Fig. 2.6b). Here, one can see that there are many studies comparing a (vertical banded gastroplasty), b (gastric banding) and d (roux-en-y gastric bypass) to f (no surgery). However, there are also many studies comparing b (gastric banding) and d (roux-en-y gastric bypass) directly, in the absence of a comparison to f (no surgery), leading to a non-star shape. As c (sleeve gastrectomy) is a newer procedure, it is more commonly compared with established treatments such as b (gastric banding) and d (roux-en-y gastric bypass) and less so with f (no surgery), hence the thinner line c–f. Being a much older procedure, e (duodeno-jejunal bypass) is not commonly performed laparoscopically, and hence has not been extensively compared with the other laparoscopic procedures, explaining the thin line e–f. The application of these geometric networks to surgery allows individuals to visualise in one diagram the overall treatment evidence for a particular condition. This empowers each individual surgeon to assess specific studies within the context of all known treatments for a condition. Furthermore, it also allows mathematical applications to place values on the strength of the levels of evidence for each treatment.
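The two network properties above translate directly into a simple data structure. The following sketch is a hypothetical illustration (the class, treatment names and study counts are invented, loosely echoing Fig. 2.6b, and are not taken from any real dataset): nodes are treatments, and each undirected edge carries the number of studies comparing a pair.

```python
# Illustrative treatment-network "geometry": diversity is the number of
# treatments (nodes); co-occurrence is how often a pair has been
# directly compared (edge weight). All figures are hypothetical.

from collections import defaultdict

class TreatmentNetwork:
    def __init__(self):
        # (treatment_a, treatment_b) sorted pair -> number of studies
        self.studies = defaultdict(int)

    def add_comparison(self, a, b, n_studies=1):
        key = tuple(sorted((a, b)))  # undirected edge
        self.studies[key] += n_studies

    def diversity(self):
        """Number of distinct treatments in the network."""
        return len({t for pair in self.studies for t in pair})

    def co_occurrence(self, a, b):
        """How often two treatments have been directly compared."""
        return self.studies.get(tuple(sorted((a, b))), 0)

net = TreatmentNetwork()
net.add_comparison("gastric banding", "no surgery", 12)
net.add_comparison("roux-en-y bypass", "no surgery", 15)
net.add_comparison("gastric banding", "roux-en-y bypass", 9)
net.add_comparison("sleeve gastrectomy", "gastric banding", 3)

print(net.diversity())                                     # 4 treatments
print(net.co_occurrence("no surgery", "gastric banding"))  # 12 studies
```

A thick line in Fig. 2.6 corresponds to a large edge weight here; a non-star shape appears whenever edges exist between non-placebo (or non-"no surgery") nodes, as with the banding-bypass comparison above.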
2.11.1 Surgical Thinking and Evidence Synthesis

As previously described, a good method of attaining good-quality quantitative evidence for EBS is the application of meta-analysis and systematic reviews to integrate the results of high-quality clinical trials. However, not all surgical evidence is in the format of high-quality randomised trials, and for some surgical questions it may never be practical or ethically appropriate to obtain randomised controlled data. As a result, applying traditional methods of comparing trials by meta-analysis may not always be possible; furthermore, publication of data based on these techniques alone can lead to a bias in the literature (focusing only on questions for which RCT data can be accrued). In order to accommodate this lack of data whilst also applying the concept of best evidence, a technique known as “evidence synthesis” has been introduced. Here, statistical models are employed to combine multiple sources of quantitative data from trials of heterogeneous quality and design (Fig. 2.7). Furthermore, these techniques also allow qualitative data to be included in the mathematical analyses. This technique therefore represents a paradigm shift, adding traditional psychosocial and economic research to trial evidence to allow a holistic data analysis. Evidence synthesis is increasingly applied in the assessment of health care outcomes and technology, and now has an expanding role in the National Health and Medical Research Council of Australia [33] and the United Kingdom’s National Institute for Health and Clinical Excellence [34]. Many of the recent advances of evidence synthesis in health care assessment have resulted from the application of Bayesian statistical theory, where the mathematical techniques allow the supplementation and enhancement of conventional meta-analysis. Other techniques include Qualitative Comparative Analysis, wherein qualitative data can be modelled into mathematical values to facilitate knowledge comparison and analysis. These methods account for study quality in evidence synthesis, can incorporate a broader body of evidence to support decision-making, and therefore successfully address many analytical problems in EBS, including baseline risk effects, study heterogeneity and indirect comparisons.

2.11.2 Bayesian Techniques

The perceived current “gold standard” of evidence-based research is information in the form of randomised controlled trials or systematic reviews appraising them. These study designs, however, are based on a “frequentist” school, where results are expressed as a probability value (p value) derived from a model in which an infinite number of hypothetical repetitions occur on a distribution, with an unknown effect value that can only be estimated with confidence intervals [35]. Another approach is the Bayesian one, where the analysis combines the observed data with a prior belief to yield a credible estimate. This is now increasingly used in medicine and is defined as the “explicit quantitative use of external evidence in the design, monitoring, analysis, interpretation and reporting” of research.
[Fig. 2.7 diagram: quantitative observational studies (randomised controlled trials, prospective cohort or case-control studies, retrospective cohort or case-control studies) and other studies (case series or case reports, anecdotal evidence, unpublished data, surveys and policy data, qualitative data) feed into evidence synthesis, which informs a decision.]

Fig. 2.7 Evidence synthesis and the numerous sources that it can integrate and analyse
In order to calculate a probability using Bayesian statistics (Fig. 2.8), the following parameters are used [36]:
1. Prior distribution – the probability of a parameter based on previous experience and trial data.
2. Likelihood – the probability of a parameter based on data from the current research study or trial.
3. Posterior distribution – the updated probability of a parameter based on our observation and treatment of the data.
These techniques have proven particularly useful in [37]:
• Grouped meta-analysis (cross-design synthesis)
• Cost-effectiveness studies
• Comprehensive decision modelling
• Decision analysis
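The three parameters above can be made concrete with a minimal worked sketch. The numbers and the choice of a Beta-Binomial conjugate model are invented for illustration (they do not come from the chapter): a prior belief about a complication rate is updated with hypothetical trial data to give the posterior.

```python
# Minimal sketch of a Bayesian update (posterior is proportional to
# prior x likelihood) using the Beta-Binomial conjugate pair.
# All numbers are hypothetical.

def beta_update(prior_a, prior_b, successes, failures):
    """Return posterior Beta parameters after observing binomial data."""
    return prior_a + successes, prior_b + failures

# Prior distribution: belief of roughly a 20% complication rate,
# encoded as Beta(2, 8), which has mean 2 / (2 + 8) = 0.2
prior_a, prior_b = 2, 8

# Likelihood: the current trial observed 3 complications in 30 patients
post_a, post_b = beta_update(prior_a, prior_b, successes=3, failures=27)

# Posterior distribution: updated belief combining prior and trial data
posterior_mean = post_a / (post_a + post_b)
print(post_a, post_b)            # 5 35
print(round(posterior_mean, 3))  # 0.125
```

Note how the posterior mean (12.5%) sits between the prior belief (20%) and the raw trial rate (10%), weighted by how much data each contributes; this is the "supplementation of conventional analysis" the text describes.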
2.11.3 Qualitative Comparative Analysis (QCA)

This method employs truth tables to categorise qualitative data into mathematical values. It requires the setting of thresholds to classify a finding as either positive or negative (binary 1 or 0). Based on these thresholds, studies can be classified into scores and assessed by Boolean algebra [37]. An example is given in Table 2.1. In this hypothetical set of manuscripts, the Boolean equation is D = A + B + C (where “+” denotes logical OR). Thus, it can be seen that surgical errors can result from any or all of poor communication, distractions in theatres and inexperienced operators.
2.12 Surgical Decision-Making and Clinical Judgement Analysis

The main reason to perform EBS is to improve the quality of care and the outcomes of our patients by modifying our surgical judgements and decisions. According to a well-known surgical aphorism, “a good surgeon knows how to operate; a better surgeon knows when to operate; the best surgeon knows when not to operate” [39]. Appropriate clinical judgement and decision-making skills are considered to be of paramount importance in surgery, and hence, in the UK, they have
Fig. 2.8 Bayesian statistics: the prior distribution (a reasonable opinion excluding trial evidence) is combined with the likelihood (trial evidence) through Bayes’ theorem, posterior probability = prior × likelihood × constant, to give the posterior distribution (final opinion)
Table 2.1 Hypothetical truth table showing causes of surgical error

Conditions (explanatory variables)    Surgical error (dependent variable)    Number of reports
A    B    C                           D
0    0    0                           0                                      37
1    0    0                           1                                      12
0    1    0                           1                                      24
0    0    1                           1                                      19
1    1    0                           1                                      8
1    0    1                           1                                      26
0    1    1                           1                                      14
1    1    1                           1                                      21

A = poor communication; B = distractions in theatres; C = inexperienced operators. Table format based on [38]
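The Boolean reduction behind Table 2.1 can be checked mechanically. The sketch below simply encodes the table rows above and verifies the equation D = A + B + C (logical OR); the per-condition report totals are a small illustrative extension, not part of the original analysis.

```python
# Sketch of the QCA truth-table check for Table 2.1. Conditions
# A (poor communication), B (distractions in theatres) and
# C (inexperienced operators) are scored 1/0 against outcome
# D (surgical error), with the number of reports per configuration.

rows = [  # (A, B, C, D, number_of_reports)
    (0, 0, 0, 0, 37), (1, 0, 0, 1, 12), (0, 1, 0, 1, 24),
    (0, 0, 1, 1, 19), (1, 1, 0, 1, 8),  (1, 0, 1, 1, 26),
    (0, 1, 1, 1, 14), (1, 1, 1, 1, 21),
]

# Verify the Boolean equation D = A + B + C for every configuration:
# an error occurs whenever any single condition is present
consistent = all(d == (a or b or c) for a, b, c, d, _ in rows)
print(consistent)  # True

# Reports implicating each condition, weighted by the report counts
totals = {name: sum(row[4] for row in rows if row[i])
          for i, name in enumerate(["A", "B", "C"])}
print(totals)  # {'A': 67, 'B': 67, 'C': 80}
```

Because every row with at least one condition set has D = 1, and the all-zero row has D = 0, the minimal Boolean expression is the simple disjunction A + B + C, matching the conclusion in the text.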
recently been explicitly included in the Intercollegiate Surgical Curriculum Project [26]. Surgical judgement and decision-making can range from very well-defined levels, with relatively narrow range of options (e.g. to use one surgical forceps over another during an operation), to less well-defined situations in which surgeons need to consult their patients before reaching a decision (e.g. whether to offer surgery to a patient given the stage of his/her disease and lifestyle factors) [28, 36]. Regardless of the level of
complexity of the situation, from a psychological perspective, optimal surgical judgement encompasses three components: experience, evidence and inference (Fig. 2.9). Surgeons make judgements and decisions on the basis of available information and their interpretation of it. The gathering and processing of information are cognitive processes, open to influences from both our “cognitive architecture” and the external environment (Fig. 2.10). Simply put, we examine our patients and gather relevant diagnostic information (e.g. blood tests, laboratory findings). Each piece of information is a “cue”, which we use to form a clinical judgement; this judgement then leads us to make a decision (e.g. to treat or not, or whether to offer a laparoscopic or open procedure). In the process of forming a judgement, these cues are weighted, although we are usually not consciously aware of this weighting process (unless we are dealing with a very difficult decision). Our final decision is driven by the weightings of the individual cues we have considered, and also by the influences of internal cognitive factors (e.g. our inherent limitations in processing large quantities of information) and external environmental influences (e.g. the time we have to see a patient in clinic or on a ward round) [28]. Psychologists have developed models that explain how this integration of the various cues works. A model of particular relevance to surgery is that
[Fig. 2.9 diagram: surgical judgement comprises three components. Inference (Bayesian): pros – facilitates decision-making in the absence of rigorous data; cons – depends on assumptions and is subjective. Experience (anecdotal): pros – can answer undefined elements; cons – affected by selective memory and ego. Evidence (frequentist): pros – applies rigorous probability data from large patient cohorts; cons – may not represent the specific individual case.]

Fig. 2.9 Components of judgement. Based on Marshall [27]
[Fig. 2.10 diagram: information cue A (e.g. based on a blood test, given an importance weighting of x) and information cue B (e.g. a clinical finding, given an importance weighting of y) feed into a surgical judgement leading to a decision, subject to environmental factors (time, resources, physical discomfort, personal risk) and cognitive factors (fatigue, anger, competitiveness, guilt).]

Fig. 2.10 Factors influencing judgements leading to decisions
developed by Egon Brunswik and known as Social Judgement Theory – with its quantitative application, Clinical Judgement Analysis [40–43]. Social Judgement Theory treats surgical (or any other) judgement as a linear multiple regression model. Different cues are considered and assigned weights by the surgeon – thus, the relative importance for each cue can be algebraically estimated and surgeons can be classified into different subgroups, depending on the importance they assign to different cues: this reveals how different surgeons approach a clinical decision. This is important in EBS as it can allow judgements to be assessed before and after teaching and training in the adoption and application of best evidence. Clinical Judgement Analysis has been used in a number of surgical studies. It has been used to clarify how demographic and lifestyle factors impact the prioritisation decision for patients due to have elective general surgery [44] or cardiac surgery [45], what clinical factors urological surgeons consider when deciding the treatment of prostate cancer [46] and how expert
nephrologists diagnose non-end-stage renal disease [47]. Importantly, this approach allows quantitative feedback to be provided to individual surgeons as a training intervention to improve their clinical decision-making [5, 23, 48]. This is done by assessing personal decisions and the breakdown of the individual weights (i.e. importance) given to each information cue. These results can then provide feedback and be modified for each clinician to achieve results that are based on the best evidence.
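Because Social Judgement Theory treats judgement as a linear multiple regression over cues, the implicit cue weights can be estimated by least squares. The sketch below is purely hypothetical (the case data, cue values and judgement scores are invented to show the mechanics, not drawn from any cited study):

```python
# Hypothetical sketch of Clinical Judgement Analysis: a surgeon's
# judgement is modelled as a linear combination of weighted cues, and
# the implicit cue weights are recovered by least squares.

import numpy as np

# Each row is one case: cue A (e.g. a blood test result), cue B (e.g. a
# clinical finding), and a constant column for the baseline tendency.
cues = np.array([
    [0.9, 0.2, 1.0],
    [0.4, 0.8, 1.0],
    [0.7, 0.7, 1.0],
    [0.1, 0.3, 1.0],
    [0.5, 0.5, 1.0],
])
# The surgeon's recorded judgement (e.g. a priority score) per case
judgement = np.array([0.75, 0.60, 0.80, 0.20, 0.55])

# Least-squares estimate of the relative importance of each cue
weights, *_ = np.linalg.lstsq(cues, judgement, rcond=None)
print(np.round(weights, 2))  # estimated weights for cue A, cue B, baseline
```

Comparing the fitted weights across surgeons (or for one surgeon before and after evidence-based training) is what allows the quantitative feedback described above.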
2.13 Cost Effectiveness in Evidence-Based Surgery

It is becoming increasingly evident that although EBS can reveal the best treatments for a specific disease, the provision of these treatments may not always be possible, particularly from a financial standpoint. In order to supply patients with the best care, both “best treatment” and “available funds” need to be considered. Achieving a
balance between these two factors can be complex and involves not only clinicians, but also health care management, economists and politicians. Economic considerations have traditionally been poorly represented in the evidence-based literature; for example, a systematic review examining the cost effectiveness of using prognostic information to select women with breast cancer for adjuvant chemotherapy revealed only five published papers in the field. Health care costs are now of utmost importance in today’s complex financial markets. In the United States alone, medical care consumes more than 14% of the gross domestic product [49], which could increase to 17.7% by 2012 [50]. Despite the constant rise in new medical treatments, it is now widely recognised that “health interventions are not free, people are not infinitely rich, and the budgets of health care programmes are limited. For every dollar’s worth of health care that is consumed, a dollar will be paid. While these payments can be laundered, disguised or hidden, they will not go away” [13]. In order to incorporate economic considerations in evidence-based guidelines, cost-effectiveness analyses (CEAs) are now being utilised and are gaining increased importance in medical guidelines at all levels (local, national, international). CEAs can reveal the expected benefits, harms and costs of adopting and translating a clinical recommendation into practice [51]. They are a tool that allows decisions to be made within the realistic constraints of health care budgets. This takes place through the expression of a cost-effectiveness ratio, where the difference in cost between two interventions is divided by the difference in the health effects or outcomes of those interventions [52, 53].
(C1 – C2) / (O1 – O2), where C = cost and O = outcome.
The cost C is measured by:
C = cost of intervention + costs induced by the intervention – costs averted by the intervention
Outcomes O can be measured by:
Life-years saved (LYS) = the amount by which an intervention reduces mortality, or
Quality-adjusted life years (QALYs) = the effect of an intervention on both length and quality of life.
Cost-effectiveness analyses that use QALYs are termed cost-utility analyses (CUAs) and have become increasingly important, as the incorporation of quality
of life in the assessment better reflects clinical reality and clinical decision-making. Other economic analyses applied to health services include cost-minimisation analyses and cost-benefit analyses (CBAs), although in contrast to CEAs, they do not have the capacity to compare the value of cost against clinical outcomes [53]. CEAs provide valuable information for developing and modifying health service interventions and preventative measures to obtain the best care at the best value. They can be used to compare the costs and benefits of various interventions for the same pathology or disease (for example, colorectal screening by occult blood tests, barium enemas or colonoscopies). Furthermore, they can clarify which intervention is most cost-beneficial for:
• Specific population subgroups (e.g. off-pump vs. on-pump coronary artery bypass grafting in patients with renal dysfunction)
• Specific population ages (e.g. breast screening by mammography between the ages of 50 and 70)
• Various treatment frequencies and times (e.g. PAP testing for cervical neoplasia every 3 years)
The use of QALYs as an outcome measure in CUAs has shown particular benefit in accounting for patient preferences for some health conditions over others. For example, although numerous trials report the effectiveness of tamoxifen in improving morbidity and mortality in breast and endometrial cancer patients, its effects on perceived health status in these different conditions vary. CUAs allow for such variation and provide policymakers with data that reflect financial suitability and, importantly, population needs and preferences [53]. Arguments against the use of CEAs include:
• A historical lack of standardised CEAs, making comparisons difficult
• A paucity of studies
• A lack of transparency in the complex models applied
• QALYs being non-intuitive
• Ethical concerns (for example, is a year of life saved, or a QALY, for a 70-year-old equivalent to that for a 1-year-old? Or the perception that CEAs can be used as tools for “rationing” in health care.)
Financial considerations are nevertheless inevitable, and there are a number of considerations that can allow
best use of CEAs in evidence-based practice. These include the following [54]:
• Consideration of resource use and not monetary values alone
• Consideration of the specific context of an intervention and the resources needed
• Applying a broad perspective, particularly at national and international levels
• Consideration of the quality of evidence and the quantity of resource expenditure
• Applying up-to-date economic models
CEAs, therefore, are a powerful tool for selecting evidence-based interventions and protocols best suited to the budget constraints of health care institutions, while also accommodating the preferences of both clinicians and patients. As a result, CEA scores can be classified in a league table, which permits the selection and prioritisation of treatments either locally within a health care institution, or at a broader national or international level.
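The incremental cost-effectiveness ratio defined earlier, (C1 – C2) / (O1 – O2), can be sketched numerically. The costs and QALY figures below are invented purely for illustration:

```python
# Worked sketch of the cost-effectiveness ratio:
# ICER = (C1 - C2) / (O1 - O2). All numbers are hypothetical.

def icer(cost_new, cost_old, outcome_new, outcome_old):
    """Incremental cost per unit of outcome gained (e.g. per QALY)."""
    return (cost_new - cost_old) / (outcome_new - outcome_old)

# Hypothetical comparison: a new procedure costs 12,000 and yields
# 6.0 QALYs; the comparator costs 7,000 and yields 5.5 QALYs.
ratio = icer(cost_new=12000, cost_old=7000, outcome_new=6.0, outcome_old=5.5)
print(ratio)  # 10000.0, i.e. an incremental cost of 10,000 per QALY gained
```

Ratios like this are what populate the CEA league tables mentioned above: interventions with a lower incremental cost per QALY are prioritised within a fixed budget.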
2.14 Surgical Training in Evidence-Based Techniques

In order to sustain an evidence-based culture, training programmes in EBS are essential. When teaching this topic, it is important not only to teach the techniques of evidence searching (as above), but also to contextualise the whole process so that it makes sense to the individual user at an individual institution. Principles of teaching these processes include the following [28]:
• Compiling a list of sources of evidence
• Identifying the influences on decision-making and the role of evidence
• Applying appropriate levels of evidence for decisions in context
• Discussing tactics for acquiring evidence at an appropriate level
• Discussing implementation strategies
To successfully fulfil all these steps, the teaching needs to be an interactive process. Ideally, it requires pairing junior and senior clinicians together, allowing mutual insight to be discerned from each other’s clinical experience. Furthermore, it is vital that a
variety of “open-minded” teaching methods are applied to facilitate the learning process. These include brainstorming, role-playing and adoption of a variety of multimedia tools. In these situations, some individuals should be chosen to help lead the brainstorming and minute-keep the conclusions of each individual pair or group, so as to spread the knowledge to a wider teaching group [28]. Toedter et al. [55] have designed and implemented an EBS teaching schedule to enable all the residents in their programme to develop and refine their EBS skills in a context as close as possible to that in which they will use EBS in their clinical practice. To achieve this, they are given a clinical question (something they might very well be asked by an attending surgeon during rounds) and are asked to demonstrate their competence in finding the best available evidence to answer it. They apply a multi-disciplinary collaborative approach to address “the four steps of EBS”, and each EBS group includes a resident or registrar (junior clinicians), attending surgeon or consultant (senior clinician) and medical school or university librarian. In this context, the senior clinician would lead the formulation of an evidence-based question and also integrate the information in accordance with the best available evidence. The resident or registrar would act as a research coordinator to critically appraise the literature to find the best evidence, and the medical librarian would lead the focused search on the relevant literature. It was demonstrated that evidence-based performance of a resident was related to his or her ability to gather the best evidence in answer to a clinical question (P = 0.011). However, it was also revealed that after additional training, the residents improved their evidence-based skills. It can be concluded that these skills can no longer be limited to academic surgeons, but are to encompass all surgeons universally. 
Evidence-based concepts will necessarily be required at all levels of surgical education, beginning in the formative years of medical school and continuing to the end of surgical practice.
2.15 Ethics
At a cursory level, EBS seems very straightforward from an ethical perspective: using "best evidence" is literally the optimum strategy for our patients
Fig. 2.11 Ethics in evidence-based surgery and research, based on Stirrat [56] and Burger et al. [57]. Elements include: informed consent; challenges of distributive justice; equipoise (patient and surgeon); what is the most appropriate research design?; and when should a procedure be formally evaluated?
and applying anything less could be considered suboptimal. However, on more in-depth analysis, a number of ethical questions arise when studying the processes involved in EBS (Fig. 2.11). Two broad concepts need to be addressed:
• What qualifies a surgical procedure or technique as having "sufficient" evidence for use?
• At what point in introducing a new procedure or technique are we protecting our patients, and at what point exposing them to an unknown risk?
To answer these questions, the fundamental principles of medical ethics [58, 59] need to be considered. These include beneficence, non-maleficence, autonomy, justice, dignity and truthfulness. Topics arising specifically in EBS [60, 61] (Fig. 2.11) include issues of informed consent: patients need to be clearly aware of whether an operation is for research, based on evidence or based purely on tradition. Furthermore, the reasons for the responsible surgeon's use of this operation need to be clearly identified. Whether the operation is experimental, new or well-practised, the risk–benefit balance for the patient needs to be specified. If the operation is selected on evidence-based grounds, then the level of evidence needs to be communicated to the patient in a way that he or she understands. Surgeons need to specify the research design behind the evidence and discuss its appropriateness. Both surgeons and patients have their own equipoise, and surgical choices based on personal biases should be set aside in favour of the best evidence and objectivity. There also remains the issue of ethics in a health care world of limited finances and the challenges of distributive justice. What is the cost-effectiveness of
each procedure? Should there be rationing in health care on evidence-based grounds? And how does one address the situation in which the best evidence points to an expensive treatment that cannot be afforded by some communities? These considerations should be made by individual surgeons, and also by surgical institutions at both national and international levels. The concept of EBS is to ensure that each patient is treated on the grounds of best knowledge. Applying ethics to this evidence adds compassionate morals to evidence-based decision-making, which in turn leads to the best possible humane care for patients.
2.16 Conclusion
EBS is no longer only about doing randomised controlled trials, nor is it only for senior academic surgeons. It is for all surgeons, their colleagues and their patients. It works on the principle that best surgical practice is achieved through best surgical evidence. It is now inevitable and has the potential to address all the primary needs of our patient-oriented surgical practice, namely:
• Patient management
• Patient care
• Patient safety
• Patient outcomes
• Patient satisfaction
Fig. 2.12 Evidence-based surgery algorithm. Elements include: surgical question; problem clarification; collection of information/evidence; surgical research; goal-orientated evidence synthesis; judgement analysis for optimal decision (drawing on ethics, budget, values, preferences and the local context); identification of the need for change; development of treatment guidelines (local, national and international), with reflection on similar questions and rational use of technology; comparison of performance with standards; ensuring adequate training and education; implementing change; and monitoring the effect of the decision.

The steps required, from deciding what evidence to find through to implementing changes that best reflect this evidence, are illustrated in Fig. 2.12. Many of these steps require clear and logical questioning, dedication in
pursuit of excellence, and ultimately a culture in which best evidence is not a bonus but a fundamental requirement. This cannot be achieved simply by individuals; it requires teamwork at all levels of health care, from local to national and international. Furthermore, EBS is not a one-way process: it requires reassessment, revision, repeated searches and constant updating to reflect new advances in surgical evidence. As surgeons, it is not only our duty to contribute to the momentum of evidence-based practice, but a necessity for us to lead many of these strategies. This requires universal training and re-training in evidence-based methods to reach a level of understanding that places best evidence at the heart of our surgical careers. For the vast majority of surgeons worldwide, the traditional practice of hand hygiene before operating is now intuitive and "second nature". For the next generation of surgeons, it would be ideal to consider "best evidence" just as instinctively before making any surgical decision.
References 1. Brater DC, Daly WJ (2000) Clinical pharmacology in the Middle Ages: principles that presage the 21st century. Clin Pharmacol Ther 67:447–450 2. Evidence Based Medicine Working Group (1992) Evidencebased medicine. A new approach to teaching the practice of medicine. JAMA 268:2420–2425 3. Darzi A (2008) High quality care for all: NHS next stage review final report. Department of Health, London 4. Birch DW, Eady A, Robertson D et al (2003) Users’ guide to the surgical literature: how to perform a literature search. Can J Surg 46:136–141 5. Jacklin R, Sevdalis N, Darzi A et al (2008) Efficacy of cognitive feedback in improving operative risk estimation. Am J Surg 197:76–81 6. Oxford Centre for Evidence-based Medicine (2001) Levels of evidence. Available at: http://www.cebm.net/index.aspx? o = 1025 7. Sackett DL, Straus SE, Richardson WS et al (2000) Evidence-based medicine: how to practice and teach EBM. Churchill Livingstone, London 8. Eddy DM (2005) Evidence-based medicine: a unified approach. Health Aff (Millwood) 24:9–17 9. Pang T, Gray M, Evans T (2006) A 15th grand challenge for global public health. Lancet 367:284–286 10. NHS (2008) National knowledge Service (of the National Health Service, United Kingdom). Available at: http://www. nks.nhs.uk/ 11. American College of Surgeons (2008) Continuous quality improvement. Available at: http://www.facs.org/cqi/index.html 12. Jones RS, Richards K (2003) Office of Evidence-Based Surgery: charts course for improved system of care. Bull Am Coll Surg 88:11–21 13. Eddy DM (1992) A manual for assessing health practices and designing practice policies: the explicit approach. American College of Physicians, Philadelphia 14. Nelson R, Edwards S, Tse B (2007) Prophylactic nasogastric decompression after abdominal surgery. Cochrane Database Syst Rev:CD004929 15. Porter ME, Teisberg EO (2007) How physicians can change the future of health care. JAMA 297:1103–1111 16. 
Neuhauser D (1990) Ernest Amory Codman, M.D., and end results of medical care. Int J Technol Assess Health Care 6:307–325 17. Wennberg J (2008) Commentary: a debt of gratitude to J. Alison Glover. Int J Epidemiol 37:26–29 18. DeBakey ME (1947) Military surgery in World War II: a backward glance and a forward look. N Engl J Med 236: 341–350 19. Cochrane AL (1972) Effectiveness and efficiency: random reflections on health services. Nuffield Provincial Hospitals Trust, London 20. Kouchoukos NT, Ebert PA, Grover FL et al (1988) Report of the Ad Hoc committee on risk factors for coronary artery bypass surgery. Ann Thorac Surg 45:348–349 21. Horton R (2004) A statement by the editors of the lancet. Lancet 363:820–821 22. McCulloch P, Taylor I, Sasako M et al (2002) Randomised trials in surgery: problems and possible solutions. BMJ 324:1448–1451
23. Jacklin R, Sevdalis N, Harries C et al (2008) Judgment analysis: a method for quantitative evaluation of trainee surgeons’ judgments of surgical risk. Am J Surg 195:183–188 24. McCulloch P, Badenoch D (2006) Finding and appraising evidence. Surg Clin North Am 86:41–57; viii 25. University of Virginia Health System (2009) Navigating the maze: obtaining evidence-based medical information. Available at: http://www.hsl.virginia.edu/collections/ebm/overview.cfm 26. ISCP (2005) The New Intercollegiate Curriculum for Surgical Education. Intercollegiate Surgical Curriculum Project, London. (http://www.iscp.ac.uk/) 27. Marshall JC (2006) Surgical decision-making: integrating evidence, inference, and experience. Surg Clin North Am 86:201–215; xii 28. Sevdalis N, McCulloch P (2006) Teaching evidence-based decision-making. Surg Clin North Am 86:59–70; viii 29. OpenClinical (2000) The medical knowledge crisis and its solution through knowledge management (White Paper), London 30. Glickman SW, Baggett KA, Krubert CG et al (2007) Promoting quality: the health-care organization from a management perspective. Int J Qual Health Care 19:341–348 31. Ioannidis JP (2006) Indirect comparisons: the mesh and mess of clinical trials. Lancet 368:1470–1472 32. Salanti G, Kavvoura FK, Ioannidis JP (2008) Exploring the geometry of treatment networks. Ann Intern Med 148:544–553 33. National Health and Medical Research Council (1999) A guide to the development, evaluation and implementation of clinical practice guidelines. Available at: http://www.nhmrc.gov.au/publications/synopses/_files/cp30.pdf 34. National Institute for Health and Clinical Excellence (2008) Moving beyond effectiveness in evidence synthesis – methodological issues in the synthesis of diverse sources of evidence. Available at: http://www.nice.org.uk/niceMedia/docs/Moving_beyond_effectiveness_in_evidence_synthesis2.pdf 35. Greenland S (2006) Bayesian perspectives for epidemiological research: I.
Foundations and basic methods. Int J Epidemiol 35:765–775 36. Spiegelhalter DJ, Myles JP, Jones DR et al (2000) Bayesian methods in health technology assessment: a review. Health Technol Assess 4:1–130 37. Pope C, Mays N, Popay J (2007) Synthesizing qualitative and quantitative health research: a guide to methods. Open University Press, Maidenhead 38. Ragin CC (1992) The comparative method: moving beyond qualitative and quantitative strategies. University of California Press, Berkeley 39. Kirk RM, Mansfield AO, Cochrane JPS (1999) Preface. In: Kirk RM, Mansfield AO, Cochrane JPS (eds) Clinical surgery in general. Churchill Livingstone, London 40. Brunswik E (1952) The Conceptual framework of psychology. University of Chicago Press, Chicago 41. Cooksey RW (1996) Judgment analysis: theory, methods, and applications. Academic Press, San Diego 42. Cooksey RW (1996) The methodology of social judgement theory. Think Reason 2:141–173 43. Sevdalis N, Jacklin R (2008) Opening the "black box" of surgeons' risk estimation: from intuition to quantitative modeling. World J Surg 32:324–325 44. MacCormick AD, Parry BR (2006) Judgment analysis of surgeons’ prioritization of patients for elective general surgery. Med Decis Making 26:255–264
45. Kee F, McDonald P, Kirwan JR et al (1997) The stated and tacit impact of demographic and lifestyle factors on prioritization decisions for cardiac surgery. QJM 90:117–123 46. Clarke MG, Wilson JR, Kennedy KP et al (2007) Clinical judgment analysis of the parameters used by consultant urologists in the management of prostate cancer. J Urol 178:98–102 47. Pfister M, Jakob S, Frey FJ et al (1999) Judgment analysis in clinical nephrology. Am J Kidney Dis 34:569–575 48. Denig P, Wahlstrom R, de Saintonge MC et al (2002) The value of clinical judgement analysis for improving the quality of doctors’ prescribing decisions. Med Educ 36:770–780 49. Levit K, Smith C, Cowan C et al (2004) Health spending rebound continues in 2002. Health Aff (Millwood) 23:147–159 50. Heffler S, Smith S, Keehan S et al (2003) Health spending projections for 2002–2012. Health Aff (Millwood) Suppl (Web Exclusives):W354–W365 51. Gold MR, Siegel JE, Russell LB et al (1996) Cost-effectiveness in health and medicine. Oxford University Press, New York 52. Gazelle GS, McMahon PM, Siebert U et al (2005) Cost-effectiveness analysis in the assessment of diagnostic imaging technologies. Radiology 235:361–370
53. Saha S, Hoerger TJ, Pignone MP et al (2001) The art and science of incorporating cost effectiveness into evidence-based recommendations for clinical preventive services. Am J Prev Med 20:36–43 54. Guyatt GH, Oxman AD, Kunz R et al (2008) Incorporating considerations of resources use into grading recommendations. BMJ 336:1170–1173 55. Toedter LJ, Thompson LL, Rohatgi C (2004) Training surgeons to do evidence-based surgery: a collaborative approach. J Am Coll Surg 199:293–299 56. Stirrat GM (2004) Ethics and evidence based surgery. J Med Ethics 30:160–165 57. Burger I, Sugarman J, Goodman SN (2006) Ethical issues in evidence-based surgery. Surg Clin North Am 86:151–168; x 58. Coughlin SS, Beauchamp TL (1992) Ethics, scientific validity, and the design of epidemiologic studies. Epidemiology 3:343–347 59. Weijer C, Dickens B, Meslin EM (1997) Bioethics for clinicians: 10. Research ethics. CMAJ 156:1153–1157 60. Stirrat GM (2004) Ethics and evidence based surgery. J Med Ethics 30:160–165 61. Burger I, Sugarman J, Goodman SN (2006) Ethical issues in evidence-based surgery. Surg Clin North Am 86:151–168
3 The Role of the Academic Surgeon in the Evaluation of Healthcare Assessment
Roger M. Greenhalgh
Contents
3.1 Introduction
3.2 Clinical Practice
3.3 Training Programme
3.4 Advance of Subject and Historical Perspective
3.5 Health Technology
3.5.1 Clinical Trial Expertise
3.5.2 Statistical Knowledge
3.5.3 Health Economics
3.5.4 Cost Effectiveness Modelling
3.5.5 Health-Related Quality of Life and Patient Preference
References
Abstract The academic surgeon needs to have much energy and be intent on moving the subject forwards. It is first necessary to set up a fine regional facility and to integrate a regional training programme. This is merely the beginning, as the academic surgeon must have historical perspective, knowing where the subject stands in historical terms and where it is likely to go next. There are clues as to how this can be anticipated, as explained in this chapter. The surgeon then needs to integrate with a multidisciplinary team to bring the subject forward. The team will have clinical trial expertise, statistical knowledge and cost-effectiveness skills. The opinion of the patient is always paramount and must be measured. These issues bring more benefit to more patients the world over than a single well-performed operation. Thus, the academic surgeon must be both a great surgeon and a humble coordinator of disciplines in the patient interest.
3.1 Introduction
R. M. Greenhalgh
Division of Surgery, Oncology, Reproductive Biology & Anaesthetics, Imperial College, Charing Cross Hospital, Fulham Palace Road, London W6 8RF, UK
e-mail: [email protected]

Should I go into an academic career in surgery? What can I expect, and am I suited for it? For the right person, it is the most exhilarating experience. For others, the mere expectation of "research" is anathema. "Let me get on with the practice of surgery!" This is what professors hear from some. A trainee spends years satisfying academic requirements before being able to consider becoming a doctor, and then years more training in surgery. Is that not enough of a studying load? The problem is that the subject is not finite, in the sense that there are no agreed boundaries of minimal knowledge required to practise as a surgeon. When we face the uphill climb of the knowledge base needed to reach the standard to practise surgery, we reasonably ask which books have what is
T. Athanasiou (eds.), Key Topics in Surgical Research and Methodology, DOI:10.1007/978-3-540-71915-1_3, © Springer-Verlag Berlin Heidelberg 2010
Fig. 3.1 The life of the academic surgeon is summarised in four interlinked orbits of activity: clinical practice & service; health technology; training the next generation; and subject advance.
required for the purpose. The inference is that, once read and understood, that is enough! It often gets candidates through assessment hurdles, but then comes the "rude awakening" that the subject moves on. Fortunately, there are some who are not put off by subject advance. Such surgeons in training actually like their research attachment. What is it that they like? Undoubtedly, some "pretend" to enjoy research, believing it will hasten their promotion! They "bite the bullet", get the research over with and move on, or undergo a career change. There are some who love the experience of research, not so much the hard grind as the analysis and the delivery of results on the podium. Some young surgeons are born to be "on the podium". This seems to be a stimulus for some, and these tend to become the good undergraduate and postgraduate teachers. They are a constant joy for an academic group. There are, then, some who are inclined towards a career in academic surgery. What is involved long term? I recognise a life of four orbits of activity, as outlined in Fig. 3.1.
3.2 Clinical Practice
The chief of an academic group must first be certain that there is a high throughput of patients, and the best way to achieve this is to perform a speciality well with a good team; referrals will follow. It is vital at the outset, when setting up the group, to define the clinical area carefully and to work industriously to achieve very good documentation of patients from the very start. At the beginning of my professorial life, I found that the hospital notes were in an appalling state, and I determined to design the specific documentation sheets that would be required for the speciality. This was vital so that colleagues worked almost by protocol, in that they answered set questions of history and recorded specific clinical findings. This is no substitute for prospective data, but it helps enormously to reduce data gaps and, most importantly, it points the trainee in the best direction for optimal patient management and documentation. Every link with the regional hospitals is worthwhile, and it can be an advantage to set up rotations of junior staff to a variety of regional hospitals. This achieves a number of objectives. Firstly, it provides an easy route for referral of a patient to a specialist service. Secondly, it improves relationships between colleagues at the hospitals, to the benefit of patients, by aiding transfer when needed. Thirdly, it is good for training programmes that trainees have a wide variety of environments in which to work.

3.3 Training Programme
The academic surgeon has a responsibility towards the trainees of the region. The academic head is not necessarily the person who arranges the rotations, but must be very aware of the training issues and the conditions of work of the trainees. Each trainee must have a clear aim: a goal and a good idea about what type of surgeon they wish to be in the end. In what type of place do they wish to work? Will they teach? Will they have a postgraduate responsibility? Will they practise research? A good matching process is vital.

3.4 Advance of Subject and Historical Perspective
Whichever branch of academic surgery is chosen, it is vital to aim to advance the subject, and to do that, it is necessary to understand the basic science underlying it. For example, it would be
Fig. 3.2 Surgical advance passes through defined phases, and the pattern applies in many instances: fresh air (sanatoria); the surgical pioneering era; lesser interventions; medical non-operative management; and prevention.
incomprehensible for a vascular surgeon to fail to understand the basic vascular biology of the subject. It is similar for every branch of academic surgery: a thorough grasp of the basic subject is essential. Why is this? It is because an academic surgeon must have historical perspective. A moment of contemplation will indicate that subject advance frequently passes through distinct phases (Fig. 3.2). I will give examples to show what I mean, as it is a vital message for the academic surgeon to perceive the relentless advance and not be taken by surprise. Rather, he must drive advance. In the 1950s, as a child, I was aware of a dreadful condition known in lay terms as "galloping consumption", which I later recognised as tuberculosis, most frequently of the lung but also of the bone. I was musical and hated that Mozart died so young. Children with tuberculosis, from Mozart's time until my childhood, were taken to sanatoria where the air was fresh. There the young child would be hospitalised for years in a so-called "sanatorium": one year for an ankle, two for a knee, three for a hip and five for a spine. Years in hospital, and for what? "Fresh air and rest"! The temples to Asclepius, the healing god of the ancient Greek world, were always built on raised ground with a gentle wind for the healing process. Sanatoria were, thus, always found on a hill where the air is good.
This habit continued until recent years. Thus, the Empress Sissi of Austria, wife of Kaiser Franz-Josef, was sent by her doctors to Corfu to recover, and "fresh air" was all that the doctors could offer, with rest, rest and more rest. This state persisted until well after the Second World War. Then, suddenly, the surgical era came and dramatic surgery was prescribed. For tuberculosis of the lung, thoracoplasty was introduced, along with crushing of the phrenic nerve to paralyse the diaphragm; thus, by the old principle of localised rest, the body had its only chance of healing. In around 1956, streptomycin [1], the first antibiotic found to be active against tuberculosis, was introduced, and in no time the sanatoria started to empty and the surgical procedures were abandoned. At the time, there were trained thoracic surgeons available; suddenly, there was less thoracic surgery required, and depression set in for the speciality. It is a very sad sight to see a trained and disgruntled surgeon who has been sidelined as a result of change in the advancement of treatment. This naturally incites some older surgeons to be "conservative" by nature, but it is better to predict change, to drive inevitable change, rather than resist what is obviously better for the patient. For tuberculosis, this was not the end: very shortly, the physicians became relatively superfluous, because a programme of tuberculin testing of cows was commenced in the community and those young people who had never had the disease were inoculated; so, by prevention in the community, the condition was virtually eradicated at that time. Which phase does the patient like best? I will leave that to the reader, but suffice it to say that surgical dominance and patient needs do not necessarily go hand in hand! Another more recent example is the management of peptic ulceration.
After years of admission for bed rest only, the surgical era brought great relief with such mammoth operations as the Polya and Billroth gastrectomies [2]. These carried a significant mortality and were associated with the post-operative complication of "dumping". However, patients still queued up for gastrectomy, as they were promised relief of symptoms. Lesser operations followed, in particular vagotomy [3], which was commonplace in the 1970s. The extent of the procedure was reduced in the version known as "highly selective vagotomy", in which the nerve of Latarjet was preserved and, with it, the function of the pylorus. So we had "surgery without dumping and surgery without diarrhoea" [4, 5]. Here was the expected
Fig. 3.3 Some health technology areas of vital importance: clinical trial expertise; statistical knowledge; health economics; cost effectiveness modelling; health-related quality of life; and patient preference.
refinement of major surgery, which no patient ever wanted to face; less invasive, more patient-friendly surgery was inevitably introduced and was popular compared with the larger procedures. It was not to end here. Very soon came the drugs that switched off the acid pump. How were these designed? Through an understanding of the digestive process; many academic surgeons, several with a Scottish background, brought about this advance. Again, prevention was the final stage, once it was better understood which patients get peptic ulceration and so how a preventative approach could be used. An example is found in intensive care, where peptic perforation is common if not prevented; today, it is prevented by the timely use of drugs to turn off the acid pump. The historical perspective is relevant to every disease situation and every branch of surgery. It helps to be able to "step back" and to see where the subject is now, and so where it will go next. It will not stay as it is. You drive it, or you are left behind!
3.5 Health Technology
We have thus far considered the role of the academic surgeon and what skills he needs to be in a position to start the evaluation of healthcare assessment. He is now ready to commence this demanding task. How should he set about it?
As is clear from the above, there is much more than the pure subject of surgery in the evolution of all managed conditions, and advancement in current management needs a team of skilled experts to work together. The surgeon is but a cog in the machine and many parts are required. Sometimes it falls to the surgeon to convene such a group and at least, he needs to be part of one. Gone are the days when an academic surgeon would have his laboratory for trainee surgeons to “do a bit of lab work for a thesis”. It is an era past and rightly so. Why is this? It is because subject advance can only be achieved with multi-disciplinary skills, and the group will need the full range of “health technology assessment”. This is summarised in Fig. 3.3.
3.5.1 Clinical Trial Expertise
When I became an academic surgeon, this skill was not defined and so not available. I realised the need to retrain and work with clinical trial experts when I saw what the results of large trials did for clinical practice. I will give one example, but there are many. In the 1980s, in the United States, it was alleged by vociferous neurologists that surgeons were slashing open necks willy-nilly to operate on the carotid artery, and that there was not a shred of evidence to support the intervention. The Society for Vascular Surgery took serious umbrage, but Dr Henry Barnett, a neurologist in London, Ontario, Canada, was right. He pointed out that physicians needed better justification than this even to be allowed to prescribe drugs, let alone to operate on a neck as surgeons did! This provoked neurologists and surgeons to put the operation to the ultimate test. A multi-centre trial was organised and symptomatic patients were entered into the North American Symptomatic Carotid Endarterectomy Trial (NASCET) [1, 6]; a similar European Carotid Surgery Trial, led by Charles Warlow, started almost simultaneously [2, 7]. Critically, both trials were set up with surgeons and neurologists working together. A clinical alert from NASCET was released after 18 months, to the effect that surgical patients did vastly better than non-surgical patients in a randomised trial in which the two groups were as near identical as possible but treated differently. There was a 17% benefit for surgery with best medical treatment over best medical treatment alone. At 18 months, the trial was stopped. The European trial showed very similar findings, and the power of the
two trials taken together was to inform the whole world what to do in the circumstances of so-called "transient ischaemic attacks" and "mini strokes". The operation had been with us since 1953, but it took this long to prove it worked! I had performed carotid surgery for years and remember being sad that it was challenged by Barnett, but he was right. I then witnessed a massive increase in referrals for the operation, as doctors felt it was the right course of action in the patient interest. I had seen the power of clinical trials and needed to learn how to do them. There are times when a randomised controlled trial is not possible, and other techniques may then be used, such as case–control studies and longitudinal cohort studies. These are described elsewhere in this book. It is important for the academic surgeon to have a grasp of the "levels of evidence" quoted elsewhere and of the technique of meta-analysis, a most useful tool. The concept of the Cochrane review is also vital to understand.
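The inverse-variance, fixed-effect pooling at the heart of a simple meta-analysis can be sketched in a few lines. This is an illustrative sketch only: the three study estimates and standard errors below are invented, not taken from the trials discussed above.

```python
import math

def fixed_effect_pool(estimates, std_errs):
    """Fixed-effect (inverse-variance) meta-analysis.

    Pools per-study effect estimates (e.g. log odds ratios),
    weighting each study by the inverse of its variance, and
    returns the pooled estimate with its standard error.
    """
    weights = [1.0 / se ** 2 for se in std_errs]
    pooled = sum(w * est for w, est in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

# Hypothetical log odds ratios from three surgical trials
log_ors = [-0.40, -0.25, -0.60]
ses = [0.20, 0.15, 0.30]

pooled, se = fixed_effect_pool(log_ors, ses)
lo, hi = pooled - 1.96 * se, pooled + 1.96 * se
print(f"pooled log OR = {pooled:.3f} (95% CI {lo:.3f} to {hi:.3f})")
# → pooled log OR = -0.345 (95% CI -0.563 to -0.126)
```

Note that the pooled estimate always lies within the range of the individual studies, and its standard error is smaller than that of any single study; a real meta-analysis would also test for heterogeneity before trusting a fixed-effect model.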
3.5.2 Statistical Knowledge
To begin a randomised controlled trial, a statistician is required very early. At first, we made the mistake of turning to a statistician at the end of some documentation of procedures. We might say, "Make some sense of this", and the answer should not be printed! It is crucial to work with a statistical group in the design and setting up of the trial. There are a number of ways to approach a problem. Some statisticians favour large groups, and preferably only two groups. Others favour "propensity analysis", and many deplore subgroup analysis. In general, it is a good rule to discuss and agree the statistical plan before any data are collected, and certainly before data are analysed, for fear of introducing bias. The statistician will also perform "power calculations", which are aimed at calculating the numbers required for statistical significance, given certain assumptions. Very occasionally, a surgeon becomes involved in the development of a new method, for example, the "tracker trial" concept [8]. This is a randomised controlled trial comparing "generic treatments", in which subgroups are expressed as a percentage of the alternative generic method and the proportions are compared one with another.
The so-called endovascular aneurysm repair (EVAR) trials used this methodology. In these, all endovascular devices were compared with open repair as two generic treatments. Then, different types of EVAR device were compared with the whole open repair group, and this was repeated for each EVAR type. Finally, the results of one EVAR type relative to open repair were compared with the performance of another specific EVAR type relative to open repair [9–11]. It is also possible to use a "propensity analysis", and we actually presented the results that way [12].
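To illustrate the kind of power calculation mentioned above, the standard sample-size formula for comparing two proportions can be coded directly. This is a sketch under stated assumptions: the event rates, significance level and power are invented for illustration, and real trial design would involve a statistician and allowance for drop-out.

```python
import math
from statistics import NormalDist

def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Approximate sample size per group to detect a difference
    between two event proportions with a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2
    return math.ceil(n)

# Hypothetical: detect a fall in a complication rate from 20% to 10%
print(sample_size_two_proportions(0.20, 0.10))  # → 197 (patients per arm)
```

Asking for 90% power instead of 80% pushes the required numbers up sharply, which is exactly the trade-off the statistician's "power calculations" make explicit before any patient is recruited.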
3.5.3 Health Economics
It soon became clear that the purchasers of health care must have a grasp of the costs of various methods. Thus, it is absolutely vital to have a health economics group working in the multi-disciplinary team. Wherever there is more than one way to treat, cost comes into it. Many of the costs are hidden; some are obvious, others less so. For example, it was shown in Vienna that three factors account for 80% of the cost of aortic aneurysm repair [13]: costs in the operating room, the use of intensive or critical care, and length of stay. Re-interventions and follow-up scans in hospital are another big cost. Clinicians think they can guess at costs, but they cannot; it takes special expertise. I would go so far as to say that every procedure comparison today requires a cost analysis. Journals do not always want the results. I find it best to include economic details with clinical results and not to separate them. This avoids the rejection of cost details by some editors.
3.5.4 Cost Effectiveness Modelling
This is an attempt to gaze into the future. Health care purchasers need an early prediction of what it will cost to adopt a new treatment over the years ahead. Of course, to know the actual answer, time must elapse, and so "assumptions" are made to "model" the result. If a treatment is clinically very effective, its cost may represent good "value for money"; the poorer the clinical benefit, the less likely the procedure is to prove cost effective. These skills have been described relatively recently and are not as widely available as they
should be, but no multi-disciplinary group is complete without these skills. The so-called “Markov” model is commonly used.
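A minimal Markov cohort model can illustrate the mechanics. Everything in this Python sketch (the three states, transition probabilities, costs, utility weights and discount rate) is hypothetical and chosen only to show how such a model is propagated; a real cost-effectiveness model would be parameterised from trial and registry data.

```python
# Minimal three-state Markov cohort model (well / progressed / dead).
# All numbers below are hypothetical, for illustration only.
STATES = ["well", "progressed", "dead"]
TRANSITIONS = [               # annual transition probabilities; rows sum to 1
    [0.85, 0.10, 0.05],       # from well
    [0.00, 0.80, 0.20],       # from progressed
    [0.00, 0.00, 1.00],       # from dead (absorbing state)
]
ANNUAL_COST = [1000.0, 5000.0, 0.0]   # cost per state per year
UTILITY = [1.0, 0.6, 0.0]             # quality-of-life weight per state

def run_model(cycles=10, discount=0.035):
    """Propagate a cohort through the model, accumulating
    discounted costs and quality-adjusted life years (QALYs)."""
    dist = [1.0, 0.0, 0.0]    # whole cohort starts in "well"
    total_cost = total_qalys = 0.0
    for year in range(cycles):
        d = 1.0 / (1.0 + discount) ** year   # discount factor for this cycle
        total_cost += d * sum(p * c for p, c in zip(dist, ANNUAL_COST))
        total_qalys += d * sum(p * u for p, u in zip(dist, UTILITY))
        dist = [sum(dist[i] * TRANSITIONS[i][j] for i in range(3))
                for j in range(3)]
    return total_cost, total_qalys, dist

cost, qalys, final = run_model()
```

Running two such models, one per treatment, and dividing the difference in cost by the difference in QALYs gives the incremental cost-effectiveness ratio that purchasers ask for.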
3.5.5 Health-Related Quality of Life and Patient Preference
It is not all about cost. It is crucial to be sensitive to what is best for the patient, and at what cost. Patient satisfaction is difficult to measure, but it is vital to assess it. There is a tendency to be obsessed with cost, as this determines whether organisations decide to buy the method, but what the patient thinks must be the key issue. There are established methods in commerce which deal with quality of life (QoL), and the "health-related" version (HRQoL) is an extension of these. Opinion polls are well known to politicians, and with suitable modification these techniques can be applied to patients. Thus, patient feedback is assessed formally, but the methodology of this is a new skill, which must be learned and applied.
References
1. Anon (1991) North American Symptomatic Carotid Endarterectomy Trial. Methods, patient characteristics, and progress. Stroke 22:711–720
2. Anon (1991) MRC European Carotid Surgery Trial: interim results for symptomatic patients with severe (70–99%) or with mild (0–29%) carotid stenosis. European Carotid Surgery Trialists' Collaborative Group. Lancet 337:1235–1243
3. Billroth T (1881) [Reports on Billroth's operation]. Wien Med Wochenschr 31:595–634
4. Brown LC, Greenhalgh RM, Kwong GP et al (2007) Secondary interventions and mortality following endovascular aortic aneurysm repair: device-specific results from the UK EVAR trials. Eur J Vasc Endovasc Surg 34:281–290
5. Dragstedt LR, Owens FMJ (1943) Supradiaphragmatic section of the vagus nerves in treatment of duodenal ulcer. Proc Soc Exp Biol Med 53:152–154
6. Greenhalgh RM, Brown LC, Kwong GP et al (2004) Comparison of endovascular aneurysm repair with open repair in patients with abdominal aortic aneurysm (EVAR trial 1), 30-day operative mortality results: randomised controlled trial. Lancet 364:843–848
7. Greenhalgh RM, Brown LC, Kwong GP et al (2005) Endovascular aneurysm repair versus open repair in patients with abdominal aortic aneurysm (EVAR trial 1): randomised controlled trial. Lancet 365:2179–2186
8. Greenhalgh RM, Brown LC, Kwong GP et al (2005) Endovascular aneurysm repair and outcome in patients unfit for open repair of abdominal aortic aneurysm (EVAR trial 2): randomised controlled trial. Lancet 365:2187–2192
9. Holzenbein J, Kretschmer G, Glanzl R et al (1997) Endovascular AAA treatment: expensive prestige or economic alternative? Eur J Vasc Endovasc Surg 14:265–272
10. Humphrey CS, Johnston D, Walker BE et al (1972) Incidence of dumping after truncal and selective vagotomy with pyloroplasty and highly selective vagotomy without drainage procedure. Br Med J 3:785–788
11. Johnston D, Humphrey CS, Walker BE et al (1972) Vagotomy without diarrhoea. Br Med J 3:788–790
12. Keller H, Krupe W, Sous H et al (1956) [Raising of tolerance to streptomycin by the introduction of pantothenic acetates]. Wien Med Wochenschr 106:63–65
13. Lilford RJ, Braunholtz DA, Greenhalgh R et al (2000) Trials and fast changing technologies: the case for tracker studies. BMJ 320:43–46
4
Study Design, Statistical Inference and Literature Search in Surgical Research Petros Skapinakis and Thanos Athanasiou
Contents
4.1 The Basics of Study Design
  4.1.1 Ecological or Aggregate Studies
  4.1.2 Cross-Sectional Surveys
  4.1.3 Case–Control Studies
  4.1.4 Cohort or Longitudinal Studies
  4.1.5 Randomised Controlled Trials
  4.1.6 Systematic Reviews and Meta-Analyses
4.2 The Basics of Statistical Analysis
  4.2.1 The Study Population
  4.2.2 Hypothesis Testing
  4.2.3 Type 1 and Type 2 Errors
  4.2.4 Statistical Power
  4.2.5 Interpreting "Statistically Significant" Results
  4.2.6 Confidence Intervals
  4.2.7 Interpreting "Negative" Results
  4.2.8 Correlation and Regression
4.3 Causal Inference
  4.3.1 What Is a Cause?
  4.3.2 The Multi-Factorial Model of Causation
  4.3.3 Evaluating Causality in the Multi-Factorial Model
  4.3.4 Bradford Hill's Criteria for Causality
4.4 Clinical Importance of the Results: Types of Health Outcomes and Measures of the Effect
  4.4.1 Health Outcomes
  4.4.2 Clinical Importance
4.5 Searching Efficiently the Biomedical Databases
  4.5.1 Structure of a Database
  4.5.2 Structure of PubMed
References
P. Skapinakis () University of Ioannina, School of Medicine, Ioannina 45110, Greece e-mail:
[email protected]

Abstract The aim of this chapter is to provide the reader with the theoretical skills necessary to understand the principles behind critical appraisal of the literature. The presentation follows roughly the order in which a researcher carries out the research. First, we discuss the main types of study design. Second, we briefly describe the basic statistical procedures used in data analysis. Third, we discuss the possible interpretations of an observed association between an exposure and an outcome of interest, including any causal implications. Fourth, we discuss the issue of clinical significance and distinguish it from statistical significance by referring to the types of outcomes used in research and the measures of the effect of a potential risk factor. Finally, we give practical advice to help readers search the biomedical databases, especially MEDLINE, more efficiently.
4.1 The Basics of Study Design
The main study designs used in research (see Table 4.1) can be described as observational (ecological, cross-sectional, case–control and cohort studies), experimental (mainly the randomised controlled trial, or RCT) or summary in nature (systematic reviews and meta-analyses) [1–3].
Table 4.1 Types of study in epidemiological research

Primary research
- Observational: ecological, cross-sectional, case–control, cohort studies
- Experimental: randomised controlled trials (RCTs)

Secondary research
- Summary: systematic reviews, meta-analyses

4.1.1 Ecological or Aggregate Studies
Ecological studies examine the association between disease (or the outcome of interest) and the characteristics of an aggregation of people, rather than the characteristics of individuals. The main difficulty with this design is that an association between exposure and disease at the aggregate level may not be reflected in an association at the individual level; confounding of this kind is often termed the ecological fallacy. An example of an ecological study is the one conducted by Ward et al. [4], which examined the association between statin prescribing rates by general practitioners and several proxies of health care need, including hospital morbidity statistics for coronary heart disease (CHD) episodes. The graph (Fig. 4.1) shows the generally positive association between statin prescribing rates and episodes of CHD. However, there are differences between primary care trusts, and it is now known that these differences are explained by the different needs of the individual patients attending the particular practices.

4.1.2 Cross-Sectional Surveys
This type of descriptive study relates to a single point in time and can therefore report on the prevalence of a disease, but it is affected by the duration of illness. A cross-sectional survey of a whole population avoids many of the problems of selection bias and has frequently been used for the study of common conditions. However, any association found in a cross-sectional survey could reflect an association with either incidence or duration. For example, Skapinakis et al. [6]
[Fig. 4.1 Ecological study of the association between statin prescribing rates and coronary heart disease hospital episode statistics in four primary care trusts (PCT1–PCT4) in the UK; scatter plot of CHD hospital episode statistics rate (0–200) against statin prescribing rates (0–50). From Ward et al. [5]]
studied the sociodemographic and psychiatric associations of chronic fatigue syndrome in a cross-sectional survey of the general population in Great Britain. They found that chronic fatigue syndrome was strongly associated with depression. They also found that other risk factors were independently associated with chronic fatigue (older age, female sex, having children and being in full-time employment) after adjustment for depression. This finding supported the hypothesis that chronic fatigue syndrome has a unique epidemiological profile that is distinct from depression, but longitudinal studies should explore this hypothesis further.
4.1.3 Case–Control Studies
In a case–control study, individuals with the disease (cases) are compared with a comparison group of controls. If the prevalence of exposure is higher in the cases than in the controls, the exposure might be a risk factor for the disease, and if lower, the exposure might be protective. Case–control studies are relatively cheap and quick, and can be used to study rare diseases. However, great care is needed in the design of the study in order to minimise selection bias. It is important to ensure that the cases and controls come from the same population, because the purpose of the "control" group is to give an unbiased estimate of the frequency of exposure in the population from which the cases are drawn. For example, Kendell et al. [7] conducted a case–control study to examine the association between obstetric complications (OCs) and the diagnosis of schizophrenia. They found a highly significant association and concluded that a history of OCs in both pregnancy and delivery is a risk factor for the later development of schizophrenia. However, in a subsequent paper [8], the same group re-analysed the data set of the previous study and reported that the previous findings were not valid due to an error in selecting controls. The method used had inadvertently selected controls with lower than normal chances of OCs, thereby introducing a serious selection bias. In reality, there was no association between schizophrenia and OCs in this data set. A nested case–control study is one based within a cohort study or sometimes a cross-sectional survey. The cases are those that arise as the cohort is followed prospectively and the controls are a random sample of the non-diseased members of the cohort [3].
In a matched case–control study, one or more controls are selected for each case to be similar for characteristics that are thought to be important confounders. The analysis of case–control studies results in the reporting of odds ratios (ORs); case–control studies cannot directly estimate disease incidence rates. If the study is matched, a more complex matched analysis needs to be performed (conditional logistic regression).
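The unmatched analysis described above can be sketched as follows. This Python snippet computes an odds ratio with an approximate 95% confidence interval from a 2×2 table using the standard log-OR method; the counts are hypothetical.

```python
from math import exp, log, sqrt

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and approximate 95% CI from a 2x2 table:
         a = exposed cases,    b = exposed controls,
         c = unexposed cases,  d = unexposed controls."""
    or_ = (a * d) / (b * c)
    se_log_or = sqrt(1/a + 1/b + 1/c + 1/d)   # SE of log(OR)
    lo = exp(log(or_) - z * se_log_or)
    hi = exp(log(or_) + z * se_log_or)
    return or_, lo, hi

# Hypothetical data: 30/70 exposed among cases, 15/85 among controls
or_, lo, hi = odds_ratio_ci(30, 70, 15, 85)
```

Because the lower confidence limit here exceeds 1, the association would be significant at the 5% level; a matched study would instead require conditional logistic regression, as noted above.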
4.1.4 Cohort or Longitudinal Studies
A cohort (or longitudinal, or follow-up) study is an observational study in which a group of "healthy" subjects who are exposed to a potential cause of disease, together with a "healthy" group who are unexposed, are followed up over a period of time. The incidence of the disease of interest is compared in the two groups. Ideally, the exposed and unexposed groups should be chosen to be virtually identical with the exception of the exposure. The ability of a cohort study to rule out reverse causality as a reason for an observed association is of great benefit. One of the best-known cohort studies in medicine is the Framingham Heart Study. From this study, Wilson et al. [9] followed up 2,748 participants aged 50–79 for 12 years and reported in 1988 that low levels of high density lipoprotein cholesterol (HDL-C) were associated with increased mortality, especially from CHD or other cardiovascular causes. Cohort studies always "look forward" from the exposure to disease development, and can therefore be time-consuming and expensive. To minimise costs, historical data on exposure, i.e. information already collected, can be used. The disadvantage of these studies is that exposure measurement is dependent on the historical record that is available. The completeness of follow-up is particularly important in cohort studies. It is essential that as high a proportion of people in the cohort as possible are followed up, and those who migrate, die or leave the cohort for any reason should be recorded. The reasons for leaving the cohort may be influenced by the exposure and/or outcome, and incomplete follow-up can therefore introduce bias. The analysis of cohort studies involves calculation of either the incidence rate or risk of disease in the exposed cohort compared to that in the unexposed cohort. Relative and absolute measures of effect can then be calculated.
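Those relative and absolute measures can be sketched in Python. The risk ratio, an approximate confidence interval for it, and the risk difference are computed below from hypothetical cumulative-incidence counts; this is a simple illustration, not a substitute for a proper person-time analysis.

```python
from math import exp, log, sqrt

def cohort_effects(cases_exp, n_exp, cases_unexp, n_unexp, z=1.96):
    """Risk ratio (with approximate 95% CI) and risk difference
    from cumulative-incidence data in a cohort study."""
    r1, r0 = cases_exp / n_exp, cases_unexp / n_unexp
    rr = r1 / r0                     # relative measure of effect
    rd = r1 - r0                     # absolute measure of effect
    se_log_rr = sqrt((1 - r1) / cases_exp + (1 - r0) / cases_unexp)
    ci = (exp(log(rr) - z * se_log_rr), exp(log(rr) + z * se_log_rr))
    return rr, ci, rd

# Hypothetical cohort: 20/200 exposed vs 10/200 unexposed develop disease
rr, (lo, hi), rd = cohort_effects(20, 200, 10, 200)
```

With these small numbers the risk is doubled yet the confidence interval still crosses 1, illustrating why cohort studies of uncommon outcomes need large samples or long follow-up.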
4.1.5 Randomised Controlled Trials RCTs (Fig. 4.2) are most frequently used (when possible) to investigate the effectiveness of medical interventions [10]. They are the strongest design to investigate causality between an intervention and outcome because randomly allocating sufficient patients to two or more treatments should eliminate both selection bias and confounding when comparing outcomes [11]. Selection bias and confounding are explained later, but the principle of the RCT is that the subjects in the randomised groups should be as similar as possible. The main argument for randomisation is that it is impossible to measure all the potential confounding variables that may affect outcome. If one could predict outcome very accurately, a longitudinal study would be a satisfactory design. If an RCT is to influence clinical practice, it must address an area of clinical uncertainty. If there is a consensus that a treatment is effective, then there is little point in conducting a trial without some other good reasons. The more common the dilemma, the more important and relevant becomes an RCT. It is important that we recognise areas of clinical uncertainty in
our work in order to inform the design of future RCTs. Clinical uncertainty is also related to the ethical justification for randomisation. If a clinician is uncertain about the most effective treatment, then randomisation becomes an ethical option or even an ethical requirement. It is therefore important that RCTs address the important clinical dilemmas. Subjects must be allocated to the treatments in an unbiased way. This is done by concealing the process of randomisation, so that the person who has assessed the patient cannot interfere with the randomisation. The concealment of randomisation is an important aspect of RCT methodology and has been used as a proxy for the quality of an RCT [12]. The validity of the comparison between the randomised groups in an RCT depends critically on ensuring that the measurement of outcome is not affected by the allocation of treatment. This is usually done by disguising the random allocation from the person making the assessment or “blinding” the person as to the allocation. A double-blind study refers to one in which both the patient and assessor are blind. A triple-blind study refers to those in which the person analysing the data is also unaware of the treatment allocation.
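One common way to prepare a concealed allocation sequence is a permuted-block list, sketched below in Python. This is an illustrative scheme, not the method of any particular trial; in practice the list (or its seed) is held by a trials office or an independent randomisation service, not by the recruiting clinician.

```python
import random

def block_randomisation_list(n_blocks, block=("A", "A", "B", "B"), seed=42):
    """Permuted-block allocation list: within every block of four,
    two patients go to arm A and two to arm B, in shuffled order.
    Keeping the seed and list away from recruiters preserves
    allocation concealment."""
    rng = random.Random(seed)
    allocations = []
    for _ in range(n_blocks):
        b = list(block)
        rng.shuffle(b)               # random order within the block
        allocations.extend(b)
    return allocations

schedule = block_randomisation_list(25)   # 100 patients in blocks of 4
```

Blocking keeps the arms balanced throughout recruitment; a small random block size is often preferred in unblinded trials so that the end of a block cannot be predicted.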
[Fig. 4.2 Design of an RCT: source population → selection criteria → eligible population (non-participants excluded) → consent to randomisation (those who did not consent excluded) → random allocation to treatment or control → in each arm, outcome known or outcome unknown]
One of the main difficulties in interpreting the results of an RCT concerns the influence of subjects withdrawing from treatment or from follow-up. As subjects drop out of an RCT, the treatment groups depart from the balanced groups created at randomisation. If the drop-outs are substantial in number, then there is a possibility that confounding is reintroduced. Even more importantly, since non-compliers usually tend to be those subjects at a higher risk of adverse health outcomes, there is a risk of bias creeping in, especially if there is differential drop-out between the groups. It is therefore important to minimise the non-compliance and loss-to-follow-up rates. The main way in which this problem is circumvented is by the use of an intention-to-treat strategy in the analysis, in which all the randomised subjects are included irrespective of whether they continued with the treatment or not. If there are missing follow-up data, one can carry forward data from a previous time-point, or assume a poor outcome for drop-outs. There are also more complex ways of substituting values for missing data that rely upon multivariate methods. An intention-to-treat strategy ensures that all the randomised individuals are used in the analysis. In this way, the benefits of randomisation are maintained and the maximum number of subjects can be included in the analysis. Using an intention-to-treat analysis is one of the characteristics of pragmatic trials [10]. They aim to study the long-term consequences of one clinical decision, e.g. to prescribe the treatment or not, and to follow best clinical practice after that. The treatment effect may be less (i.e. the effect is diluted) than in the ideal case of 100% compliance, but it gives a far more realistic estimate of the treatment effect.
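The difference between analysing by randomised arm and by treatment received can be sketched with toy data. All records below are hypothetical; the point is only that crossovers move between groups in an as-treated analysis but stay in their randomised arm under intention to treat.

```python
# Hypothetical trial records: (arm randomised to, treatment received, good outcome?)
records = [
    ("new", "new", True), ("new", "new", True), ("new", "new", False),
    ("new", "standard", False),            # crossed over, did badly
    ("standard", "standard", True), ("standard", "standard", False),
    ("standard", "standard", False),
    ("standard", "new", True),             # crossed over, did well
]

def success_rate(rows, arm, by_received=False):
    """Outcome rate for one arm, grouping either by randomised arm
    (intention to treat) or by treatment actually received (as treated)."""
    key = 1 if by_received else 0
    group = [r for r in rows if r[key] == arm]
    return sum(r[2] for r in group) / len(group)

itt_new = success_rate(records, "new")                # analysed as randomised
as_treated_new = success_rate(records, "new", True)   # analysed as received
```

Here the as-treated estimate looks better than the intention-to-treat one purely because of who crossed over, which is exactly the bias the intention-to-treat strategy guards against.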
There is an ongoing debate between those who argue that randomisation is the only safe, unbiased means of assessing new interventions, and those who view randomisation as a narrow methodology of limited usefulness except for assessing drug treatments [13]. There are three sets of arguments as follows: 1. External validity: RCTs might lead to findings that overestimate treatment effects or do not have relevance to the settings that interest clinicians the most. 2. Feasibility: Sometimes it is impossible to mount RCTs for practical reasons. For example, an RCT of suicide prevention would need to randomise tens of thousands of people.
[Fig. 4.3 RCTs may have limitations in external validity, which imposes difficulties on the application of their findings in real clinical situations. Of the subjects (s) with the potential to benefit, some are lost through centre/doctor non-participation (d) (not invited, or centre/practitioner preference), some are not invited to participate (i) (administrative oversight or practitioner preference), some are ineligible (e), and some are lost through patient non-participation (p) (preference for a specified treatment or aversion to research); only the remainder are randomised to intervention A or intervention B. Graph taken from McKee et al. [14]]
3. Rarity: The number of well-conducted RCTs of sufficient size to draw conclusions will always be limited. There are going to be many clinically relevant issues that will not be addressed with RCTs.
Perhaps the main criticism is the limited external validity or generalisability [13, 14]. RCTs are strong on internal validity, i.e. drawing conclusions about the effectiveness of the treatment used in that particular setting, on those patients. However, clinicians are also, if not primarily, interested in the external validity of a trial. The key question is: "Do the results apply to the circumstances in which the clinician works?" There are probably three main reasons why this can be a problem:
1. The professionals: The doctors and other professionals involved in trials are atypical, often with a special interest and expertise in the problem.
2. The patients: It is often difficult to recruit subjects to RCTs and the group of patients included is often very unrepresentative of the group eligible for treatment. This difficulty is often exacerbated by the investigators choosing a large number of "exclusion criteria" (Fig. 4.3).
3. The intervention: Many studies are carried out in prominent services, perhaps with dedicated research funds providing additional services. It is often difficult to know how effective a similar intervention would be if it were applied to other services, either in the country of the study or elsewhere in the world.
Pragmatic RCTs are designed to answer clinically relevant questions in relevant settings and on representative groups of patients [10]. One of the priorities
of pragmatic trials is to ensure external as well as internal validity. Choosing clinically relevant comparisons is also essential and pragmatic trials are designed to reduce clinical uncertainty. In assessing a pragmatic trial, one should consider the representativeness and relevance of the following: (a) the patients in relation to the intended clinical setting, (b) the clinical setting, (c) the intervention(s) and (d) the comparisons. Economic assessment is often an important aspect of pragmatic trials. Clinicians, patients, and commissioners also need to know how much an intervention costs as well as whether it works. There will always be limitations on resources available for health care, and economics should help to make judgements on the best place to invest. This has to be done in conjunction with knowledge about the size of treatment effect. Trials also need to examine outcomes concerned with the “quality of life” of the subjects in addition to clinical outcomes. These measures should assess whether subjects are working, pursuing their leisure activities or require additional support.
4.1.6 Systematic Reviews and Meta-Analyses
Secondary research aims to summarise and draw conclusions from all the known primary studies on a particular topic (i.e. those which report results at first hand) [5]. Systematic reviews apply the same scientific principles used in primary research to reviewing the literature. In contrast, a more traditional or narrative review relies upon an expert to remember the relevant literature and to extract and summarise the data he or she thinks important. Systematic reviews ensure that all the studies are identified using a comprehensive method and that data are extracted from the studies in a standardised way. Meta-analysis provides a summary estimate of the results of the studies that are identified using a systematic review. It enables the results of similar studies to be summarised as a single overall effect with confidence intervals using formal statistical techniques. The main advantage of these studies is the resulting increase in the combined sample size (Table 4.2).

Table 4.2 Advantages and disadvantages of secondary research
Advantages:
- All evidence is used to assess an intervention
- Increased statistical power
- Can investigate heterogeneity and test generalisability
Disadvantages:
- Publication and citation bias
- Limited by the quality of the primary studies
- Pooling disparate studies may be invalid (but such heterogeneity can be investigated)

A problem of secondary research is the presence of publication bias, i.e. small studies with negative results are less likely to be published. Therefore, ideally, one should attempt a comprehensive search strategy that includes not only published results, but also those reported in abstracts, personal communications, and the like. Systematic reviews have mostly been used to summarise the results from RCTs (see the Cochrane Collaboration below), but the same arguments apply to reviewing observational studies. A central issue in secondary research is heterogeneity [15]. This term describes the variability or differences between studies in terms of clinical characteristics (clinical heterogeneity), methods and techniques (methodological heterogeneity) and effects (heterogeneity of results). Statistical tests of heterogeneity may be used to assess whether the observed variability in study results (effect sizes) is greater than would be expected by chance. Heterogeneity may arise when the populations in the various studies have different characteristics, when the delivery of the interventions is variable or when studies of different designs and quality are included in the review. Interpreting heterogeneity can be complex, but clinicians are often interested in heterogeneity in order to inform clinical decision-making [16]. For example, clinicians want to know if a particular group of patients responds well to a particular treatment. Meta-analysis has also been criticised for attempting to summarise studies with diverse characteristics; investigating heterogeneity can also be used to address such concerns. The use of systematic reviews for the assessment of the effectiveness of healthcare interventions is largely promoted by the Cochrane Collaboration. Archie Cochrane, a British epidemiologist who was based in Cardiff for much of his working life, recognised that people who
want to make more informed decisions about health care do not have ready access to reliable reviews of the available evidence. Cochrane emphasised that reviews of research evidence must be prepared systematically and they must be kept up-to-date to take account of new evidence. In 1993, seventy-seven people from eleven countries co-founded “The Cochrane Collaboration”. The Cochrane Collaboration aims to systematically review all the RCTs carried out in medicine since 1948. They are also committed to update the reviews as new evidence emerges. The mission statement of the Cochrane Collaboration is “Preparing, maintaining and promoting the accessibility of systematic reviews of the effects of healthcare interventions”. The Cochrane Collaboration’s website is www.cochrane.org. This has links to the Cochrane library which contains the Cochrane database of systematic reviews and the Cochrane-controlled trials register.
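The pooling that a meta-analysis performs can be sketched in a few lines. This Python snippet implements simple fixed-effect inverse-variance pooling with Cochran's Q and the I² heterogeneity statistic; the two study estimates are hypothetical log odds ratios.

```python
from math import sqrt

def fixed_effect_meta(estimates, variances):
    """Inverse-variance fixed-effect pooling of study effect sizes
    (e.g. log odds ratios), with Cochran's Q and I-squared as
    measures of between-study heterogeneity."""
    weights = [1 / v for v in variances]           # precise studies weigh more
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    se = sqrt(1 / sum(weights))                    # SE of the pooled estimate
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    df = len(estimates) - 1
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, se, q, i_squared

# Two hypothetical studies reporting log odds ratios with their variances
pooled, se, q, i2 = fixed_effect_meta([0.5, 0.7], [0.04, 0.09])
```

The pooled standard error is smaller than that of either study alone, which is the "increased statistical power" advantage noted above; a large I² would instead argue for investigating the heterogeneity or using a random-effects model.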
4.2 The Basics of Statistical Analysis

4.2.1 The Study Population
The study population is the set of subjects about whom we wish to learn. It is usually impossible to study the whole population, so instead we look in detail at a subset, or sample, of the population. Ideally, we choose the sample at random so that it is representative of the whole study population. Our findings from the sample can then be extrapolated to the whole study population. Suppose that two random samples are selected from a large study population. They will almost certainly contain different subjects with different characteristics. For example, if two samples of 100 are chosen at random from a population with equal numbers of males and females, one may contain 55 females and the other 44 females. This does not mean that either sample is "wrong". The randomness involved in sample selection has introduced an inaccuracy in our measurement of the study population characteristic. This is called sampling variation. Our aim is to extrapolate and draw conclusions about the study population using findings from the sample. Most statistical tests therefore try to infer something about the study population by taking account of, and estimating, the sampling variation.
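Sampling variation is easy to demonstrate by simulation. The Python sketch below draws repeated samples of 100 from a notional population with equal numbers of males and females; the female counts scatter around 50, just as described above.

```python
import random

def sample_female_counts(n_samples=5, sample_size=100, seed=1):
    """Draw repeated random samples from a population with equal
    numbers of males and females and count the females in each.
    The spread of the counts around 50 is sampling variation."""
    rng = random.Random(seed)      # fixed seed so the sketch is reproducible
    return [sum(rng.random() < 0.5 for _ in range(sample_size))
            for _ in range(n_samples)]

counts = sample_female_counts()
```

None of the resulting samples is "wrong"; the counts differ simply because each draw is random, which is exactly the variability a statistical test must account for.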
4.2.2 Hypothesis Testing
Studies are usually designed to answer a clinical question, such as: "Is there any difference between two methods for treating coronary heart disease?" In hypothesis testing we formulate this question as a choice between two statistical hypotheses, the null and alternative hypotheses. The null hypothesis, H0, represents a situation of no difference, no change or equality, while the alternative hypothesis, H1, specifies that there is a difference or change. So we might have:
H0: there is no difference between the two treatments for CHD
H1: there is a difference between the treatments
We have to decide which we think is true, and this decision is usually based on the P-value. The P-value is essentially the probability of obtaining results at least as extreme as those observed if the null hypothesis is true. A small P-value, such as 0.05, means it is unlikely that such a result would be obtained by chance, and this offers evidence against H0, while a large P-value, such as 0.5, tends broadly to support H0. How small should the P-value be to reject H0? Traditionally, the critical level has been set at 0.05, or 5%. If P < 0.05 is the criterion for rejecting H0, then we say the result is significant at the 5% level. Other levels can be used, but this is the most common. If the P-value exceeds 0.05, the decision is that we do not reject H0, rather than that we accept H0: it is difficult to prove that there is absolutely no difference, and we simply conclude that we cannot show there is one. The P-value is often incorrectly interpreted as the probability that the null hypothesis is true. This is not correct; the P-value simply quantifies the evidence against the null hypothesis.
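As an illustration, a two-proportion z-test can be computed directly. The figures are hypothetical; the snippet simply shows how a test statistic and its two-sided P-value arise from the data under H0.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test of H0: the two groups share one success
    probability (e.g. improvement rates under two CHD treatments)."""
    p1, p2 = x1 / n1, x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)          # rate assuming H0 is true
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided
    return z, p_value

# Hypothetical trial: 60/100 vs 45/100 patients improved
z, p = two_proportion_z_test(60, 100, 45, 100)
```

With these numbers the P-value falls below 0.05, so by the conventional criterion H0 would be rejected; it does not tell us the probability that H0 is true, only how surprising the data would be if it were.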
4.2.3 Type 1 and Type 2 Errors
There are two types of wrong decision that can be made when a hypothesis test is performed. A Type I error occurs when the null hypothesis H0 is true but is rejected. When the null hypothesis is true, 5% of tests will nevertheless be significant at the 5% level purely by chance, and carrying out repeated tests therefore increases the chance of a Type I error. A Type II error occurs when the null hypothesis H0 is false but is not rejected. For example, a small study may record a non-significant P-value despite large true differences in the study population. Type II errors need to be considered in all "negative" studies. Confidence intervals help to interpret negative findings (see below).
4.2.4 Statistical Power
The statistical power of a study is the probability of finding a statistically significant result, assuming the study population has a difference of a specified magnitude. It is the probability of not making a Type II error. The power depends upon the following:
• The level of statistical significance, usually 5%
• The size of effect assumed in the study population
• The sample size
Calculating the power of a study is useful at the planning stage, and the calculation depends critically on the size of effect one wishes to detect. When designing studies, the power is often set to 80% in order to determine the sample size. The figure of 80% is arbitrary, much like the 5% significance level.
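The relationship between sample size and power can be sketched as follows. This Python snippet uses the normal approximation for a two-sided comparison of two proportions; the 5% vs 10% event rates and the group sizes are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def power_two_proportions(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-proportion comparison
    with n subjects per group, using the normal approximation."""
    nd = NormalDist()
    se = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_group)
    z_alpha = nd.inv_cdf(1 - alpha / 2)       # critical value, about 1.96
    return nd.cdf(abs(p1 - p2) / se - z_alpha)

# Hypothetical: 5% vs 10% event rates with 432 patients per arm
power = power_two_proportions(0.05, 0.10, 432)
```

With these assumptions the power comes out near the conventional 80%; shrinking the groups drags it well below that, which is why small "negative" trials are so hard to interpret.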
4.2.5 Interpreting "Statistically Significant" Results

By chance, 5% (one out of every 20) of statistical tests will be statistically significant at the 5% level. The 5% significance level is, of course, fairly arbitrary: there is no real difference in interpretation between P-values of 0.04 and 0.06. If a study reports twenty P-values, one would be expected to be "significant" by chance. Repeated tests increase the chance of Type I errors.
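The one-in-twenty figure can be checked by simulating many "null" comparisons, i.e. comparing two samples drawn from the same population. This sketch uses a simple z-test on normally distributed data; the sample size and number of simulations are arbitrary choices.

```python
import random

def null_trial(rng, n=50):
    """Compare two samples drawn from the SAME population with a z-test."""
    a = [rng.gauss(0, 1) for _ in range(n)]
    b = [rng.gauss(0, 1) for _ in range(n)]
    mean_a, mean_b = sum(a) / n, sum(b) / n
    var_a = sum((x - mean_a) ** 2 for x in a) / (n - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (n - 1)
    se = ((var_a + var_b) / n) ** 0.5
    return abs(mean_a - mean_b) / se > 1.96   # "significant" at the 5% level

rng = random.Random(42)
false_positives = sum(null_trial(rng) for _ in range(2000)) / 2000
print(false_positives)   # close to 0.05: about 1 test in 20 is significant by chance
```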
P. Skapinakis and T. Athanasiou
4 Study Design, Statistical Inference and Literature Search in Surgical Research

4.2.6 Confidence Intervals

We know that our sample estimate from a study (e.g. a proportion, a mean value or an OR) is subject to sampling error. We are primarily interested in the size of the effect, so we also need to know how accurately we are estimating it. Confidence intervals are based on an estimate of the size of the effect, together with a measure of the uncertainty associated with that estimate. The standard error (SE) tells us how precisely our sample value estimates the true population value: if we took many samples of similar size from the same population, the SE would be the standard deviation of the sample means. Increasing the sample size decreases the SE, as we estimate the population value with more accuracy. A 95% confidence interval is constructed so that in 95% of cases it will contain the true value of the effect size. It is calculated as:

95% CI = estimated value ± (1.96 × SE)

We can use different levels of confidence if we wish. Based on the same information, a 99% confidence interval will be wider than a 95% one, since we make a stronger statement without any more data; similarly, a 90% interval will be narrower. In recent years it has been generally agreed that results should be summarised by confidence intervals rather than P-values, although ideally both will be reported. The P-value gives no indication of the likely range of values of the effect size, whereas the confidence interval does.

4.2.7 Interpreting "Negative" Results

When a trial gives a "negative" result, in other words no statistically significant difference, it is important to consider the confidence interval around the result. We must remember that a study estimates the result with some inaccuracy; the confidence interval gives the range within which we are 95% confident the "true" value lies. Small negative trials will usually have confidence intervals that include differences corresponding to potentially important treatment effects. One way of thinking about results is that they exclude unlikely values: the confidence interval gives the range of likely values, and an effect size outside the confidence interval is unlikely.

4.2.8 Correlation and Regression

Linear regression allows the relationship between two continuous variables to be studied. The regression line is the straight line that fits the data best. The correlation coefficient varies between −1 and 1; the further it departs from 0, the more of the variation is explained by the regression line. A negative correlation coefficient arises when the value of one variable goes down as the other goes up. Each observation can be thought of as a "predicted" value that would lie on the regression line, plus a "residual", the difference between the observed and predicted values (Fig. 4.4). The total variance is therefore the predicted variance added to the residual variance, and the square of the correlation coefficient is the predicted variance divided by the total variance. If all the points lie on the line, the correlation coefficient is 1 (or −1 for a perfect negative relationship). The slope of the line is sometimes called the regression coefficient; it gives the increase in the mean value of y for an increase of one unit in x.

Fig. 4.4 The regression line: each observation is split into a predicted value lying on the line and a residual, the difference between the observed and predicted values
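The least-squares fit and the variance decomposition described above can be sketched in a few lines; the x and y data points here are invented for illustration.

```python
def linear_regression(xs, ys):
    """Least-squares slope, intercept and correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r

# Hypothetical data: dose (x) against response (y).
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
slope, intercept, r = linear_regression(xs, ys)

# r**2 equals the predicted ("explained") variance over the total variance:
predicted = [intercept + slope * x for x in xs]
mean_y = sum(ys) / len(ys)
explained = sum((p - mean_y) ** 2 for p in predicted)
total = sum((y - mean_y) ** 2 for y in ys)
print(round(r ** 2, 4), round(explained / total, 4))   # the two values agree
```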
4.3 Causal Inference

4.3.1 What Is a Cause?

One of the principal aims of epidemiology is to investigate the causes of health-related states and events in specified populations. In general, we are interested in finding the causes of disease because we want to be able to intervene to prevent disease occurring. But what is a cause? It is useful to recall that our ideas about causes are not static and universal, and that concepts of disease causation have changed dramatically over time. In the past, several conceptual models of causation were developed, including the miasma theory in the early nineteenth century (all diseases were due to bad air) and the germ theory in the second half of the nineteenth century and first half of the twentieth century (diseases were caused by single agents operating at the level of the individual). Robert Koch, one of the key figures of the germ theory, proposed in 1880 some general criteria for causality (the Henle-Koch postulates). According to these, a particular agent can be considered the cause of a disease if it is both necessary ("The agent must be shown to be present in every case of disease in question") and specific ("The agent must not be found in cases of any other disease"). In the second half of the twentieth century, ideas about causation began to change for a number of reasons. First, it was realized that recognized pathogens such as the tubercle bacillus could be carried for long periods of time without causing disease (i.e. the bacillus was not sufficient to cause disease). Second, there was a shift of attention from infectious diseases to heart disease and cancer, where various factors were related to risk but none was absolutely necessary. As the "new" chronic diseases did not appear to have a single specific cause, epidemiologists became interested in how multiple factors could interact to produce disease. This led to the development of the multifactorial paradigm, which is the dominant theory of causation in contemporary epidemiology.
4.3.2 The Multi-Factorial Model of Causation

In the multi-factorial model, a cause of a specific disease is any factor that plays an essential role in the occurrence of the disease. A cause can be either an active agent or a static condition. Within this framework, a single cause is not sufficient to produce disease; rather, a number of different causal factors act together to cause each occurrence of a specific disease. Rothman has elaborated a model of component causes that attempts to accommodate the multiplicity of factors that contribute to an outcome (Fig. 4.5). In this model, a sufficient cause is represented by a complete circle (a "causal pie"), the segments of which represent component causes. When all of the component causes are present, the sufficient cause is complete and the outcome occurs. As shown in the figure, there may be more than one sufficient cause (i.e. circle) of the
Fig. 4.5 Rothman's model of sufficient and component causes. Each sufficient cause is drawn as a complete circle (a "causal pie") whose segments are component causes (labelled A, B, C and E, with U standing for unknown causes); the segments other than A make up the causal complement of A

Fig. 4.6 Interpreting an association
outcome, so that the outcome can occur through multiple pathways. A component cause that is a part of every sufficient cause is a necessary cause. If the causal complement of a factor is absent in a particular population, then the factor will not cause disease in that population. If every individual in a population is exposed to the causal complement of a factor, then exposure to that factor will always produce disease. The strength of a factor’s effect on the occurrence of a disease in a population therefore depends on the prevalence of its causal complement in that population. Because of this, a particular factor may be an important cause of a specific disease in one population, but may not cause any of the disease in another.
4.3.3 Evaluating Causality in the Multi-Factorial Model

In contemporary research, we spend a lot of time looking for associations between exposures and outcomes. If we find an association between an exposure and a disease, how can we judge whether this relationship is causal? An association between an exposure and a disease can be explained by five alternative interpretations (Fig. 4.6):

• Chance
• Bias
• Confounding
• Reverse causality
• Causation
It is important to emphasise that all study designs, including the RCT, are concerned with causal inference. In an RCT, we are interested in whether the treatment allocation “causes” an increased rate of recovery.
(The flowchart in Fig. 4.6 asks in turn: Is it due to systematic bias? Could it be due to confounding? Could it be a result of chance? If the answer to each is no, apply the positive criteria of causality.)
Chance: Significance testing assesses the probability that chance alone can explain the findings. Calculating confidence intervals gives an estimate of the precision with which an association is measured. A Type I error occurs when a statistically significant result occurs by chance. It is a particular problem when many statistical tests are conducted within a single study in the absence of a clearly stated prior hypothesis. A Type II error occurs when a clinically important result is obscured by chance or random error, often made more likely by an inadequate sample size. For "negative" findings, the confidence interval gives a range of plausible values for the association.

Bias: Systematic error or bias can distort an association in any direction, either increasing or decreasing it. No study is entirely free of bias, but attention to the design and execution of a study should minimise its sources. There are two main types of bias in research studies: selection bias and information (measurement) bias. Selection bias results from systematic differences in the characteristics between those who are selected for a study and those who are not. Any observed association between the exposure and the outcome may therefore not be real, but may be due to the procedure used to select the participants. Information or measurement bias refers to errors that result from inaccurate measurement of the exposure,
the outcome or both. For example, it is well known that retrospective assessment of one's own fat intake may be inaccurate and could introduce an information bias in either direction in a study aiming to find whether there is an association between fat intake and colon cancer. If subjects with colon cancer were more likely to overestimate their daily fat intake relative to controls, this would increase the chances of finding a statistically significant association. If, on the other hand, controls were more likely to underestimate their fat intake, this would reduce the possibility of finding a significant association.

Confounding: Confounding occurs when an estimate of the association between an exposure and a disease is an artefact because a third, confounding, variable is associated with both the exposure and the disease (see Fig. 4.7 for an example). All observational studies are susceptible to confounding, so always think of potential confounders when interpreting studies. Even in an RCT there can be an imbalance (by chance) in important confounders between the allocated interventions.

Fig. 4.7 Smoking is a confounder for the association between coffee consumption and cancer of the pancreas because it is associated with both the exposure (coffee) and the outcome (cancer)

Table 4.3 Causality criteria: the Bradford Hill criteria
• Temporality (the exposure occurs before the outcome)
• Strength of the association (strong associations are more likely causal)
• Consistency (same results with different methods)
• Dose–response relationship
• Specificity of the association
• Biological plausibility
• Coherence (no conflicts with current knowledge)
• Experimental evidence
• Analogy (similar factors cause similar diseases)

Reverse causality: This is the possibility that the exposure is the result rather than the cause of the disease. For example, is marital discord a risk factor for depression, or is depression causing marital discord? This issue is more important in case–control studies and cross-sectional surveys that assess exposure after the onset of disease. Cohort studies usually eliminate this possibility by selecting people without the disease at the beginning of the study. RCTs also select people at the beginning of the trial and then follow them to examine the outcome. However, reverse causality can remain a problem for some conditions where the timing of the onset of disease remains a matter of debate.

Causation: An association may indicate that the exposure causes the disease. Trying to infer causation is a difficult task. It is usually helpful to review the epidemiological literature to decide whether there is a consistent finding, irrespective of the population or
study design. When there is a strong association, the likelihood that the relationship is causal is increased. For example, for relative risks over 3 or 4, confounding and bias have to be quite marked to explain the findings. However, there is generally little confidence in findings when the relative risk is 1.5 or below. A dose–response relationship can also provide additional evidence for causality, depending upon the hypothesised mechanism of action. For example, one would expect that more severe obstetric complications would lead to higher rates of schizophrenia than milder forms if those were causal agents. Finally, the scientific plausibility of the findings has to be considered.
4.3.4 Bradford Hill's Criteria for Causality

In 1965, the epidemiologist Sir Austin Bradford Hill proposed a set of criteria to help evaluate whether an observed association between an exposure and an outcome is likely to be causal (Table 4.3). These criteria help us to judge both whether an association is valid and whether it is consistent with existing knowledge. None of the criteria alone can prove that a relationship is causal; used together, however, they help us to make an overall judgement about whether a causal relationship is likely. As these criteria are still used today, we will now look at them in some detail:

1. Strength of the association – The stronger an association, the less likely it is merely to reflect the influence of some other aetiological factor(s).
2. Consistency – Replication of the findings by different investigators, at different times, in different places and with different methods, together with the ability to explain any differing results convincingly.
3. Specificity of the association – There is an inherent relationship between specificity and strength, in the sense that the more accurately defined the disease and exposure, the stronger the observed relationship should be. But the fact that one agent contributes to multiple diseases is not evidence against its role in any one disease.
4. Temporality – Does the exposure precede the disease, or is reverse causality possible?
5. Biological gradient – Results are more convincing if the risk of disease increases with the level of exposure.
6. Plausibility – We are much readier to accept that a specific exposure is a risk factor for a disease if this is consistent with our general knowledge and beliefs. Obviously this tendency has pitfalls.
7. Coherence – How well do all the observations fit with the hypothesized model to form a coherent picture?
8. Experiment – The demonstration that, under controlled conditions, changing the exposure causes a change in the outcome is of great value for inferring causality.
9. Analogy – Have similar associations between similar exposures and other diseases been shown? We are readier to accept arguments that resemble others we already accept.
4.4 Clinical Importance of the Results: Types of Health Outcomes and Measures of the Effect

4.4.1 Health Outcomes

We are interested in the changes (referred to as outcomes) among the research subjects that are associated with exposure to risk factors or with therapeutic or preventive interventions. There are two main types of outcomes (Table 4.4) [17]: (a) biological or psychosocial parameters not directly related to disease (for example, cholesterol values or scores on a scale measuring social support) and (b) clinical outcomes directly related to disease.
Table 4.4 Outcomes in the course of a disease. Adapted from Muir Gray et al. [9]
• Death
• Disability
• Disease status
• Dissatisfaction with process of care
• Discomfort about the effects of disease
Non-clinical outcomes can only be viewed as surrogates for the clinical outcomes of interest and cannot be used directly to change clinical practice unless there is a clear causal association between the surrogate and a clinical outcome. Clinicians are thus more interested in research papers that have used relevant clinical outcomes. Outcomes in the course of disease include death, disease status, discomfort from symptoms, disability and dissatisfaction with the process of care; these can easily be memorized as the five Ds of health outcomes. In establishing the clinical importance of a study, one should always check that the outcome is relevant.
4.4.2 Clinical Importance

A study may be methodologically valid, with an outcome of interest to clinicians, but still not be clinically relevant because, for example, the effect of treatment is negligible. A new antihypertensive drug that lowers systolic blood pressure by 5% compared to routine treatment is probably not clinically significant, in the sense that it has no implications for patient care. There are two broad categories of measures of effect: relative measures (e.g. relative risk, OR) and absolute measures (e.g. attributable risk) (Table 4.5). In the clinical context we are more interested in absolute measures, because relative measures cannot discriminate large treatment effects from small ones. For example, in a clinical trial in which 90% of the placebo group developed the disease compared to 30% of the treatment group, the relative risk reduction would be (90 − 30)/90 = 67% and the absolute risk reduction (ARR) 90% − 30% = 60%, a clinically important result. However, in a trial with a 9% control event rate vs. a 6% experimental event rate, the relative risk reduction is still a substantial 33% but the ARR is only 3%, a figure not important from the clinical
Table 4.5 Measures of effect

Effect measures for binary data
Absolute:
• Absolute risk reduction (ARR): the absolute difference in risk between the experimental and control groups
• Number needed to treat (NNT): the number of patients that need to be treated with the experimental therapy in order to prevent one of them developing the undesirable outcome; it is the inverse of the ARR
Relative:
• Odds: the number of events divided by the number of non-events in the sample
• Odds ratio (OR): the ratio of the odds of an event in the experimental group to the odds of an event in the control group
• Risk: the proportion of participants in a group who are observed to have an event
• Relative risk (RR): the ratio of the risk in the experimental group to the risk in the control group

Effect measures for continuous data
• Mean difference: the difference between the means of two groups
• Weighted mean difference: where studies have measured the outcome on the same scale, the weight given to the mean difference in each study is usually equal to the inverse of the variance
• Standardised mean difference: where studies have measured an outcome using different scales (e.g. pain may be measured in a variety of ways), the mean difference may be divided by an estimate of the within-group standard deviation to produce a standardised value without any units

Effect measure for survival data
• Hazard ratio: a summary of the difference between two survival curves, representing the overall reduction in the risk of death on treatment compared to control over the period of follow-up
perspective. In the following paragraphs we briefly present the basic measures of effect used in clinical research.

Attributable risk (risk difference, rate difference) is the absolute difference in risk between the experimental and control groups. A risk difference of zero indicates no difference between the two groups. For undesirable outcomes, a risk difference of less than zero indicates that the intervention was effective in reducing the risk of that outcome, which makes it useful for interpreting the results of intervention studies. The number needed to treat (NNT) is an alternative way of expressing the attributable risk between two groups of subjects. It has been promoted as the most intuitively appealing way of presenting the results of RCTs, and its use should be encouraged in interpreting the results of trials [2]. The NNT is the number of patients who need to be treated with the experimental therapy in order to prevent one of them from developing the undesirable outcome. It is calculated as the reciprocal of the absolute difference in risk (probability of recovery) between the groups. An NNT of 5 indicates that 5 patients need to be treated with treatment A rather than treatment B for one person to recover on treatment A who would not have recovered on treatment B.
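These absolute and relative measures can be sketched directly from their definitions, using the event rates quoted in this section (90% vs. 30%, 9% vs. 6%, and the 60% vs. 50% recovery example).

```python
def absolute_risk_reduction(control_rate, treated_rate):
    """ARR: absolute difference in event risk between the two groups."""
    return control_rate - treated_rate

def relative_risk_reduction(control_rate, treated_rate):
    """RRR: the ARR expressed relative to the control group's risk."""
    return (control_rate - treated_rate) / control_rate

def number_needed_to_treat(control_rate, treated_rate):
    """NNT: patients treated per one outcome changed (1 / ARR)."""
    return 1.0 / absolute_risk_reduction(control_rate, treated_rate)

# The two trials from the text: 90% vs. 30%, and 9% vs. 6% event rates.
print(relative_risk_reduction(0.90, 0.30))   # about 0.67: a 67% RRR
print(absolute_risk_reduction(0.90, 0.30))   # about 0.60: a 60% ARR
print(relative_risk_reduction(0.09, 0.06))   # about 0.33: a 33% RRR
print(absolute_risk_reduction(0.09, 0.06))   # about 0.03: only a 3% ARR

# The depression example: 60% recovery on antidepressant vs. 50% on placebo.
print(number_needed_to_treat(0.60, 0.50))    # about 10 patients per extra recovery
```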
The following example can help in understanding these measures. An RCT of depression finds a 60% recovery rate with an antidepressant and a 50% recovery rate on placebo after 6 weeks of treatment. The absolute risk difference is 10% (or 0.1 as a proportion). The NNT is 1/0.1 = 10. Therefore, if 10 patients were treated with the antidepressant, 1 would get better who would not have got better if treated with placebo. Another way of thinking of this is that if there were 10 patients in each group, 6 would get better on the antidepressant and 5 on the placebo.

Relative risk is a general and rather ambiguous term describing the family of estimates that rely on a ratio of the measures of effect for the two groups. Ratio measures are not the best way of summarising treatment trials, because there we are interested in the absolute change in risk rather than the relative change. They are more useful in interpreting possible causal associations, since they estimate the strength of the association between exposure and disease. The incidence rate ratio is a further "relative risk" measure, obtained when incidence rates are compared; it is commonly used in longitudinal or cohort studies. Epidemiologists often prefer to use odds rather than probability in assessing the risk of disease, for three main reasons:
1. The mathematics of manipulating ORs is easier.
2. The results of logistic regression can be expressed as ORs, so it is possible to present results before and after multivariate adjustment in terms of ORs.
3. ORs are the only valid way of analysing case–control studies when the data are categorical. The OR from a case–control study corresponds to the OR in the population in which the study was performed.

For rare outcomes, the OR, risk ratio and incidence rate ratio take approximately the same value. To illustrate the calculation of odds and ORs, the following table can be thought of as the results of either a cross-sectional survey, a cohort study or a case–control study. The odds of an event = number of events/number of non-events.

            Cases   Controls
Exposed       a        b
Unexposed     c        d
The OR = odds in the treated or exposed group/odds in the unexposed group. The odds of being a case in the exposed group are a/b. Similarly, in the unexposed group, the odds of being a case are c/d. The OR is therefore (a/b)/(c/d), which rearranges algebraically to (ad)/(bc). The OR is thus a "relative odds" and gives an estimate of the "aetiological force" of an exposure. If the OR is greater than 1, the exposure may be a risk factor for the disease. If the OR is less than 1, the exposure (often a treatment) may protect against the disease. If the OR is exactly equal to 1, there may be no association between exposure and disease.
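The 2 × 2 calculation can be sketched as follows; the cell counts are hypothetical.

```python
def odds_ratio(a, b, c, d):
    """OR from a 2x2 table: (a/b) / (c/d), which simplifies to (a*d)/(b*c).

    a = exposed cases,   b = exposed controls,
    c = unexposed cases, d = unexposed controls.
    """
    return (a * d) / (b * c)

# Hypothetical case-control counts: 40 exposed cases, 60 exposed controls,
# 20 unexposed cases, 80 unexposed controls.
or_value = odds_ratio(40, 60, 20, 80)
print(or_value)   # (40*80)/(60*20), well above 1: exposure may be a risk factor
```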
4.5 Searching the Biomedical Databases Efficiently

With so many papers published annually in medicine, it is important to know how to search efficiently for research papers that have the following two characteristics:

• They are relevant to our clinical or research question.
• They are of high quality.
Sometimes we need a few high-quality papers to answer a clinical question quickly. At other times we need to find all the available studies on a chosen topic. In both situations it helps to improve our skill in searching the biomedical databases, especially PubMed, which is provided free of charge by the US National Library of Medicine (http://www.ncbi.nlm.nih.gov/sites/entrez).
4.5.1 Structure of a Database

A database simply consists of fields (columns) and records (rows). The telephone directory is a familiar example: it includes fields such as surname, first name, address and telephone number, and each entry in the directory is one unique record that has a value for each of the fields. Each field may take only particular values; for example, in a research database the field "marital status" might take only the values single, married, divorced or widowed. Knowledge of the field structure of a database is essential if we want to search efficiently. We would never open the telephone directory to select all those who are married in a given area, simply because this field does not exist in that database. Similarly, if we search the research database above for all those who are "separated", we will get zero records, simply because the field marital status does not take that value. Knowing the exact field structure of a database (field names, descriptions of fields and typical values) is therefore essential for efficient searching.
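The field-and-record idea can be sketched with a toy database; the field names and values here are invented for illustration. Searching on a value a field never takes, or on a field that does not exist, returns nothing, which is exactly the behaviour described above.

```python
# A toy "database": a list of records (rows), each with the same fields (columns).
records = [
    {"surname": "Smith", "first_name": "Anna", "marital_status": "married"},
    {"surname": "Jones", "first_name": "Ben",  "marital_status": "single"},
    {"surname": "Brown", "first_name": "Cara", "marital_status": "married"},
]

def search(rows, field, value):
    """Return every record whose given field holds the given value."""
    return [row for row in rows if row.get(field) == value]

print(len(search(records, "marital_status", "married")))    # 2 records match
print(len(search(records, "marital_status", "separated")))  # 0: value never used
print(len(search(records, "salary", "high")))               # 0: no such field
```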
4.5.2 Structure of PubMed

Each record in the PubMed database is a unique research paper published in one of the thousands of journals indexed in the database. Clearly, not all journals are indexed by PubMed, but the important high-quality journals generally are. Most users of PubMed simply look at the title, authors or abstract of an indexed paper. However, these are only a few of the available fields provided by the
database. There is a very simple way to get a copy of the record that will provide us with more information on the structure of the database than the default display. To do this, we need to select from the upper-left dropdown menu the display option “MEDLINE” (Fig. 4.8). This will give us information on some of the field types and field values included in a particular record. The following example gives some of the field values
Fig. 4.8 Structure of MEDLINE on Pubmed
Fig. 4.9 Specific Paper Search on Pubmed
assigned to a particular paper indexed by Medline. By looking at this figure we can see that there is a field called “MH”, which takes the value “Aorta, Thoracic/*surgery” (Fig. 4.9). This field refers to the Medical Heading fields that we will discuss later. In order to get to know all the possible fields of Pubmed, we need to open the help file provided by the database on the left (Fig. 4.10).
Fig. 4.10 The PubMed menu
PubMed help includes a detailed description of each field and its associated values. Each field has a unique field tag, most often consisting of two capital letters (though there are exceptions), and it takes specific values. To search for records that contain a specific value of a field, we include the value of the field in inverted commas followed by the field tag in square brackets. For example, to search for papers on depression and retrieve only articles written in English, include the Language field as in the example below (Fig. 4.11). Apart from the obvious fields such as the Author field (AU), the Title/Abstract field (TIAB) or the Publication Date field (DP), there are some other important fields that we may need in order to increase our search efficiency. A list of the fields and their corresponding tags can be found in the figure below (Fig. 4.12); by clicking on them you can get the description of each field and the range of accepted values.

4.5.2.1 Some Important Fields

1. Medical subject headings (MeSH terms). Field tag: [MH]. According to PubMed help: "Medical Subject Headings are the controlled vocabulary of biomedical terms that have been used to describe the subject of each journal
Fig. 4.11 Example Pubmed Search
Fig. 4.12 Pubmed Field Description and Tags
article in MEDLINE." MeSH contains more than 23,000 terms and is updated annually to reflect changes in medicine and medical terminology (in other words, the field MH can take more than 23,000 values). "Skilled subject analysts examine journal articles and assign to each the most specific MeSH terms applicable – typically ten to twelve. Applying the MeSH vocabulary ensures that articles are uniformly indexed by subject, whatever the author's words." Given that papers indexed by MEDLINE have already been assessed by skilled subject analysts, using MeSH terms instead of simple text words (in titles and/or abstracts) is expected to increase both the sensitivity (the number of relevant papers retrieved divided by the total number of relevant papers in the database – the denominator is usually unknown and can
Fig. 4.13 Pubmed MESH database search
Fig. 4.14 Pubmed MESH database search for “depression”
only be roughly estimated) and the specificity (the number of relevant records retrieved divided by the total number of records retrieved – this is always known) of our search. The problem in using MeSH terms is that we may not know the exact MeSH term for the topic of interest. PubMed, however, allows us to search and browse the MeSH vocabulary. To do this, we select the MeSH database from the available databases in the menu on the left and search for a particular topic of interest. The MeSH database will give us suggestions for MeSH terms that may be related to our search (Fig. 4.13). For example, if we want to search for papers on depression, we can search the MeSH database for depression (Fig. 4.14). We learn that the subject analysts of MEDLINE assign the term "depressive disorder" in
papers that include information on depression, independently of whether the author used the term "depression" or some other term, e.g. sadness, melancholia or psychological distress. Searching PubMed for depression as a text word yields about 200,000 papers, while searching for the MeSH term "depressive disorder" (by typing "Depressive Disorder" [MH] in the PubMed search box) yields 50,000 papers. Thus, we have efficiently narrowed this very broad category by almost 150,000 papers!

2. Publication type. Field tag: [PT]. This field describes the type of material the article represents and includes values such as "Review", "Randomized Controlled Trial", "Meta-Analysis", "Journal Article" and "Letter". For example, to retrieve all randomized controlled trials included in the database, we can search for "Randomized Controlled Trial" [PT] (Fig. 4.15). This search returned 247,264 records at the time this text was written.

3. Citation status subset. Field tag: [SB]. The citation status indicates the processing stage of an article in the PubMed database. It may take the values "publisher", "in process" and "medline". Normally there is a delay between the date a paper appears in the database and the date it is assessed by the subject analysts. MeSH terms are only available for records that have been indexed for MEDLINE; records that are "supplied by the publisher" or are "in process" may not yet have been assigned MeSH terms, and quite often these will be the most recent papers. One has to take this into account when searching with MeSH terms. For this reason, it is good practice to supplement a MeSH search with a general keyword search of the records that are in process or supplied by the publisher.
Fig. 4.15 PubMed field tag [PT] search
P. Skapinakis and T. Athanasiou
For the randomized controlled trials we mentioned before, we could also search for “randomized controlled trial” AND (“publisher” [SB] OR “in process” [SB]). This search returned 59 records, but not all of them will be randomized controlled trials. When searching for more specific topics, one can search for the text word “randomized” or “random*” in the “publisher” or “in process” subsets to increase sensitivity.
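Such queries can be run outside the web interface through NCBI's E-utilities, which accept the same query syntax. As a hedged sketch, the snippet below only builds the request URL for the standard `esearch` endpoint (no network call is made, so actual record counts are not shown):

```python
from urllib.parse import urlencode

# Standard NCBI E-utilities esearch endpoint (eutils.ncbi.nlm.nih.gov).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(term: str, retmax: int = 20) -> str:
    """Build an esearch request URL for a PubMed query string."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)

# The combined search from the text: randomized trials in the
# not-yet-indexed subsets.
url = esearch_url('randomized AND ("publisher"[SB] OR "in process"[SB])')
```

Fetching this URL would return an XML list of matching PubMed IDs, which can then be passed to the companion `efetch` endpoint to retrieve the records themselves.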
4.5.2.2 Steps for an Efficient Search

Step 1: Start by stating your objectives clearly
You should be able to identify the main objective of your question. Think of what your ideal article would be – what question do you want answered? What would the design of the study be if you were to carry out your own study to answer this question?

Step 2: Choose the relevant MeSH terms or keywords
This will depend on the specific question you are interested in; a basic knowledge of study design and research methodology is assumed. For example, if you want to search for the best available treatment for depression after a major coronary heart event, it will be important to narrow your search down to randomized controlled trials only. If you need to search for papers on the prognosis of a condition, it will be better to focus on prospective studies only (by searching for the MeSH term “Prospective Studies” [MH]). In the case of a new diagnostic test, searching for the relevant MeSH terms (“Sensitivity and Specificity” [MH]) might help.

Step 3: Choose language, year span and any other limitations you wish to apply, such as publication type or age range
You should be careful when applying limitations, however, and should only do so if your first search had very low specificity.
4 Study Design, Statistical Inference and Literature Search in Surgical Research
Step 4: Identify a relevant paper from the search list and find out which MeSH terms have been assigned to it
This is a very efficient way to increase the sensitivity and specificity of your search. Sometimes it is difficult to think of all the relevant keywords or MeSH terms. However, we may come across a particular paper that is very relevant to our question (perhaps such a paper was the reason we started searching the literature on this topic in the first place). We can identify this paper in PubMed and look at the MeSH terms that have been assigned to it. Then we can include some of these MeSH terms in our search string.

Step 5: Search for papers that have cited a very important (preferably older) paper in the chosen topic
This is a very important step, especially if your first search did not yield enough relevant papers. In your
Fig. 4.16 (a, b) Using Google Scholar to find citations to a paper
topic of interest there may be a very important paper, the one that perhaps started the discussion some years ago. It is reasonable to hypothesise that many of the subsequent papers will have cited this first paper or treated it as an important one. Looking at who has cited this paper will give you very relevant papers to add to your list. Until recently, in order to find citations to a paper you had to use a service like the one provided by the ISI (Institute for Scientific Information), which is not free. Google, however, revolutionised this aspect by bringing us Google Scholar (scholar.google.com). If you search for a particular paper in Google Scholar, you can see a “Cited by” hyperlink below the title on the left (Fig. 4.16a, b). By clicking on that link you can get a list of all papers citing this particular paper. Certainly, this is a nice gift from Google to researchers!
Step 6: Search the list of references of the relevant papers
This is the last necessary step to make sure that we have retrieved all relevant papers for our topic. Very often we will find papers here that our search was unable to retrieve, no matter how sensitive it was.

Step 7: Use (with caution) the related-papers link in PubMed
You may find that clicking on the related-papers link in PubMed returns some useful records, but this technique is not as robust as steps 5 and 6 above. Therefore, use it with caution.

4.5.2.3 Some Last Tips

1. The use of Clinical Queries
PubMed provides some very useful search filters to help clinicians get high-quality results on a given topic. There are filters customized for high sensitivity or specificity on diagnosis, aetiology, prognosis and
Fig. 4.17 PubMed search using Clinical Queries
therapy (Fig. 4.17). You have the option to choose between a broader or a narrower search (higher or lower sensitivity). The broad filter used for therapy for example is the following: ( (clinical[Title/Abstract] AND trial[Title/Abstract]) OR clinical trials[MeSH Terms] OR clinical trial[Publication Type] OR random*[Title/Abstract] OR random allocation[MeSH Terms] OR therapeutic use[MeSH Subheading])
and the narrow filter is: (randomized controlled trial[Publication Type] OR (randomized[Title/Abstract] AND controlled[Title/Abstract] AND trial[Title/Abstract]) )
and, as you can see, it is based heavily on controlled trials, especially randomized controlled trials.

2. Last (but not least): read the manual and take the online tutorial
This is the easiest thing you can do to get familiar with PubMed, but in our experience, very few physicians have ever taken this tutorial or read the PubMed help manual. Spending some time to read this material or
take the online tutorials will save you many hours of aimless searching through the various databases. Knowing how to search the literature efficiently should not be considered a minor procedure. It is a serious task, and all doctors should learn how to do it early in their career.
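As an illustration of how the Clinical Queries filters quoted above combine with a subject search, the narrow therapy filter can simply be ANDed onto any topic query. A minimal sketch (the function name is ours, not PubMed terminology):

```python
# The narrow therapy filter quoted in the text.
NARROW_THERAPY = (
    "(randomized controlled trial[Publication Type] OR "
    "(randomized[Title/Abstract] AND controlled[Title/Abstract] "
    "AND trial[Title/Abstract]))"
)

def with_filter(topic: str, search_filter: str) -> str:
    """AND a topic query together with a Clinical Queries-style filter."""
    return f"({topic}) AND {search_filter}"

# Restrict the depression example to trial-based therapy papers:
query = with_filter('"Depressive Disorder"[MH]', NARROW_THERAPY)
```

The resulting string can be pasted straight into the PubMed search box; the Clinical Queries page performs essentially this composition behind the scenes.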
References

1. Last JM (1995) A dictionary of epidemiology, 3rd edn. Oxford University Press, New York
2. Sackett DL, Straus SE, Richardson WS et al (2000) Evidence based medicine: how to practice and teach EBM, 2nd edn. Churchill Livingstone, Edinburgh
3. MacMahon B, Trichopoulos D (1996) Epidemiology: principles and methods. Little Brown, Boston
4. Ward PR, Noyce PR, St Leger AS (2007) How equitable are GP practice prescribing rates for statins? An ecological study in four primary care trusts in North West England. Int J Equity Health 6:2
5. Lewis G, Churchill R, Hotopf M (1997) Systematic reviews and meta-analysis. Psychol Med 27:3–7
6. Skapinakis P, Lewis G, Meltzer H (2000) Clarifying the relationship between unexplained chronic fatigue and psychiatric morbidity: results from a community survey in Great Britain. Am J Psychiatry 157:1492–1498
7. Kendell RE, Juszczak E, Cole SK (1996) Obstetric complications and schizophrenia: a case control study based on standardised obstetric records. Br J Psychiatry 168:556–561
8. Kendell RE, McInneny K, Juszczak E et al (2000) Obstetric complications and schizophrenia. Two case-control studies based on structured obstetric records. Br J Psychiatry 176:516–522
9. Wilson PW, Abbott RD, Castelli WP (1988) High density lipoprotein cholesterol and mortality. The Framingham Heart Study. Arteriosclerosis 8:737–741
10. Hotopf M, Churchill R, Lewis G (1999) Pragmatic randomised controlled trials in psychiatry. Br J Psychiatry 175:217–223
11. Pocock SJ (1983) Clinical trials: a practical approach. Wiley, Chichester
12. Kunz R, Oxman AD (1998) The unpredictability paradox: review of empirical comparisons of randomised and non-randomised clinical trials. BMJ 317:1185–1190
13. Black N (1996) Why we need observational studies to evaluate the effectiveness of health care. BMJ 312:1215–1218
14. McKee M, Britton A, Black N et al (1999) Methods in health services research. Interpreting the evidence: choosing between randomised and non-randomised studies. BMJ 319:312–315
15. Thompson SG (1994) Why sources of heterogeneity in meta-analysis should be investigated. BMJ 309:1351–1355
16. Lau J, Ioannidis JP, Schmid CH (1998) Summing up evidence: one answer is not always enough. Lancet 351:123–127
17. Muir Gray JA (1997) Evidence-based health care: how to make health policy and management decisions. Churchill Livingstone, New York
5 Randomised Controlled Trials: What the Surgeon Needs to Know

Marcus Flather, Belinda Lees, and John Pepper
Contents

5.1 Introduction ............ 55
5.2 Current Concepts in Clinical Trials ............ 56
5.3 Basic Concepts of Trial Design ............ 56
5.4 What Do I Need to Know Before Starting a Clinical Trial? ............ 57
5.5 How Do I Evaluate Surgical Procedures? ............ 59
5.6 The Process of Surgical Evaluation ............ 59
5.7 What Do I Do When the Learning Curve Has Been Completed and the New Procedure Appears Feasible and Safe? ............ 60
5.8 Randomised Trials of Surgical Procedures ............ 60
5.9 Selection of Outcomes for Surgical Trials ............ 62
5.10 Practical Issues for Designing Randomised Trials in Surgery ............ 63
5.10.1 Eligibility Criteria ............ 63
5.10.2 Process of Randomisation ............ 63
5.10.3 Blinding of Treatment Allocations ............ 63
5.10.4 Sample Size and Enrolment Issues ............ 63
5.10.5 Costs of Doing Surgical Trials ............ 64
5.10.6 Balance of Benefits and Risk ............ 64
5.11 Conclusions ............ 64
References ............ 64
Abstract In this chapter, we discuss some of the methodological issues in the design and conduct of clinical trials of surgical procedures. The opinion of experts in trial design, management and analysis should be sought at an early stage of a surgical clinical trial programme, recognising that other study designs may also be important including the prospective protocol-driven registry and the case control study.
5.1 Introduction
M. Flather (✉)
Clinical Trials and Evaluation Unit, Royal Brompton Hospital and Imperial College, London SW3 6NP, UK
e-mail:
[email protected]

The basic principle of assessing the usefulness of treatments can be applied across the whole spectrum of diseases, whether they are complex surgical procedures or simple-to-administer pharmacological therapies. Answering a clinically important question reliably is the basis of any therapeutic evaluation [1]. In a notorious article in the Lancet in 1996, the editor Richard Horton used the title “Surgical research or comic opera: questions, but few answers” [2] and raised concerns about the methods used to evaluate surgical treatments and the research base of most surgical procedures. He argued that most of the information about surgical procedures was observational or anecdotal, whereas spending on surgery based on this evidence was a major factor in NHS health care costs. Surgeons were naturally upset by this accusation, especially as there had already been a number of important randomised trials involving surgical treatments [3, 4], but it seemed true that most trials were small and not conclusive. However, the message hit home; there are now growing departments of academic surgery where clinical trials and other systematic evaluations take place alongside more traditional basic, experimental and observational science, and surgeons are seeking new skills in the design and interpretation of clinical research and linking with specialist trials units [5].
T. Athanasiou (eds.), Key Topics in Surgical Research and Methodology, DOI:10.1007/978-3-540-71915-1_5, © Springer-Verlag Berlin Heidelberg 2010
How do we reliably evaluate surgically based treatments, which by their very nature are complex and invasive strategies involving many health professionals (surgeon, anaesthetist, specialist nurses, intensivists, etc.), specialist facilities (operating rooms, intensive care, specialist wards) and equipment (special instruments, cardiopulmonary bypass, prostheses)? If we define a “treatment” as “any activity, procedure or drug provided to a patient with the intention of improving health”, we can, for the purposes of designing clinical trials and other evaluations, suppose that all treatments are like a “black box” and, irrespective of its contents, evaluate one treatment in much the same way as another. While this concept is intuitively attractive, the complexity of the treatment has a huge impact on the practical elements of trial design, delivery of the treatment and its costs; it is therefore too simplistic to imagine that surgical treatments can be evaluated in randomised controlled trials in exactly the same way as drug treatments [6].
5.2 Current Concepts in Clinical Trials

In essence, clinical trials should not be regarded any differently from an experiment in the laboratory. A hypothesis is posed, an experiment designed with “treatment” and “control” groups, observations made and analysed, and inferences drawn about whether the hypothesis has been supported or not. It is probably the multi-dimensional approach and organisational complexity of clinical trials that set them apart from laboratory experiments, together with the fact that the subjects taking part are human. Clinical trials are also often complicated and costly to set up and run. A broad range of treatments can be evaluated. Trials can be led and organised by independent academic groups to improve health care and enhance reputations, or by industry to improve longer-term profits. Industry and academia have very different philosophies, cultures and aims, and yet many of the most successful health care improvements have been the result of fruitful partnerships between industry on one hand, with its expertise in new product development and commercial drive, and academic investigators on the other, who have experience of dealing with patients and their diseases and who understand the application of potentially complex trial protocols [7]. This partnership is often required in surgical research, as most surgical treatments involve
innovative instruments and devices designed to make operations easier, safer and more effective.
5.3 Basic Concepts of Trial Design

One key issue to address is “what is a clinical trial?”, as this term has many different apparent meanings. In its simplest form, a clinical trial is a systematic evaluation of a treatment in human subjects comparing the treatment of interest with a control. Usually, “clinical trial” is synonymous with “randomised controlled trial”, in which the experimental treatment and control treatment are allocated to participating subjects in a random manner. Study design is critically dependent on the existing knowledge of the treatment under evaluation. For example, if we were to evaluate aspirin for a new indication (e.g. bowel cancer), the prior knowledge and experience of aspirin in clinical trials and clinical practice (literally millions of patient-years of experience) would inform the rate of expected side effects, and it would be the efficacy of treatment in this indication that would be the major unknown. However, if a new surgical procedure, e.g. left ventricular transapical implantation of an aortic valve, were being evaluated, then the safety and feasibility of the procedure would be the main goals. Thus, in the evaluation of treatments with relatively little human experience, safety and feasibility are the key early goals. In spite of this, it is often not acceptable to perform safety and feasibility trials without trying to collect some additional information on efficacy. For treatments in the early stages of development, these outcome measures will usually be determined by the postulated mechanism of benefit of the treatment. For example, a clinical trial designed to test a new knee prosthesis may initially study parameters such as ease of insertion, length of operation and post-operative function, whereas with more experience, the main aims would be longer-term durability, function and costs [8]. Outcome measures that ultimately determine whether a treatment could be used routinely in clinical practice are as follows:

1. Clear and reliable evidence of efficacy on clinically important parameters (reduction of clinically important outcome events or improvement of symptoms)
2. Acceptable safety
3. Acceptable health care costs
Fig. 5.1 Proposed sequence of evaluation of surgical procedures. In this diagram, we have emphasised the key roles of identifying the clinical question to be addressed and the prospective protocol-driven registry. Randomised trials are desirable, but may not be practical in all situations
[Fig. 5.1 flow: small case series / laboratory studies / animal models → clinical question and hypothesis → literature review, rationale, protocol; funding applications → prospective observational protocol-driven registry → smaller randomised trials / case control studies → larger randomised trials / larger observational registries → overall assessment of safety, efficacy, resource use and cost-effectiveness]
In order to achieve these goals, we need to have reasonable evidence of safety, feasibility and efficacy on intermediate or mechanistic measures prior to embarking on large and resource-hungry trials that may take several years to complete. In addition, it is almost impossible to obtain funding for larger randomised trials without this basic information. Examples of mechanistic variables include reduction of blood pressure to avoid future stroke, or measurement of the international normalised ratio (INR) to monitor warfarin efficacy in patients with atrial fibrillation. Figure 5.1 shows a sequence of evaluation of potential new surgical treatments.
5.4 What Do I Need to Know Before Starting a Clinical Trial?

There are two main aspects that need to be covered before a clinical trial programme can be set up. First, we need to know the “basics” – the building blocks of knowledge that underpin the scientific, philosophical, organisational and analytical aspects of clinical trials, such as the hypothesis, randomisation, sample size and protocol (Fig. 5.2). Second, we need to establish “resources”: personnel with the skills to lead a clinical trial (motivated and knowledgeable investigators), experts in the design, management and analysis of trials (usually found in specialised statistical clinical trials units), personnel to identify, enrol and
Fig. 5.2 Ingredients for a successful clinical trial: knowledge of the disease (pathophysiology, epidemiology); understanding how the treatment works; measuring appropriate outcomes in the right way; a large enough study to detect plausible treatment differences. When identifying the question to be addressed in a clinical trial, certain key criteria need to be met to reduce the chance of failure. Most of these parameters are intuitive, but it is surprising how many trials are implemented when one or more of these criteria are not met
follow up participating patients (usually found in hospitals or community care facilities) and funds to cover the costs of the clinical trial [9, 10]. We can deal with cost and organisational issues first, as these are conceptually simpler than the methodological issues involved. Clinical trials are by definition complicated ventures. It is almost impossible to fund even small clinical trials (e.g. 30–40 patients) using available health care resources. Funding to properly cover the
costs of enrolling and following up patients, study management (data management, study monitoring and statistical analysis) and additional tests for research purposes must be obtained in advance of a study starting. There are multiple sources of funding for clinical trials, ranging from commercial companies and independent grant funders to government organisations and private benefactors. Inevitably, obtaining funds from any of these organisations is extremely competitive and time-consuming. The time taken to prepare a good application, including the scientific rationale, plan of investigation and costs, is usually 6–12 months, and a further 1–2 years are needed for funds to be awarded if the application is successful; this time scale seems to hold true irrespective of whether the funding application is made to a commercial or an independent source. Any clinical research involving human subjects is governed by complicated ethical codes of conduct and laws (“regulations”). Most countries have government organisations to regulate the use of medicines and medical devices for clinical care and for research, with the aim of ensuring that effective and safe medicines are developed and marketed. The best known of these is the Food and Drug Administration (FDA) of the United States. In the UK, the Medicines and Healthcare products Regulatory Agency (MHRA) performs a similar role, and the European Medicines Evaluation Agency (EMEA) has a growing role in regulating medicines across the European Community. Approval is
required from the appropriate national agencies for most clinical trials involving drugs or devices. Any clinical research involving human subjects requires approval from an independent, properly constituted research ethics committee [11]. The exact nature of these approvals varies from country to country, but the principles are the same. The final legal hurdle before starting a trial is approval from the institution where the research is being carried out. This usually follows signing of a contract with the sponsor of the trial. The sponsor is the legal entity which takes overall responsibility for the conduct and quality of the trial. Sponsors are often commercial companies, especially for drug or new device development trials, but increasingly Universities and academic hospitals are sponsoring clinical trials. In many cases, the sponsor may delegate the running of a trial to another group, for example, a contract research organisation in the case of a pharmaceutical company, or an academic clinical trials group for independent trials. In any case, the sponsor has a responsibility to ensure that the trial runs properly and complies with ethical and legal requirements. This is usually done by interim monitoring visits to participating sites to verify that informed consent has been obtained, the protocol is being followed and that other aspects are of high quality including drug storage, procedures and investigations and recording of data. Figure 5.3 summarises a number of key concepts in setting up a clinical trial.
Fig. 5.3 Key stages of starting a clinical trial. This diagram shows the main steps that need to be completed when starting a clinical trial: collaboration with trials unit and statistical team → funding available → full protocol written → who is the sponsor? → agree data collection method → set up committees (Steering, Safety and Adjudication) → regulatory approval → ethical approvals → sponsor approval → agreements with other centres → site monitoring plan → training of sites → supply of study materials → local site approvals → start of enrolment. Collaboration with an experienced trials unit and statistical team is essential at the beginning of the process. Once funding has been obtained, there are many stages to complete, and it may take up to a year before the first patient can be enrolled. Trials comparing two accepted surgical procedures generally do not require regulatory approval, but these approvals are needed for new devices. For multi-centre trials, centres may start enrolment at very different times, mainly due to delays in obtaining local agreements
5.5 How Do I Evaluate Surgical Procedures?

We have established that a surgical procedure is a complicated intervention, and this poses a number of important challenges to the design and conduct of clinical research in surgery. Surgical procedures can only be carried out in settings that are able to support them. For example, even basic requirements such as aseptic technique, appropriate anaesthesia and post-operative care are usually found only in a hospital setting. Trained professionals are needed to carry out and support these procedures. Complex surgery such as neurosurgery, organ transplantation and cardiac surgery can only be carried out in highly specialised centres. In addition, these procedures are expensive: costs include those of health care professionals, specialised equipment, hospital intensive care and medicines. Against this background, it is small wonder that the reliable evaluation of surgical procedures is methodologically difficult and that many of the fundamental issues, including study design, the ethics of randomisation, blinding of treatments and the definition of outcome measures, are still in development [9, 10]. There is much debate about whether surgical procedures should be evaluated through randomised trials because of potential logistic, ethical, methodological and funding issues. There is no simple answer to this question, but many surgical procedures, especially those that are well established, can certainly be compared with alternative treatment strategies [11, 12]. Problems arise when new or very complex surgical strategies are subjected to the rigorous environment of the randomised trial, as the variability of the procedure itself (e.g. operator differences and variability in post-operative care) may be larger than the potential benefits that can be detected. These issues are discussed further below.
5.6 The Process of Surgical Evaluation

As seekers of information and evidence, we must start with the premise that all health care treatments should be evaluated. The simplest evaluation, which should be routine for all patients and procedures, is a comprehensive but practical description of what has been
done, to whom and what the consequences were. This is the most basic of all evaluations: the collection of high-quality observational data. In the UK, this is often called “audit”, and in North America, “outcomes research” [13, 14]. Sadly, most health care systems fall far short of this basic criterion. In surgery, it is vital that key information from all operations, including subsequent patient outcomes, is properly documented in an electronic database, that these data are checked for quality and tabulated, and that analyses and reports are produced by experts with relevant expertise. The conventional definition of surgical mortality is vital status at 30 days, but this is a crude metric: patients should be followed for at least a year, and preferably longer for major operations, using national registers of death and health outcomes. Some operations are already recorded in national databases, such as those of cardiac surgery in the UK and North America [15–17]. A new surgical procedure can loosely be defined as one which involves an operative technique that has been developed relatively recently (e.g. within the previous 2 years) or one that involves a new piece of equipment (prosthesis or surgical instrument). These procedures should all start with a systematic evaluation in a prospective, observational, protocol-driven registry (Fig. 5.1). This process involves writing a comprehensive protocol summarising the rationale for the new procedure, a comprehensive literature search, aims and hypotheses, eligible patients, a description of the procedure and the outcomes of interest. The outcomes will include simple clinical ones (death or major complications), but should also include more detailed evaluations of quality, e.g. function after hip replacement using a new prosthesis, or haemodynamic characteristics of replacement heart valves using echocardiography. These protocols should undergo peer review and approval for ethical, practical, cost and safety considerations.
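The "comprehensive but practical description" of each operation amounts to one structured record per case in the registry database. A minimal sketch of such a record follows; the field names are illustrative assumptions, not taken from any named national database:

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class OperationRecord:
    """One row of a prospective, protocol-driven surgical registry."""
    patient_id: str
    procedure: str
    operation_date: date
    surgeon: str
    complications: list = field(default_factory=list)
    alive_at_30_days: Optional[bool] = None  # conventional surgical mortality metric
    alive_at_1_year: Optional[bool] = None   # longer follow-up, as the text recommends

record = OperationRecord(
    patient_id="P-0001",
    procedure="aortic valve replacement",
    operation_date=date(2009, 5, 1),
    surgeon="surgeon-07",
)
record.alive_at_30_days = True  # updated once 30-day follow-up is complete
```

Keeping follow-up fields explicitly `None` until the data are collected makes it easy to audit the registry for incomplete follow-up.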
The results should be reviewed periodically, and general conclusions about safety and efficacy can then be drawn. In its simplest form, there is no control group for these protocol-driven registries; the next reasonable step, therefore, is to add a control group, and the simplest method is to use a case control design [18]. In this design, information from concurrent “control” patients receiving a more conventional surgical procedure for the same condition is also collected and evaluated. Simple comparisons can be made between patients who have similar characteristics based on age,
gender, severity of disease, etc. The case control design is of course prone to many biases, not least that the cases and controls may differ in important ways, but it is simple to carry out and does not require the complex preparation, approvals or costs associated with randomised trials. Case control studies of surgical procedures serve us best if they are carried out within the same institution, but if this is not possible, seeking cases and controls from different institutions is also helpful [19]. Probably the least useful approach is the use of historical controls (patients who had procedures in the previous year or two), as this introduces many more biases; in particular, the cases and historical controls are likely to differ owing to changes in treatment over time, which can lead to very misleading conclusions. Case control studies can provide relatively reliable information on the length of procedures, resource use and outcome measures, but can rarely provide conclusive evidence that the new procedure is better than an existing one. It is fair to say that “traditional” evaluation of surgical procedures has involved case series (a simpler form of the protocol-driven registry) and loose comparisons with previous case series (a type of case control design), but our recommendations take these traditional study designs to more rigorous and modern standards. How can we go beyond the case control design? The key question that needs to be answered after the case control study is: “is the procedure ready for routine use or are there fundamental aspects that need refining?” Associated with this question is the whole issue of the “learning curve”. The learning curve has not been properly defined, but it is a concept familiar to all surgeons and those evaluating surgical procedures [20, 21]. In its basic form, it refers to the time taken to become familiar with the new procedure, both from the operative point of view and from that of anaesthesia and post-operative care.
We are assuming that surgeons, operating staff and anaesthetists are all experienced, so the learning curve refers to the period of assimilation of a new procedure by highly trained health professionals. We do not really know when the learning curve has finished because, of course, we are always “learning”; even established procedures are continually refined, and the learning process will be different for each new procedure. However, experience and judgement can tell us when we have mastered the basic issues of a new procedure, and ideally, these parameters should be prespecified when we start the programme.
5.7 What Do I Do When the Learning Curve Has Been Completed and the New Procedure Appears Feasible and Safe?

When a new procedure appears to be feasible and safe and to offer genuine advantages over the more conventional approaches, in an ideal world it should be subjected to a robust comparison with the existing procedure or treatment. A randomised comparison between two surgical procedures is generally quite feasible if sufficient numbers can be entered into a study to make the results meaningful. Determining the effects of one procedure vs. another on mortality or major clinical outcomes (serious infections, myocardial infarction, stroke, cancer recurrence, etc.) may require large numbers of patients, which is often impractical under present funding and organisational systems. A simple example of a barrier to larger surgical randomised trials is the difficulty of obtaining independent (non-commercial) funds to carry out studies in different countries. The National Institutes of Health in the United States has an established funding mechanism to support larger studies of surgical treatments. Procedures for supporting multi-national clinical trials in the European Union are evolving, but the amount of financial support for these studies is still relatively small. Companies developing new devices for use in surgery are not able to invest the large sums of money that are often spent on pharmaceutical drug development, because the markets for surgical products are much smaller and the devices are expensive to provide for larger randomised trials. In spite of these funding issues, there have been a number of successful large randomised trials in surgery, including trials of carotid endarterectomy and coronary revascularisation [22–25].
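To see why comparisons on mortality or major clinical outcomes "may require large numbers of patients", consider the standard two-proportion sample-size formula. The sketch below uses illustrative event rates (the 10% vs. 7% figures are our assumption, not from the text):

```python
import math
from statistics import NormalDist

def n_per_group(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate patients needed per arm to detect event rates p1 vs. p2
    with a two-sided test at the given significance level and power."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # e.g. 1.96 for alpha = 0.05
    z_b = NormalDist().inv_cdf(power)           # e.g. 0.84 for 80% power
    p_bar = (p1 + p2) / 2
    num = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p1 - p2) ** 2)

# Detecting a fall in mortality from 10% to 7% needs well over
# a thousand patients per arm:
n = n_per_group(0.10, 0.07)
```

Smaller absolute differences between the arms inflate the denominator's effect dramatically, which is exactly why trials powered on mortality endpoints become impractically large for many surgical questions.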
5.8 Randomised Trials of Surgical Procedures
Randomised trials of any complex treatment, especially surgical interventions, pose a number of additional methodological, design and ethical issues compared to simpler pharmacological treatments [26]. Prior to planning a randomised comparison of a promising new surgical procedure, we need to address four major, related questions:
5
Randomised Controlled Trials: What the Surgeon Needs to Know
1. Do the results of a carefully conducted observational protocol-driven registry (case series) satisfy the basic criteria of feasibility and safety? Pragmatically, we might say that between 50 and 100 such cases need to be carried out to have any hope of providing reliable observational data.
2. Is there sufficient experience of carrying out the new procedure to allow it to be tested against established surgical or non-surgical alternatives (have we gone beyond the learning curve)?
3. Is there sufficient enthusiasm and support from the surgical community to introduce the new technique and, therefore, subject it to reliable evaluation in a randomised trial?
4. What will be the comparator for the new technique? Will it be another surgical or interventional treatment or "medical" therapy?
Identifying the comparator group is often more difficult than it seems. When the new procedure is a variation of a previous operation, or involves a new prosthesis or device, it may be relatively simple to select an appropriate comparator. For example, insertion of a new aortic
valve may simply be compared against implanting the conventional valve, or resection of a bowel tumour using new vs. old operative methods. In these cases, the only real variation between the new and established procedure is the difference in technique or device – the pre-operative work-up, anaesthesia, basic operative methods, use of circulatory support, recovery methods and post-operative care are essentially the same. Thus, any variations in outcomes may reasonably be attributed to the new surgical methods. Problems of comparison arise when the two surgical methods vary considerably (more invasive vs. less invasive), if a surgical procedure is compared to a percutaneous procedure, or, even more problematic, if a surgical procedure is compared to medical therapy. Figure 5.4 summarises some of the concepts of study design and comparator groups for surgical studies. In these situations there are many variables involved, but the way we generally manage these comparisons is to regard the treatments given to the two groups as "black boxes", i.e. the multiple variables between the two treatments are ignored, and we simply compare the outcomes between the two groups
Fig. 5.4 Schematic interaction of study design and comparison group on the evaluation of surgical procedures (study design, from less to more complicated: case series, protocol-driven registry, case-control study, small randomised trial, large randomised trial; comparison group, from less to more complicated: no control, case-matched control, surgery vs. surgery, surgery vs. other intervention, surgery vs. medical therapy). This figure shows that large randomised trials are the most complicated study design to implement in surgery (x axis), and that comparing surgery vs. no surgery is difficult to implement in practice. A hypothetical relationship is proposed between the complexity of study design and the feasibility of the comparison group. The comparison of one surgical procedure to another similar one (e.g. comparison of two surgical methods to remove a bowel tumour) is the most feasible to implement in practice in a randomised controlled trial
as we would for a comparison of a pharmacological treatment vs. placebo. Clearly, comparing a surgical procedure against medical therapy is a difficult task, not least because in clinical practice we would rarely give a patient a true choice between a surgical operation and no operation. A full discussion of these issues is beyond the scope of this chapter.
5.9 Selection of Outcomes for Surgical Trials
Evaluation of new surgical procedures should follow the simple rules for all new treatments:
1. Understanding how the new procedure might provide additional benefits, from experimental models (in vitro, laboratory and animal models)
2. Safety and feasibility in patients
3. Measures of mechanistic improvements specific to that procedure (e.g. better haemodynamics for new heart valves, or better mobility for knee replacement) and general improvements (fewer infections, less blood loss, shorter operations, quicker recovery times, etc.)
4. Improved clinical outcomes (better survival, reduced complications, better quality of life, etc.) in properly controlled studies
Appropriate selection of outcomes is very important for all research studies. When attempting to demonstrate that a new surgical procedure is potentially clinically worthwhile, demonstration of "mechanistic efficacy" is critical in addition to safety and feasibility. Thus, if we are developing a new knee joint, we must, as a very basic goal, demonstrate safety (similar, or fewer, operative and post-operative complications than the standard existing knee prostheses) and feasibility (the operation does not require extra staff or unreasonable extra operative time, and is not much more costly). In addition, the functional status of patients with the new joint must be compared rigorously with that of patients with the standard knee joint. Outcome measures for this might include recovery time, presence of pain, extent of knee movement, walking ability, durability over a reasonable time period (e.g. 1 year) and costs. The most robust way to do this is in a randomised trial, but a carefully designed case-control study could provide reasonable evidence while randomised trials are being prepared. For conditions
which are non-life-threatening, like knee and hip degeneration, properly designed randomised trials of mechanistic outcomes and costs will generally be sufficient to provide a level of evidence to inform a decision to introduce the new procedure or not. Issues of subgroup analysis and external validity of surgical trials are also beyond the scope of this chapter but are discussed in detail in several articles [26–29]. Health economic issues are increasingly influencing decisions regarding whether to introduce new surgical procedures or not. Common sense tells us that we should weigh up the costs and effectiveness of new surgical procedures just as we would when buying a new car (although we might also weigh up the "status" effect of a more expensive car as well as its "effectiveness" at transporting us from A to B). How we go about doing this is complicated by the fact that metrics for measuring cost-effectiveness are still evolving [30, 31]. The quality-adjusted life year (QALY) is proving to be the most popular and, in theory, can be applied to all health care outcomes irrespective of the disease or intervention [32]. From the practising surgeon's point of view, the most important aspect is to collect information on the use of key "cost-drivers", i.e. those aspects of health care that make up most of the costs of a surgical procedure. These are some of the common cost drivers for surgical procedures:
1. Staff time
2. Use of the operating room
3. Disposable equipment
4. Prostheses, implants or devices
5. Intensive or high-dependency care
6. Length of hospital stay
7. Expensive associated treatments (antibiotics, immunosuppressants, blood products)
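Once the main cost drivers have been costed, the comparison with an existing procedure is often summarised as an incremental cost-effectiveness ratio (ICER): the extra cost incurred per QALY gained. A minimal sketch, with entirely hypothetical costs and QALY estimates:

```python
def icer(cost_new, cost_old, qaly_new, qaly_old):
    """Incremental cost-effectiveness ratio: extra cost per QALY gained."""
    return (cost_new - cost_old) / (qaly_new - qaly_old)

# Hypothetical example: the new procedure costs 3,000 more per patient
# and yields an estimated 0.3 additional QALYs over the time horizon.
extra_cost_per_qaly = icer(12_000, 9_000, 8.2, 7.9)  # about 10,000 per QALY
```

Decision makers then compare such a figure against a willingness-to-pay threshold; in practice the QALY estimates themselves carry substantial uncertainty and are usually presented with sensitivity analyses.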
Performing a detailed analysis of all potential costs is in itself costly and time-consuming (the so-called "bottom-up" cost analysis), so most of the time the most expensive cost drivers are identified and used to calculate costs. For surgical procedures that are designed to treat common life-threatening illnesses like cancer or cardiovascular diseases, it is appropriate to select important adverse clinical outcomes for randomised trials that compare new procedures with existing ones. Common clinical outcomes include survival, freedom from disease recurrence (e.g. cancer or cardiovascular events) and major complications (bleeding, infection,
bowel obstruction). The selection of outcome is specific to the disease being managed and the procedure under evaluation. Clearly it would be inappropriate to select survival as the main outcome for surgical procedures that are designed to improve quality of life such as knee and hip surgery.
5.10 Practical Issues for Designing Randomised Trials in Surgery
We have summarised the main issues in the subsections below.

5.10.1 Eligibility Criteria
Patients undergoing surgery for "routine" clinical indications are already subjected to selection criteria (can they withstand an anaesthetic, do they have particular high-risk features, etc.), and when enrolled in a clinical trial, further selection criteria are used. Thus, the generalisability of a surgical study (the ability to apply the results outside of the trial to more general clinical populations) may be quite limited [27, 29]. It is therefore important to keep the eligibility criteria as broad as possible, which also helps to maximise enrolment and achieve the planned sample size.

5.10.2 Process of Randomisation
Randomisation should be carried out by a unit with expertise in this area. Methods using envelopes are considered obsolete because of the high rates of tampering and bias. Randomisation methods using telephone or internet are considered standard [33]. Investigators can either call the randomisation centre and speak to an individual who will provide the randomisation allocation, or receive an allocation using an automated telephone-based randomisation system which requires keying in numeric information before the allocation is provided. Web-based systems are gaining popularity, and investigators can obtain the required study treatment allocation by entering their study site details and a basic eligibility checklist at any time of day or night. Similarly, data collection using web-based electronic forms is also becoming more common. One of the key issues in the randomisation process is that in some cases eligibility can only be determined during the operative procedure. In this case we recommend that patients are screened prior to surgery as usual, informed consent obtained and any preoperative tests carried out. The patient is then registered as being "trial eligible", but not randomised [34]. During the operation, if eligibility is confirmed, the patient can then be randomised by telephone or internet. This process ensures that study drop-outs (patients randomised but who do not receive their allocated procedure) are kept to a minimum.

5.10.3 Blinding of Treatment Allocations
It is generally accepted that evaluation of surgical procedures will not be done in a blinded manner, i.e. the investigator and the patient know which procedure has been carried out [12]. This can obviously lead to bias in assessing outcome measures unless these are very robust, like mortality. For mechanistic outcomes such as walking ability, heart function, lung function, etc., assessments in an "open label" study (where allocations are known) need to be made by observers blinded to the original allocation, using a PROBE (prospective, randomised, open-label, blinded-endpoint) design [35].

5.10.4 Sample Size and Enrolment Issues
Surgical procedures usually require substantial resources, and most are carried out in the hospital setting. More complex surgical procedures, which require evaluation in clinical trials, may be uncommon, so the notion that we can perform large studies of complex surgical procedures assessing their impact on robust clinical outcomes is often not feasible, although large randomised trials in surgery have been carried out [36, 37]. Thus, we need innovative study designs and appropriately robust outcomes to evaluate complex surgical interventions. Unfortunately this is precisely an area where we do not have methodological solutions to the problem. The problem is partly alleviated if the complex procedures are relatively common, such as coronary artery bypass grafting, carotid endarterectomy, hip replacement or bowel resection for cancer. Enrolment in surgical trials is also a challenge, and for most studies it is important to obtain agreement from as many centres as possible to collaborate in a multicentre study to help achieve the required sample size.
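The scale of the sample-size problem can be illustrated with the standard formula for comparing two proportions. The event rates below are hypothetical, chosen only to show how quickly the numbers grow when the expected difference between procedures is modest:

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Patients needed per arm to detect p1 vs. p2 (two-sided test)."""
    z = NormalDist()
    z_a = z.inv_cdf(1 - alpha / 2)   # e.g. 1.96 for alpha = 0.05
    z_b = z.inv_cdf(power)           # e.g. 0.84 for 80% power
    p_bar = (p1 + p2) / 2
    numerator = (z_a * sqrt(2 * p_bar * (1 - p_bar))
                 + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

# Detecting a fall in a major complication rate from 10% to 7%
# needs roughly 1,350 patients per arm (about 2,700 in total).
n = n_per_group(0.10, 0.07)
```

Numbers of this order for a single uncommon procedure make multicentre, and often multinational, collaboration unavoidable, which is exactly the enrolment challenge described above.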
5.10.5 Costs of Doing Surgical Trials
Administrative, ethical and cost issues are a great barrier to the implementation of surgical trials. Health care providers may demand reimbursement for some or all of the costs of surgical procedures and devices when they are evaluated in clinical trials. We need processes that make it much easier to carry out high-quality randomised trials in surgery. One proposal is to keep a register of planned or ongoing randomised studies of surgical procedures on a national or international basis, with all centres with the necessary expertise notified and invited to participate. When a patient is scheduled for an operation, they are considered for a trial if one exists for that procedure. In this scenario, additional costs of running a trial at a centre over and above routine care are not really required, and simple and limited data collection can be carried out using a web-based interface.
5.10.6 Balance of Benefits and Risk
Elective surgical procedures almost always carry a higher early risk than not carrying out any procedure. The benefits of surgical procedures accrue over time, hence the need to follow patients properly. Some complex, high-risk surgical procedures are carried out for palliation or symptom control, and because there are very few other alternatives for patients. In these situations, trying to prove that the procedure prolongs life or improves outcomes can be difficult because of the high early morbidity and mortality associated with the procedure itself. Some examples include surgery for major trauma, repair of ruptured aortic aneurysm and surgery for symptomatic cancers ("bulk removal") where the chance of cure is small. Trying to prove one strategy is better than another can be very difficult, and a lot of thought and time needs to go into the design and
co-ordination of appropriate clinical trials and other evaluations, and into the appropriate use of surrogate outcomes to determine whether larger studies are warranted [38].
5.11 Conclusions
Following traditional rules for the design and conduct of clinical trials in surgery is intuitively attractive, but there are many hurdles related to the complexity of the interventions being assessed, the associated costs of running trials, and major design and ethical issues, including the ability to enrol enough patients into surgical trials. These problems should not be an absolute barrier to surgical trials, but they should be recognised by those involved in designing and running these trials as well as by those who review grant applications and manuscripts submitted for publication. The opinion of experts in trial design, management and analysis should be sought at an early stage of a surgical clinical trial programme, recognising that other study designs may also be important, including the prospective protocol-driven registry and the case-control study. However, the basic elements of every surgical procedure should be recorded (patient characteristics, procedural details, associated treatments, and short- and long-term outcomes). These data should be pooled in national and international registries to evaluate the current standards of care and provide hypotheses for improvement. Similarly, the results of smaller surgical trials can be carefully pooled, ideally using individual patient data, in collaborative meta-analyses [39, 40]. Without these basic steps the careful evaluation of surgery cannot advance, and the development of large-scale clinical trials for the evaluation of selected surgical procedures may be impossible.
References 1. Peto R, Collins R, Gray R (1995) Large-scale randomized evidence: large, simple trials and overviews of trials. J Clin Epidemiol 48:23–40 2. Horton R (1996) Surgical research or comic opera: questions, but few answers. Lancet 347:984–985 3. Sculpher MJ, Seed P, Henderson RA et al (1994) Health service costs of coronary angioplasty and coronary artery bypass surgery: the Randomised Intervention Treatment of Angina (RITA) trial. Lancet 344:927–930
4. Yusuf S, Zucker D, Peduzzi P et al (1994) Effect of coronary artery bypass graft surgery on survival: overview of 10-year results from randomised trials by the Coronary Artery Bypass Graft Surgery Trialists Collaboration. Lancet 344: 563–570 5. Rahbari NN, Diener MK, Fischer L et al (2008) A concept for trial institutions focussing on randomised controlled trials in surgery. Trials 9:3 6. Garrett MD, Walton MI, McDonald E et al (2003) The contemporary drug development process: advances and challenges in preclinical and clinical development. Prog Cell Cycle Res 5:145–158 7. Feldman AM, Koch WJ, Force TL (2007) Developing strategies to link basic cardiovascular sciences with clinical drug development: another opportunity for translational sciences. Clin Pharmacol Ther 81:887–892 8. Boutron I, Ravaud P, Nizard R (2007) The design and assessment of prospective randomised, controlled trials in orthopaedic surgery. J Bone Joint Surg Br 89:858–863 9. Balasubramanian SP, Wiener M, Alshameeri Z et al (2006) Standards of reporting of randomized controlled trials in general surgery: can we do better? Ann Surg 244:663–667 10. Tiruvoipati R, Balasubramanian SP, Atturu G et al (2006) Improving the quality of reporting randomized controlled trials in cardiothoracic surgery: the way forward. J Thorac Cardiovasc Surg 132:233–240 11. Burger I, Sugarman J, Goodman SN (2006) Ethical issues in evidence-based surgery. Surg Clin North Am 86:151–168; x 12. Boyle K, Batzer FR (2007) Is a placebo-controlled surgical trial an oxymoron? J Minim Invasive Gynecol 14:278–283 13. Mann CJ (2003) Observational research methods. Research design II: cohort, cross sectional, and case-control studies. Emerg Med J 20:54–60 14. Wolfe F (1999) Critical issues in longitudinal and observational studies: purpose, short versus long term, selection of study instruments, methods, outcomes, and biases. J Rheumatol 26: 469–472 15. 
Bridgewater B, Grayson AD, Brooks N et al (2007) Has the publication of cardiac surgery outcome data been associated with changes in practice in northwest England: an analysis of 25,730 patients undergoing CABG surgery under 30 surgeons over eight years. Heart 93:744–748 16. Ferguson TB Jr, Dziuban SW Jr, Edwards FH et al (2000) The STS national database: current changes and challenges for the new millennium. Committee to Establish a National Database in Cardiothoracic Surgery, The Society of Thoracic Surgeons. Ann Thorac Surg 69:680–691 17. Keogh BE, Bridgewater B (2007) Toward public disclosure of surgical results: experience of cardiac surgery in the United Kingdom. Thorac Surg Clin 17:403–411; vii 18. Chautard J, Alves A, Zalinski S et al (2008) Laparoscopic colorectal surgery in elderly patients: a matched case-control study in 178 patients. J Am Coll Surg 206:255–260 19. Zondervan KT, Cardon LR, Kennedy SH (2002) What makes a good case-control study? Design issues for complex traits such as endometriosis. Hum Reprod 17:1415–1423 20. Moran BJ (2006) Decision-making and technical factors account for the learning curve in complex surgery. J Public Health (Oxf) 28:375–378 21. Murphy GJ, Rogers CA, Caputo M et al (2005) Acquiring proficiency in off-pump surgery: traversing the learning
curve, reproducibility, and quality control. Ann Thorac Surg 80:1965–1970 22. Anon (1998) Randomised trial of endarterectomy for recently symptomatic carotid stenosis: final results of the MRC European Carotid Surgery Trial (ECST). Lancet 351: 1379–1387 23. Anon (2002) Coronary artery bypass surgery versus percutaneous coronary intervention with stent implantation in patients with multivessel coronary artery disease (the Stent or Surgery trial): a randomised controlled trial. Lancet 360: 965–970 24. Barnett HJ, Taylor DW, Eliasziw M et al (1998) Benefit of carotid endarterectomy in patients with symptomatic moderate or severe stenosis. North American Symptomatic Carotid Endarterectomy Trial Collaborators. N Engl J Med 339: 1415–1425 25. Serruys PW, Unger F, Sousa JE et al (2001) Comparison of coronary-artery bypass surgery and stenting for the treatment of multivessel disease. N Engl J Med 344:1117–1124 26. Rothwell PM, Mehta Z, Howard SC et al (2005) Treating individuals 3: from subgroups to individuals: general principles and the example of carotid endarterectomy. Lancet 365:256–265 27. Flather M, Delahunty N, Collinson J (2006) Generalizing results of randomized trials to clinical practice: reliability and cautions. Clin Trials 3:508–512 28. Rothwell PM (2005) Treating individuals 2. Subgroup analysis in randomised controlled trials: importance, indications, and interpretation. Lancet 365:176–186 29. Rothwell PM (2005) External validity of randomised controlled trials: “to whom do the results of this trial apply?” Lancet 365:82–93 30. Aziz O, Rao C, Panesar SS et al (2007) Meta-analysis of minimally invasive internal thoracic artery bypass versus percutaneous revascularisation for isolated lesions of the left anterior descending artery. BMJ 334:617 31. 
Rao C, Aziz O, Panesar SS et al (2007) Cost effectiveness analysis of minimally invasive internal thoracic artery bypass versus percutaneous revascularisation for isolated lesions of the left anterior descending artery. BMJ 334:621 32. McNamee P (2007) What difference does it make? The calculation of QALY gains from health profiles using patient and general population values. Health Policy 84:321–331 33. Vaz D, Santos L, Machado M et al (2004) Randomization methods in clinical trials. Rev Port Cardiol 23:741–755 34. Perez de Arenaza D, Lees B, Flather M et al (2005) Randomized comparison of stentless versus stented valves for aortic stenosis: effects on left ventricular mass. Circulation 112:2696–2702 35. Smith DH, Neutel JM, Lacourciere Y et al (2003) Prospective, randomized, open-label, blinded-endpoint (PROBE) designed trials yield the same results as double-blind, placebo-controlled trials with respect to ABPM measurements. J Hypertens 21:1291–1298 36. Qureshi AI, Hutson AD, Harbaugh RE et al (2004) Methods and design considerations for randomized clinical trials evaluating surgical or endovascular treatments for cerebrovascular diseases. Neurosurgery 54:248–264; discussion 264–267 37. Taggart DP, Lees B, Gray A et al (2006) Protocol for the arterial revascularisation trial (ART). A randomised trial to compare
survival following bilateral versus single internal mammary grafting in coronary revascularisation [ISRCTN46552265]. Trials 7:7 38. Sellier P, Chatellier G, D'Agrosa-Boiteux MC et al (2003) Use of non-invasive cardiac investigations to predict clinical endpoints after coronary bypass graft surgery in coronary artery disease patients: results from the prognosis and evaluation of risk in the coronary operated patient (PERISCOP) study. Eur Heart J 24:916–926
39. Lim E, Drain A, Davies W et al (2006) A systematic review of randomized trials comparing revascularization rate and graft patency of off-pump and conventional coronary surgery. J Thorac Cardiovasc Surg 132:1409–1413 40. Sedrakyan A, Wu AW, Parashar A et al (2006) Off-pump surgery is associated with reduced occurrence of stroke and other morbidity as compared with traditional coronary artery bypass grafting: a meta-analysis of systematically reviewed trials. Stroke 37:2759–2769
6
Monitoring Trial Effects Hutan Ashrafian, Erik Mayer, and Thanos Athanasiou
Contents
6.1 Introduction 67
6.2 Definition and Development of DMCs 68
6.3 Roles of the Committee 68
6.4 DMC Members 69
6.5 Conclusion 70
References 73
Abstract In order to address difficulties in running RCTs, clinicians introduced the concept of a totally objective group of assessors to continually evaluate and review trial results and execution. These groups are known as Data Monitoring Committees (DMCs) and have been set up to minimise trial complications while also optimising trial implementation. Furthermore, as it quickly became apparent that the results of some trials had significant implications for patient care before the trials had been completed, the DMCs, which are responsible for the continual review of trial data, also adopted the power to stop a trial, or advise its extension, if deemed scientifically necessary. This chapter delineates the role of these committees and clarifies some of their functions through surgical trial examples.
6.1 Introduction
H. Ashrafian () The Department of Biosurgery and Surgical Technology, Imperial College London, 10th Floor, Queen Elizabeth the Queen Mother (QEQM) Building, Imperial College Healthcare NHS Trust, St Mary’s Hospital Campus, Praed Street, London W2 1NY, UK e-mail:
[email protected]

Randomised clinical trials (RCTs) are regarded as the "gold standard" model for answering clinical questions in surgery. The benefits of this type of study are that they help to accurately define the differences in patient outcomes according to the various treatment arms to which the patients belong; if adequately sized, they can reveal subtle differences in outcomes, and they have the added advantage of limiting the effect of bias in each experiment. As study trialists have become more familiar with the workings and intricacies of running RCTs, the experiments undertaken have become larger and more complex. This has led to two fundamental issues in trial execution, namely
• Increased difficulty for one group of clinicians to execute a large trial (both in terms of manpower and academic support)
• Increased difficulty for one group of clinicians to adequately interpret the incoming results of a large trial
6.2 Definition and Development of DMCs
The United States Food and Drug Administration (FDA) clearly specifies the definition of a DMC: "A data monitoring committee is a group of individuals with pertinent expertise that reviews on a regular basis accumulating data from an ongoing clinical trial". The DMC advises the sponsor regarding the continuing safety of current participants and those yet to be recruited, as well as the continuing validity and scientific merit of the trial [1]. These committees are also recognised under a number of alternative titles (Table 6.1).

Table 6.1 Alternative titles for Data Monitoring Committees (DMCs)
Data Monitoring Committee (DMC)
Data Review Board (DRB)
Data and Safety Monitoring Board (DSMB)
Independent Data Monitoring Committee (IDMC)
Treatment Effects Monitoring Committee (TEMC)
Safety Monitoring Committee (SMC)
Policy Advisory Board (PAB)
Policy Board (PB)
Policy and Data Board (PDB)
The role of these committees was first proposed in the 1960s by the Greenberg Report of the National Heart Institute in order to aid the setting up and execution of large clinical trials [2]. The report recommended the use of "an oversight Policy Advisory Board", which would review the policies and progress of the trial, advising the trial group while also communicating with the trial sponsor, which at the time was the NIH. Members of the Policy Advisory Board were to be independent of the sponsor and could thus provide objective external advice for the trialists. In essence, they were well-informed, objective communicators and advisors acting between the sponsors of the trial and those taking part in it. The first formal use of such a committee was by the Coronary Drug Project Policy Advisory Board [3, 4], which set up a subcommittee to review the accumulating safety and efficacy data for the trial. Since that trial, DMCs have been a fundamental component of nearly every large RCT performed worldwide.
6.3 Roles of the Committee
The DMC is an autonomous committee that ensures that both project and patient standards are regulated and upheld during a clinical trial (Fig. 6.1). Furthermore, it is a body that can communicate its decisions and findings to the trial sponsor and the investigators. The National Institutes of Health stipulates that trial monitoring "should be commensurate with risks", and that a "monitoring committee is usually required to determine safe and effective conduct and to recommend conclusion of the trial when significant benefits or risks have developed" [5].

Fig. 6.1 Roles of the data monitoring committee (project: ensuring adherence to, and if necessary modification of, the experimental protocol; ensuring adequate patient numbers; monitoring adverse effects and follow-up; ensuring project time keeping and accuracy of data entry; supporting project completion. Patient: ensuring patient care is upheld; upholding ethics; stopping or halting the experiment if necessary. The DMC maintains a dialogue with the project investigators, the project sponsor and the media)

Fig. 6.2 Clinical trial review by the data monitoring committee (idea and question → trial design → grant award → trial start → interim analyses 1 to N, each reviewed by the DMC → trial end)

The National Cancer Institute
[6, 7] also stated that DMCs should be independent of study leadership and free from conflicts of interest. They are also required to
• Familiarise themselves with the research protocol(s) and plans for data and safety monitoring.
• Ensure that results are reported competently.
• Ensure rapid communication of adverse event reporting and treatment-related morbidity information.
• Perform periodic evaluation of outcome measures and patient safety information.
• Ensure that patients in the clinical trial are protected.
• Review interim analyses of outcome data (Fig. 6.2) to determine whether the trial should continue as originally designed, should be changed, or should be terminated based on these data.
• Review trial performance information such as accrual information.
• Determine whether and to whom outcome results should be released prior to the reporting of study results.
• Review reports of related studies to determine whether the monitored study needs to be changed or terminated.
• Review major proposed modifications to the study prior to their implementation (e.g. termination, dropping an arm based on toxicity results or other reported trial outcomes, increasing target sample size).
• Following each DMC meeting, provide the study leadership and the study sponsor with written information concerning findings for the trial: For example, any relevant recommendations related to continuing, changing, or terminating the trial. • Perform ongoing assessment of patient eligibility and evaluability. • Assure that the credibility of clinical trial reports and the ethics of clinical trial conduct are above reproach.
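The statistical side of an interim review can be made concrete with a simple example. One widely used convention (an illustration, not a rule prescribed in this chapter) is the Haybittle–Peto boundary: recommend early stopping only if an interim comparison reaches an extreme significance level such as p < 0.001, so that the final analysis is barely affected by the repeated looks at the data. A sketch with hypothetical interim event counts:

```python
from statistics import NormalDist

def interim_look(events_a, n_a, events_b, n_b, threshold=0.001):
    """Two-sided two-proportion z-test on interim trial data.

    Returns (stop_recommended, p_value); stopping is recommended
    only if the Haybittle-Peto boundary p < threshold is crossed."""
    p_a, p_b = events_a / n_a, events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = abs(p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(z))
    return p_value < threshold, p_value

# Hypothetical interim data: 40/200 events in arm A vs. 10/200 in arm B
stop, p = interim_look(40, 200, 10, 200)
```

In practice the DMC weighs such boundaries alongside safety data, external evidence and trial conduct, rather than applying them mechanically.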
6.4 DMC Members
DMC members are selected by the principal investigator, project manager or appointed designee at the design stage of a surgical trial. These should include:
• Surgeons
• Physicians
• Statisticians
• Other scientists
• Lay representatives
These individuals are to be selected on their experience, reputation for objectivity, absence of conflicts of interest and knowledge of clinical trial methodology. Furthermore, a DMC chairperson who has a tenured
70
history of experience in clinical trials should be picked. Members can belong to the institution performing the trial, although the majority should be externally appointed [6]. Although there is currently no formal requirement to be trained to be a member of a DMC, this is now changing so that those selected on the committee have a taught comprehension of the idiosyncrasies of a trial. Thus, for example their understanding and management of serious, unexpected or unanticipated adverse effects during a trial can be well rehearsed. Training could take the form of study of previous trials, specialist courses in statistics or formal university courses aimed for DMC members. In order to familiarise the readers to some examples of the roles of DMCs, we have listed six examples where DMCs played a prominent role in a surgical trial (please see below). As statisticians have the duty of preparing and presenting interim analyses to the DMC, it has been argued that for an industry funded trial, these experts may have the potential to display bias in their presentation of results. This has led to the concept of having an independent external statistician preparing the interim analyses, thereby reducing inadvertent bias [8].
6.5 Conclusion

DMCs are now an integral part of any clinical trial. They are independent bodies that objectively review study protocols and results during a trial (interim reviews). They can also suggest and initiate any required changes to the study, including its termination. Furthermore, they uphold the quality of care given to patients and communicate their findings to the study sponsor and the study investigators. The appointment of surgeons as DMC members has until now been scarce. However, with the increasing number of surgical clinical trials, this has become a necessary responsibility for academic surgeons, and it will continue to grow in the future. Training for DMC membership is becoming increasingly common, and includes familiarity with the role of DMCs in previous trials. As a result, we conclude this chapter by describing six case studies where DMCs have played a prominent role in surgical trials.

Surgical case studies

1. DMC ending trial recruitment due to a difference in two surgical treatment arms.
H. Ashrafian et al.
2. DMC ending trial recruitment due to a difference in two pre-surgical treatment arms.
3. DMC ending trial recruitment due to a difference noted in a subgroup of subjects.
4. DMC ending trial recruitment as a result of subjects dropping out of a trial.
5. DMC requesting re-evaluation of trial subjects.
6. DMC altering a study protocol.
Case study 6.1 DMC ending trial recruitment due to a difference in two surgical treatment arms

Trial name: The Leicester randomised study of carotid angioplasty and stenting vs. carotid endarterectomy [9]
Null hypothesis/objective: Is carotid angioplasty (CA) a safer and more cost-effective alternative to carotid endarterectomy (CEA) in the management of symptomatic severe internal carotid artery (ICA) disease?
Trial methods: RCT comparing carotid angioplasty and stenting vs. carotid endarterectomy for patients with symptomatic severe ipsilateral ICA stenosis (70–99%) in a university teaching hospital
Treatment arms: Carotid angioplasty (CA) and optimal medical therapy vs. carotid endarterectomy (CEA) and optimal medical therapy
Follow-up: Patients were examined by a consultant neurologist 24 h after intervention, and any new neurological deficit was recorded. A stroke was defined as any new neurological deficit persisting for more than 24 h. The neurologist reassessed all patients at 30 days
Endpoints: Death, disabling or non-disabling stroke within 30 days
Results: All ten CEA operations proceeded without complication, but 5 of the 7 patients who underwent CA had a stroke (P = 0.0034), 3 of which were disabling at 30 days
Role of the DMC: After referral and review of these results, the Data Monitoring Committee invoked the stopping rule and the trial was suspended. Re-evaluation with the DMC, ethics and the trialists led to the study being restarted, primarily due to an issue of trial methodology as a result of problems associated with informed consent
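The P value quoted in the results row can be reproduced with Fisher's exact test on the 2 × 2 table of stroke vs. no stroke by treatment arm. A minimal self-contained sketch (the function name is ours, not from the trial report):

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]:
    sum the hypergeometric probabilities of every table with the same
    margins that is no more probable than the observed one."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(k):  # P(k events in row 1 | fixed margins)
        return comb(row1, k) * comb(n - row1, col1 - k) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    return sum(prob(k) for k in range(lo, hi + 1)
               if prob(k) <= p_obs * (1 + 1e-9))

# Leicester trial: 5 of 7 CA patients vs. 0 of 10 CEA patients had a stroke
print(round(fisher_exact_two_sided(5, 2, 0, 10), 4))  # → 0.0034
```

An exact test is the natural choice here because, with only 17 patients, the chi-squared approximation would be unreliable.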
6
Monitoring Trial Effects
Case study 6.2 DMC ending trial recruitment due to a difference in two pre-surgical treatment arms

Trial name: GeparDUO of the German Breast Group [10]
Null hypothesis/objective: Is the combination chemotherapy regimen ddAT (combined doxorubicin and docetaxel) capable of obtaining similar rates of pathologic complete remission (pCR) to the AC-DOC regimen (sequential doxorubicin/cyclophosphamide followed by docetaxel) in patients with primary operable breast cancer when used as a neoadjuvant treatment?
Trial methods: RCT enrolling pre-operative patients with primary operable breast cancer (T2–3 N0–2 M0) confirmed histologically by core or true-cut biopsy. Patients would receive either ddAT or AC-DOC and would then undergo breast surgery
Treatment arms: ddAT neoadjuvant chemotherapy vs. AC-DOC neoadjuvant chemotherapy
Follow-up: Breast surgery outpatient follow-up
Endpoints: The primary endpoint was defined as no microscopic evidence of viable tumour (invasive and non-invasive) in all resected breast specimens and axillary lymph nodes. Secondary aims were to determine disease-free survival, overall survival rates, and local tumour and lymph node response
Interim results: The first interim analysis of 208 patients showed a considerable difference in the rate of the primary endpoint (pCR) between treatment groups
Role of the DMC: The DMC recommended a second interim analysis for the primary endpoint. Based on this second analysis of 395 patients, the DMC recommended stopping enrolment into the study. Owing to this recommendation, only 913 of the 1,000 planned patients were included in this trial, following which recruitment was halted. Treatment was continued in the patients who were already enrolled in the trial

Case study 6.3 DMC ending trial recruitment due to a difference noted in a subgroup of subjects

Trial name: Family Health International (FHI) & EngenderHealth multicenter RCT evaluating fascial interposition (FI) as a component of vas occlusion [11]
Null hypothesis/objective: The hazard ratio (HR) for achieving the primary endpoint of azoospermia in patients undergoing vasectomy with FI vs. the non-FI group is 1.0
Trial methods: RCT enrolling healthy sexually active men 18 years and older who had chosen vasectomy for contraception. The trial was set up to compare two arms, namely ligation/excision with vs. without fascial interposition (FI), a technique in which a layer of the vas sheath is interposed between the cut ends of the vas
Treatment arms: Vasectomy with and without fascial interposition (FI)
Follow-up: Semen collections by blinded technicians, occurring every 4 weeks post-operatively until week 34 or until azoospermia. All participants were to be evaluated again 12 months after surgery
Endpoints: The primary endpoint was azoospermia in the first of two consecutive semen samples after surgery. Secondary outcomes included surgical difficulties and the occurrence of adverse events
Interim results: Data were analysed for 552 vasectomised men (276 in each technique group); the 414 of these who had completed at least 10 weeks of follow-up were reviewed. The overall HR was significant (HR = 1.54, P < 0.01). However, the estimate of the HR was significantly greater than 1.0 for the age […]

[…] a plasma cholesterol >5.69 mmol/L (220 mg/dL), or a plasma cholesterol >5.17 mmol/L (200 mg/dL) if LDL >3.62 mmol/L (140 mg/dL). These changes were implemented, which resulted in the required increase in patient inclusion into the study. As a result, the trial was successfully completed
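The cholesterol inclusion thresholds above mix SI and conventional units; the standard conversion for cholesterol (divide mg/dL by 38.67) reproduces the paired figures, as this small check shows:

```python
MG_DL_PER_MMOL_L = 38.67  # conversion factor for cholesterol

def to_mmol_per_l(mg_dl):
    """Convert a cholesterol concentration from mg/dL to mmol/L."""
    return round(mg_dl / MG_DL_PER_MMOL_L, 2)

print([to_mmol_per_l(v) for v in (220, 200, 140)])  # → [5.69, 5.17, 3.62]
```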
References

1. United States Food and Drug Administration (2001) Guidance for clinical trial sponsors: on the establishment and operation of Clinical Trial Data Monitoring Committees. Available at: http://www.fda.gov/Cber/gdlns/clindatmon.htm
2. Anon (1988) Organization, review, and administration of cooperative studies (Greenberg Report): a report from the Heart Special Project Committee to the National Advisory Heart Council, May 1967. Control Clin Trials 9:137–148
3. Anon (1973) The coronary drug project. Design, methods, and baseline results. Circulation 47:I1–50
4. Canner PL, Berge KG, Wenger NK et al (1986) Fifteen year mortality in Coronary Drug Project patients: long-term benefit with niacin. J Am Coll Cardiol 8:1245–1255
5. National Institutes of Health (NIH) (1998) NIH policy for data and safety monitoring. Available at: http://grants.nih.gov/grants/guide/notice-files/not98-084.html
6. National Cancer Institute (1999) Policy of the national cancer institute for data and safety monitoring of clinical trials. Available at: http://deainfo.nci.nih.gov/grantspolicies/datasafety.htm
7. National Cancer Institute (2006) Guidelines for monitoring of clinical trials for Cooperative Groups, CCOP research bases, and The Clinical Trials Support Unit (CTSU). Available at: http://ctep.cancer.gov/monitoring/2006_ctmb_guidelines.pdf
8. DeMets DL, Fleming TR (2004) The independent statistician for data monitoring committees. Stat Med 23:1513–1517
9. Naylor AR, Bolia A, Abbott RJ et al (1998) Randomized study of carotid angioplasty and stenting versus carotid endarterectomy: a stopped trial. J Vasc Surg 28:326–334
10. Jackisch C, von Minckwitz G, Eidtmann H et al (2002) Dose-dense biweekly doxorubicin/docetaxel versus sequential neoadjuvant chemotherapy with doxorubicin/cyclophosphamide/docetaxel in operable breast cancer: second interim analysis. Clin Breast Cancer 3:276–280
11. Chen-Mok M, Bangdiwala SI, Dominik R et al (2003) Termination of a randomized controlled trial of two vasectomy techniques. Control Clin Trials 24:78–84
12. Veronesi U, Maisonneuve P, Costa A et al (1998) Prevention of breast cancer with tamoxifen: preliminary findings from the Italian randomised trial among hysterectomised women. Italian Tamoxifen Prevention Study. Lancet 352:93–97
13. Anon (1994) Clinical advisory: carotid endarterectomy for patients with asymptomatic internal carotid artery stenosis. Stroke 25:2523–2524
14. Anon (1995) Carotid endarterectomy for patients with asymptomatic internal carotid artery stenosis. National Institute of Neurological Disorders and Stroke. J Neurol Sci 129:76–77
15. Buchwald H, Matts JP, Hansen BJ et al (1987) Program on surgical control of the hyperlipidemias (POSCH): recruitment experience. Control Clin Trials 8:94S–104S
7 How to Recruit Patients in Surgical Studies

Hutan Ashrafian, Simon Rowland, and Thanos Athanasiou
Contents

7.1 Introduction
7.2 Planning and Organisation
7.3 Timing
7.4 Patients' Point of View
7.5 The Recruitment Team
7.6 Recruitment Skills
7.7 Sources of Recruitment
7.8 Balance Between Inclusion/Exclusion Criteria
7.9 Prerequisites
7.10 Factors to Increase Participation
7.11 Factors to Ensure Continued Participation
7.12 Patient Subgroups
7.13 Practicalities
7.14 Conclusion
References
H. Ashrafian ()
The Department of Biosurgery and Surgical Technology, Imperial College London, 10th Floor, Queen Elizabeth the Queen Mother (QEQM) Building, Imperial College Healthcare NHS Trust, St Mary's Hospital Campus, London W2 1NY, UK
e-mail: [email protected]

Abstract The process of recruiting patients into any clinical study is fundamentally critical for the implementation, execution and completion of any project. Within this chapter, some of the salient points involved in patient recruitment will be identified and categorised so as to familiarise the reader with the necessary concepts required to recruit patients for a surgical controlled trial.
7.1 Introduction

The process of recruiting patients into any clinical study is fundamentally critical for the implementation, execution and completion of any project [10]. Simply put, if the study does not have the required number of patients to examine, then no adequate conclusion regarding outcomes and results can be attained [1, 4]. Completing this critical process for a randomised clinical trial costs more and consumes more time than any other aspect of the project. Furthermore, the actual task of recruitment can potentially be performed by almost any member of the surgical research team, and recruiters do not necessarily need formal medical or surgical qualifications. As a result, a vast array of organisational input is required in order to enrol patients successfully within a clinical study; this is usually addressed by a specific recruitment sub-committee. Within this chapter, some of the salient points involved in patient recruitment will be identified and categorised so as to familiarise the reader with the necessary concepts required to recruit patients for a surgical controlled trial.
T. Athanasiou (eds.), Key Topics in Surgical Research and Methodology, DOI:10.1007/978-3-540-71915-1_7, © Springer-Verlag Berlin Heidelberg 2010
7.2 Planning and Organisation

Many recruitment difficulties occur as a result of insufficient organisation and poor scheduling. Generally, an over-estimation of the recruitment numbers obtainable from one or two limited sources (such as relying heavily upon medical referrals alone) leads to poor yields [2]. Definitive recruitment strategies need to be implemented, and specific management of recruitment staff and subjects needs to be instigated, with database administration and the establishment of a monitoring system. Clear lines of organisational leadership need to be set up in the form of a recruitment team, and formal channels of communication should be established with the other groups of the research trial team.
7.3 Timing

In many surgical study protocols, a final date of recruitment is set so as to have complete follow-up data by the end of the study. If few patients are recruited within this period, subjects enrolled later would have an unsatisfactory follow-up period and would be unsuitable for inclusion in the data analysis. This could significantly detract from the number of subjects required by the power calculation. Furthermore, if too many patients are recruited in a very short time, then there is a risk that any unidentified problems (such as long-term surgical complications or cumulative drug toxicity) may present concurrently in a large number of individuals, overwhelming the trialists and preventing adequate time for remedial action. Too many patients entering at the same time can also lead to a deluge of new data that is difficult to process by a limited number of trialists [6]. Enrolment should, therefore, ideally occur at a constant rate to maintain study power and minimise uneven or excessive research effort during the follow-up period.
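A recruitment committee can check whether accrual is on track with a simple linear projection of the rate observed so far. This illustrative helper (the function names and example numbers are ours, not from the chapter) flags a shortfall early enough for remedial action:

```python
def projected_enrolment(enrolled, months_elapsed, months_planned):
    """Project final enrolment assuming the accrual rate seen so far
    continues unchanged for the rest of the recruitment window."""
    return enrolled / months_elapsed * months_planned

def on_track(enrolled, months_elapsed, months_planned, target):
    """True if the linear projection meets or exceeds the target."""
    return projected_enrolment(enrolled, months_elapsed, months_planned) >= target

# e.g. 120 patients after 8 of 24 planned months projects to 360,
# which falls short of a hypothetical 500-patient target
print(projected_enrolment(120, 8, 24))  # → 360.0
print(on_track(120, 8, 24, 500))        # → False
```

A constant-rate assumption is deliberately crude; in practice accrual often starts slowly, so reviewing the projection at several time points is more informative than a single check.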
7.4 Patients' Point of View

Many surgical projects in a randomised trial involve comparing surgery to a non-interventional treatment, or two procedures, one more invasive than the other. A typical example would be comparing an open procedure to a minimally invasive equivalent (laparoscopic or thoracoscopic); patients may thus be more inclined to opt for the less invasive option despite an unknown outcome. This may skew the subject demographic entering a trial, and it is up to the recruitment committee to try to ensure that all patients being approached are given a clear and concise background and explanation of why the trial is important, and why both arms need adequate case loads in order to achieve an interpretable result.

7.5 The Recruitment Team

Any healthcare professional who understands the research question and the proposed trial can be included in the recruitment team. This can typically involve senior surgeons, junior doctors, specialist nursing staff, physician assistants, specialist physiotherapists and many more. Some recruiters will be formally employed as part of the trial's research team, whereas others might take part out of interest, on an entirely voluntary basis. Either way, it is important for these individuals to have a clear understanding of the research aims and protocol, usually achieved through research briefings, pre-trial education of staff and continual peri-trial teaching sessions. Recruiters should be picked on grounds of positive attitude, merit, lasting loyalty and commitment to the project. Pressure of work and time commitments can decrease the momentum of any recruitment process; it is therefore important for recruiters to feel a sense of collective ownership of the project, and regular contact with other recruiters and the research team at set meeting points will reinforce the need for recruitment and sustain encouragement for the project.

7.6 Recruitment Skills

The skills required in the recruitment process are broad, but recruiters must be good communicators and totally unbiased in their approach to selecting and enrolling patients into each study. Typically, the best recruiters have good interpersonal skills and empathy with their patients. They need to describe to
each potential trial subject the background of the research, the importance of the research question and the implications for those taking part. As a result, recruiters need to have a good understanding of the underlying science of the study, and this understanding is normally assessed and approved by the trial organising committee. Importantly, and as with all members of a trial team, recruiters need to demonstrate good team skills, and should also be supplied with the space and facilities with which to recruit and contact patients. There is therefore a basic requirement for adequate office space, telephone, electronic mail and internet access.
7.7 Sources of Recruitment

The initial selection of candidates for recruitment can be largely multifaceted, though it requires total anonymity and professional conduct at all times. Sources can include patient medical records or direct recruitment of subjects from inpatient wards and outpatient clinics. Furthermore, once a research trial has been established, publicity to local primary care units and regional centres will ensure that some patients will potentially be referred to the trial directly by physicians in the community or from other referring units and hospitals. Other sources include patient registries, either set up by the clinical trial in question or by a similar clinical trial. These registries are large databases containing lists of active clinical trials or of potential participants. Depending on the nature of the subjects required, national and even international screening for patients can occur via the media (television, radio, internet) and direct mail, wherein patients are invited to come and take part in a trial; for example, in The Italian Breast Cancer Prevention Trial with Tamoxifen, television advertisements by the organising trial committee invited patients to take part in the study [11].
7.8 Balance Between Inclusion/Exclusion Criteria

The number of subjects who finally take part in a clinical trial is drawn from a baseline pool of candidates, who are then filtered on the basis of inclusion and exclusion criteria (Fig. 7.1). Thus, typically, many studies initially over-estimate
Fig. 7.1 Flow chart for a hypothetical recruitment process, demonstrating a decrease in numbers at each stage of selection based on inclusion criteria, exclusion criteria and patient compliance: asked to take part (n = 1,000) → unwilling to take part (n = 250) → those that fit the inclusion criteria (n = 750) → exclusion criteria (n = 250) → total subjects that start the study (n = 500) → in-study dropout (n = 100) → total number of subjects that complete the study (n = 400)
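The attrition in Fig. 7.1 can be run in reverse at the planning stage to estimate how many subjects must be approached for a given number of completers. A sketch using the stage retention fractions implied by the figure (the helper name is ours):

```python
from math import ceil, prod

def required_approaches(target_completers, stage_fractions):
    """Number of subjects to approach so that, after the expected
    retention fraction at each stage, enough complete the study."""
    raw = target_completers / prod(stage_fractions)
    return ceil(round(raw, 9))  # round first to guard against float noise

# Fig. 7.1 implies: 750/1000 willing to take part, 500/750 pass the
# exclusion criteria, 400/500 complete the study
print(required_approaches(400, [750 / 1000, 500 / 750, 400 / 500]))  # → 1000
```

The multiplicative structure is the point: under-estimating any single stage's dropout propagates through the whole funnel, which is why studies that plan from a single optimistic source so often fall short.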
the likely number of patients in a study, as it is typically overlooked that patient numbers drop at each stage of inclusion or exclusion. Furthermore, as any study continues, patients may drop out because of lack of commitment or in-trial exclusion criteria.
7.9 Prerequisites

Once a trial is set up and the recruiters selected, a number of prerequisites must be met before any study subjects are selected (Fig. 7.2). To begin, there is a compulsory collective responsibility to ensure that the study has been approved by the data monitoring committee and by local and, if necessary, national ethics groups. Each recruiter needs to be acutely aware of the nature of the research and the ethical implications of placing subjects within the sphere of the trial. This is important as it ensures that all recruiters know the nature of the trial intimately, including all the pros and cons for the subjects taking part. Furthermore, it allows all recruiters to describe accurately the implications of taking part in the research project. This in turn equips the recruiters to fulfil another important aspect of their role, which is to obtain informed consent from their subjects. It is also important that recruiters are in a position to answer, in as many ways as possible, the questions that may arise during the recruitment process. As a result, it is necessary that all recruiters are adequately trained and, increasingly, have undergone formal training and debriefing by the trial organisers. Recruiters need to be familiar with a variety of methods of conveying information about the trial to subjects, which can include one-to-one meetings and the supply of contact details and information sheets for patients.
7.10 Factors to Increase Participation

There are a number of contributing factors that can ensure successful and plentiful recruitment of subjects for each trial (Fig. 7.2). Large numbers of recruiters allow a wide net of recruitment to be cast through the exertion of numerous man-hours, and one way in which this can be achieved is by employing a multitude of multi-disciplinary recruiters.
This will not only strengthen the research process, but will also give a wider sense of collective ownership to researchers in general, and will ultimately allow an increased integration of resources and possibly a wider dissemination of results on completion. As with any research project, enthusiastic team members lead to dynamism in effort and activity, allowing adequate numbers of subjects to be approached and investigated for admittance into a trial. If a variety of recruitment modalities is utilised (such as telephone, email or in person), patient numbers can be increased geometrically as opposed to arithmetically. These methods can be augmented by wider public advertising, such as presentation of the research and recruitment needs in the public media (radio, television etc.); indeed, general awareness of the project is important, directed not only at patients but also at the medical world. Thus, the onus lies with the recruiters and the research team to broadcast and publicise the need for recruitment, and to converse with medical colleagues at all levels – local, national and international. This can be done through meetings, lectures, conferences and even word of mouth, and will eventually result in the broadest possible referral base from which subjects can be recruited. Subjects being recruited need to be reassured that they will come to no adverse consequence if they do not participate in the study, but should be encouraged to participate on grounds of personal interest and patient bonhomie. Occasionally, a trial offers a totally new treatment to a patient with an otherwise untreatable condition, and the offer of a trial may then be of direct medical benefit to that subject. Furthermore, including an "opt-out" [7] option in the study will allow patients who drop out of treatment during a trial to remain eligible for follow-up, so that their participation is not a total loss for the study.
7.11 Factors to Ensure Continued Participation

As alluded to above, once subjects have been entered into a trial, it is still highly important to ensure that they remain within it; otherwise, the entire effort placed into acquiring their data may become unusable or lost. This will demand extra effort and time, and may even mean the failure of the project,
Fig. 7.2 Summary of the contributing factors in the successful recruitment of patients for a clinical trial:
• Prerequisites: local ethics approval; informed consent; data monitoring committee (DMC) approval; funding; legal indemnity; adequately trained recruiters; information sheets for patients and for medical professionals; researcher education.
• Factors to increase participation: enthusiastic researchers; academic gain; no adverse consequences of non-participation; multimodal recruitment (telephone, e-mail, in person); multilocation recruitment (clinic, ward); large number of recruiters; multidisciplinary recruitment (nurses, doctors, physiotherapists, physician assistants, other trained professionals); large number of recruiting centres (national, international); increased patient interest in the research study (personal medical benefit, patient bonhomie); advertising (medical, general public); increased awareness of the trial among medical colleagues and patients (lectures, conferences, word of mouth); "opt-out" availability (i.e. non-responders can be followed up with further communication).
• Factors to ensure continued participation: trial location easy to reach; pleasant staff; professional-looking site; general care and cordiality; reimbursement for travel expenses; suitably qualified staff on days of study; good timekeeping on days of study; payment incentives (within ethical remit); out-of-hours and daytime contact details in case of participant problems or queries.
and thus, it is up to the recruiters and the research team as a whole to try to optimise the number of patients who complete the project and its follow-up (Fig. 7.2). Many of these issues are considered common sense, though they have been recurring sources of trial delay or failure. Trial sites should be easy to reach and welcoming, with staff cordiality encouraged at all times. Ideally, subjects would be paid for their travel, and adequately trained and informed staff will be on hand at all times to discuss any queries that subjects
may have. Some sources actually advocate the somewhat controversial idea that trialists and recruiters should receive "payment incentives" [3] to ensure the smooth running of a study; nevertheless, however it is achieved, a standard of commitment and professionalism needs to be attained for subjects to continue returning to participate in a trial. Furthermore, this dedication needs to be taken a step further: at least one member of the recruitment team should be available at all times to address out-of-hours and holiday-time queries from subjects, which some consider a type of "on-call" recruitment rota, so as to provide patients with a complete umbrella of care during their time with the trial research group.
7.12 Patient Subgroups

Recruiters should be aware of the variation in recruiting people of different backgrounds, ages, genders, socioeconomic status and diseases [5, 8]. For example, studies of patients with HIV/AIDS [9] need to be seen to be conducted confidentially in order to reassure subjects in a trial. Some units find it notoriously difficult to recruit from racial minorities and require extensive campaigns to communicate with these populations. Furthermore, a number of studies reveal that elderly populations prefer person-to-person contact over written contact. As a result, each recruitment process needs to account for the special needs of each patient subgroup, and the methods used to encourage patients to enter these trials need to be tailor-made for the subjects in question.

7.13 Practicalities

Consideration of the practicalities of trial recruitment is fundamentally important for successful execution. For example, it is essential to recognise and develop aspects that can increase subject comfort and satisfaction. A good rapport with subjects can be very helpful and may increase their willingness to return for follow-up sessions. Furthermore, it may allow for increased openness in the discussion of symptoms that may be vital for the documentation and analysis of study results. Salient issues include cordiality to all subjects at all times. A welcoming environment needs to be established, and hospitality to all subjects needs to be universal. This includes the reimbursement of travel expenses and the provision of refreshments and reading material if patients are to wait for long periods. An environment of open communication should be encouraged, allowing patients to feel comfortable in expressing any reservations they have about any aspect of the trial. Such feedback is vital and should be used to improve trial hospitality and practice where possible.
7.14 Conclusion

The successful recruitment of subjects for a surgical trial involves adequate planning, effective teamwork and a multi-modal approach to selection. Specifically trained recruiters should be equipped with a broad knowledge base and the skills with which to effectively inform and select patients. Subjects can be recruited from a large pool of patient groups within a variety of healthcare systems, but recruitment should be targeted to the specific research area of the trial. Recruitment is fundamental to the ultimate completion of a trial and, if successful, can greatly strengthen the quality of the data eventually collected.
References

1. Allen PA, Waters WE (1982) Development of an ethical committee and its effect on research design. Lancet 1:1233–1236
2. Bell-Syer SE, Moffett JA (2000) Recruiting patients to randomized trials in primary care: principles and case study. Fam Pract 17:187–191
3. Bryant J, Powell J (2005) Payment to healthcare professionals for patient recruitment to trials: a systematic review. BMJ 331:1377–1378
4. Charlson ME, Horwitz RI (1984) Applying results of randomised trials to clinical practice: impact of losses before randomisation. Br Med J (Clin Res Ed) 289:1281–1284
5. Clark MA, Neighbors CJ, Wasserman MR et al (2007) Strategies and cost of recruitment of middle-aged and older unmarried women in a cancer screening study. Cancer Epidemiol Biomarkers Prev 16:2605–2614
6. Johnson L, Ellis P, Bliss JM (2005) Fast recruiting clinical trials – a utopian dream or logistical nightmare? Br J Cancer 92:1679–1683
7. Junghans C, Feder G, Hemingway H et al (2005) Recruiting patients to medical research: double blind randomised trial of "opt-in" versus "opt-out" strategies. BMJ 331:940
8. Keyzer JF, Melnikow J, Kuppermann M et al (2005) Recruitment strategies for minority participation: challenges and cost lessons from the POWER interview. Ethn Dis 15:395–406
9. King WD, Defreitas D, Smith K et al (2007) Attitudes and perceptions of AIDS clinical trials group site coordinators on HIV clinical trial recruitment and retention: a descriptive study. AIDS Patient Care STDS 21:551–563
10. Mapstone J, Elbourne D, Roberts I (2007) Strategies to improve recruitment to research studies. Cochrane Database Syst Rev (2):MR000013
11. Veronesi U, Maisonneuve P, Costa A et al (1998) Prevention of breast cancer with tamoxifen: preliminary findings from the Italian randomised trial among hysterectomised women. Italian Tamoxifen Prevention Study. Lancet 352:93–97
8
Diagnostic Tests and Diagnostic Accuracy in Surgery Catherine M. Jones, Lord Ara Darzi, and Thanos Athanasiou
Contents

8.1 Introduction
8.2 What Is a Diagnostic Test?
  8.2.1 Criteria for a Useful Diagnostic Test
  8.2.2 Choosing Diagnostic Endpoints
  8.2.3 Diagnostic Test Data
8.3 Quality Analysis of Diagnostic Studies
  8.3.1 Reporting in Diagnostic Tests
  8.3.2 Sources of Bias in Diagnostic Studies
8.4 Estimates of Diagnostic Accuracy
  8.4.1 True Disease States
  8.4.2 Sensitivity, Specificity and Predictive Values
  8.4.3 Likelihood Ratios
  8.4.4 Diagnostic Odds Ratio
  8.4.5 Confidence Intervals and Measures of Variance
  8.4.6 Concluding Remarks About Estimates of Diagnostic Accuracy
8.5 Receiver Operating Characteristic Analysis
  8.5.1 The ROC Curve
  8.5.2 Area Under the ROC Curve
8.6 Combining Studies: Diagnostic Meta-Analysis
  8.6.1 Goals and Guidelines
  8.6.2 Heterogeneity Assessment
  8.6.3 Diagnostic Meta-Analysis Techniques
8.7 Conclusions
References
List of Useful Websites

C. M. Jones (✉)
The Department of Biosurgery and Surgical Technology, Imperial College London, 10th Floor, Queen Elizabeth the Queen Mother (QEQM) Building, Imperial College Healthcare NHS Trust, St Mary's Hospital Campus, Praed Street, London W2 1NY, UK
e-mail: [email protected]

Abbreviations

AUC      Area under the curve
DOR      Diagnostic odds ratio
FN       False negative
FP       False positive
FPR      False positive rate
HSROC    Hierarchical summary receiver operating characteristic
LR       Likelihood ratio
NPV      Negative predictive value
PPV      Positive predictive value
QUADAS   Quality assessment of diagnostic accuracy studies
RDOR     Relative diagnostic odds ratio
ROC      Receiver operating characteristic
SROC     Summary receiver operating characteristic
STARD    Standards for reporting of diagnostic accuracy
TN       True negative
TP       True positive
TPR      True positive rate
Abstract  This chapter outlines the principles of designing, performing and interpreting high-quality studies of diagnostic test accuracy. The basic concepts of diagnostic test accuracy, including sensitivity, specificity, diagnostic odds ratio (DOR), likelihood ratios (LRs) and predictive values, are explained. Sources of bias in diagnostic test accuracy studies are explained with practical examples. The graphical ways to represent the accuracy of a diagnostic test are demonstrated. Issues to consider when interpreting the quality of a published study of test accuracy are outlined using guidelines advocated in the medical literature. Finally, the principles of diagnostic meta-analysis, including bivariate and hierarchical summary receiver operating characteristic methods, are explained.
8.1 Introduction

Accurate diagnosis is the first step towards effective treatment of a disease. In surgical practice, prudent use of appropriate tests to reach an accurate, fast diagnosis is especially important in maintaining a high quality of patient care. From a general perspective, any investigation that provides information to direct management is a form of diagnostic test. Screening programs and prognostic indicators can be thought of as special cases of diagnostic tests. The need for diagnostic test guidelines is increasing across medicine as the medical literature expands yearly.

The cost of diagnostic tests contributes significantly to healthcare costs and must be a practical consideration when choosing a test. The cost of a test includes not only the immediate expense of the test materials and staff time, but also the subsequent costs of further tests performed on the basis of equivocal or inaccurate results. Decisions on large-scale screening and other diagnostic test policies must take costs into account, but the clinical benefit to the target patient population remains paramount.

Summary measures of diagnostic test performance are often quoted in the literature, and an understanding of these is important for clinicians. This chapter explains the different methods for quantifying the accuracy of a diagnostic test and aids in understanding the principles of diagnostic research.
8.2 What Is a Diagnostic Test?

In its broadest context, a medical diagnostic test is any discriminating question that, once answered, provides information about the status of the patient. While this classically means diagnosis of a medical condition, any outcome of interest is potentially "diagnosable", including future clinical events. This chapter deals with the principles of medical diagnosis; although prognostic analysis uses statistical methodology similar to that discussed here, it is covered elsewhere in this book. Diagnostic tests in surgical practice include history and examination findings, laboratory investigations, radiological imaging, clinical scores derived from questionnaires and operative findings. The underlying principles of use and applicability apply equally to all of these.
8.2.1 Criteria for a Useful Diagnostic Test

A diagnostic test should only be performed if there is an expectation of benefit for the patient. The individual situation of each patient should be considered before choosing a diagnostic pathway that maximises benefit and minimises risk. Pepe [1] summarises the criteria for a useful diagnostic test: the disease should be serious enough to warrant investigation, have reasonable prevalence in the target population and change the treatment plan if present; the test should accurately discriminate diseased from non-diseased patients; and the risks of performing the test should not outweigh the diagnostic benefits. In practice, the benefit-to-risk assessment is based on patient factors, test availability, test performance in the medical literature and the experience and personal preference of the managing clinician.

Choosing a diagnostic test requires knowledge of its performance across different patient populations. There is no benefit in applying a test to a population in whom there is no evidence of diagnostic accuracy. Clinical experience and the medical literature provide information on reliable interpretation of results.

In summary, a diagnostic test is indicated if the suspected disease warrants treatment or monitoring, or significantly changes the prognosis. The test result should change patient care, according to patient and clinician preferences. Finally, the accuracy of the test for the desired diagnosis, in that clinical setting, should be high enough to warrant performing the test.
8.2.2 Choosing Diagnostic Endpoints

In some circumstances, a diagnostic test will be performed solely to exclude (or confirm) a medical condition; in these cases, the most appropriate test is one that reliably excludes (or confirms) its presence. In other circumstances, confirmation and exclusion of disease are equally important. The decision to perform a test will, therefore, depend not only on its performance in the target population, but also on the endpoint(s) of interest. Several examples of targeted tests are given in Table 8.1.

In particular, screening tests are targeted towards exclusion of disease, as they are applied on a grand scale to detect occult disease. If positive, the screening test is often followed by other investigations with higher overall accuracy. The requirements for a screening test to be safe, inexpensive and convenient mean that the accepted levels of diagnostic accuracy can be less than ideal.

Table 8.1 Examples of targeted diagnostic tests
Test – Targeted outcome
Blood product screening – Exclusion of transmittable disease
Operative cholangiogram – Exclusion of common duct stone
Carotid angiogram – Confirmation of artery stenosis seen on ultrasound, prior to surgery
Preoperative chest X-ray – Exclusion of major pulmonary disease
Cervical smear testing – Exclusion of cervical cell atypia
8.2.3 Diagnostic Test Data

Test results can be expressed in a variety of ways, depending on the method of testing and the clinical question. Table 8.2 provides a summary of the types of data generated by diagnostic tests, with common examples.

Table 8.2 Types and examples of result variables
Dichotomous – Viral testing, bacterial cultures
Ordinal – V/Q scanning, Gleason score, Glasgow Coma Scale
Categorical – Genotype testing, personality disorders
Continuous – Biochemical tests, height, weight

A dichotomous result is one in which there is a "yes/no" answer: the desired outcome is either present or it is not. An example of this is microbiological testing for a virus; the patient is either positive or negative for the virus. This is the simplest result to interpret.

An ordinal result is one in which there are a number of possible outcomes in an ordered sequence of probability or severity, but with no quantitative relationship between the outcomes. An example of this is lung ventilation/perfusion scanning for pulmonary embolus: the result may be low, intermediate or high probability of pulmonary embolus. Another example of an ordinal test result is the Gleason histopathology score for prostate cancer, which influences prognosis and treatment. The result of an ordinal test should be interpreted in accordance with the available literature before subsequent decisions are made about the need for further investigative tests or treatments.

A categorical variable has a number of possible outcomes that are not ordered in terms of severity or probability. An example of this is genetic testing, which may have two or more possible results for a gene code in an individual. There is no implicit order in the outcomes per se. Interpretation of this type of test requires knowledge of the implications of each possible outcome.

Many investigations yield a numerical result lying on a continuous spectrum. Examples include body weight, serum creatinine, body mass index and blood haemoglobin. A test that gives a continuous variable requires a pre-determined threshold for disease in order for the result to be meaningful. For example, one pre-determined value may be the threshold for anaemia requiring blood transfusion, while a different value may be the threshold for iron supplementation. Another example is prostate-specific antigen in screening for prostate cancer.

In all diagnostic tests, a threshold for changing the patient's management is required. For a dichotomous result, this threshold is incorporated within the test to produce a positive or negative result. Ordinal, categorical and continuous tests require a suitable threshold to be chosen, which will depend on the patient, clinician and local guidelines. The diagnostic accuracy of the test may depend on the chosen threshold, as explained later in the chapter.
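The thresholding of a continuous result described above can be sketched in a few lines. This is a hypothetical illustration: the haemoglobin values and the 8.0 g/dL cut-off are invented for the example and are not clinical guidance.

```python
# Hypothetical sketch: dichotomising a continuous test result against a
# pre-determined threshold. The 8.0 g/dL transfusion threshold and the
# haemoglobin values are invented for illustration only.

def classify(value, threshold, positive_below=True):
    """Return 'positive' or 'negative' for a continuous result,
    given a pre-determined disease threshold."""
    if positive_below:
        return "positive" if value < threshold else "negative"
    return "positive" if value >= threshold else "negative"

# Serum haemoglobin (g/dL): 'positive' means the anaemia threshold
# for transfusion has been crossed.
results = [6.9, 8.4, 11.2]
labels = [classify(hb, threshold=8.0) for hb in results]
# labels -> ['positive', 'negative', 'negative']
```

Changing the threshold changes which patients are called positive, which is why the diagnostic accuracy of a continuous test depends on the chosen cut-off.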
8.3 Quality Analysis of Diagnostic Studies

Studies of diagnostic test accuracy compare one or more index tests against a "gold standard" reference test in the same population. These studies provide guidance for clinical practice and impetus for the development of new diagnostic technologies and applications. However, the publication of exaggerated results from poorly designed, implemented or reported studies may cause suboptimal application and interpretation of the diagnostic test. The importance of accurate and thorough reporting is now widely accepted.
8.3.1 Reporting in Diagnostic Tests

Published guidelines are designed to promote the quality of diagnostic test studies, focussing on accurate and thorough reporting of study design, patient population, inclusion and exclusion criteria and follow-up [2]. The STARD initiative (Table 8.3) was published in several leading journals in 2003 and resulted from collaboration between journal editors and epidemiologists. It presents a checklist of 25 items for authors and reviewers to promote acceptable standards in study design, conduct and analysis. STARD is now an accepted guideline for authors and reviewers of single articles of diagnostic accuracy.

In 2003, the QUADAS tool for quality assessment of studies included in diagnostic test systematic reviews was also published (Table 8.4). This tool consists of fourteen items, each of which should be used in diagnostic meta-analyses to evaluate the impact of study reporting on the overall accuracy results. The role of QUADAS in meta-analysis is discussed further in a later section of this chapter.
8.3.2 Sources of Bias in Diagnostic Studies

As with other types of studies, diagnostic research is vulnerable to bias in study design, implementation, reporting and analysis. Bias refers to differences between the observed and true (unknown) values of the study endpoints. Generally, bias cannot be measured directly, but robust methodology and endeavours to minimise known sources of bias increase the credibility of observed results. Biased results may occur in studies where there are flaws in study design. Begg and McNeil [3] provide a thorough overview of the sources of bias in diagnostic tests. A summary of the more common and important types is provided here.
8.3.2.1 Reference Standard Bias

The choice of reference test is central to reaching robust conclusions about the usefulness of a diagnostic test. The reference test should correctly classify the target condition, and is ideally the definitive investigation. In surgical practice, this may represent surgical intervention to achieve histology or visual inspection of a tumour. Alternatively, the reference test may incorporate imaging, biochemistry or other testing. An inaccurate reference test leads to misclassification of true disease status and makes assessment of the index test inaccurate (reference standard error bias).

The entire study population, or a randomly selected subgroup, should undergo the same index and reference tests. Partial verification bias, also known as workup bias, occurs when a non-random selection of the study population undergoes both the index and reference tests. If the results of the index test influence the decision to perform the reference standard, bias arises. It occurs mostly in cohort studies where the index test is performed before the reference standard. Differential verification bias occurs when the reference standard changes with the index test result. This is particularly common in surgical practice, wherein the result of a laboratory test may determine whether the patient undergoes surgery or is treated conservatively. Finally, incorporation bias occurs when the index test result contributes to the final reference test result, making the concordance rate between the two tests artificially high.

The diagnostic and reference tests should be performed within a sufficiently short time period to prevent significant disease progression (disease progression bias). Within a diagnostic study, the decision to perform both tests should be made prior to either test being performed. The method used to perform the index and reference tests should be clearly described to allow reproducibility elsewhere.
8.3.2.2 Investigator-Related Factors in Bias

In studies where one test is interpreted with awareness of the other test result (review bias), there is a tendency to over-estimate the accuracy. For example, equivocal liver lesions on ultrasound may be diagnosed as metastatic lesions if a recent CT of the abdomen mentions liver metastases. This may be acceptable in clinical practice, but is undesirable in impartial assessment of the accuracy of a diagnostic test. Many diagnostic tests require no subjective assessment and give dichotomous results; in such cases, the people performing the tests need only record the outcome. Biochemical and bacterial culture testing are examples of such tests. Clinical review bias is a similar concept. When there is a subjective or interpretive component to the test result, clinical information may contribute to the final diagnosis. If this occurs, it should be made clear in the methodology so that subsequent adoption of the technique elsewhere achieves similar accuracy. An example of this is knowledge of the patient's age when assessing bony lesions on plain radiographs, as age is highly discriminating.

Equivocal or unavailable results from either test pose a dilemma for study coordinators. If the unavailable results are excluded from analysis, biased results can occur if the remaining study population differs significantly from the initial group. Whether bias arises depends on the reasons for non-interpretable results.

8.3.2.3 Population Factors in Bias

Spectrum bias refers to the lack of generalisability of results when differences in patient demographics or clinical features lead to differences in diagnostic test accuracy. Reported test accuracy may not be applicable to patient populations that differ from the original study population in severity of disease, co-morbidities or specific demographics. Relevant patient characteristics should, therefore, be clearly reported. Control groups of "normal" patients tend to be healthier than the average non-diseased subject, and known cases tend to have severe disease; this overestimates test accuracy, as extremes of health are easier to detect. Patients should, therefore, be selected to represent the target population, which should be clearly described. Patients of special interest can be chosen to demonstrate the test accuracy in a particular subgroup, but this should be clearly explained in the paper to prevent erroneous application to other populations. The selection criteria for inclusion in and exclusion from the study should be clearly stated. Similarly, excluding patients with diseases known to adversely affect the diagnostic performance of the test leads to limited challenge bias, as in the example of patients with chronic obstructive pulmonary disease being excluded from a study of ventilation/perfusion scintillation scanning.

Table 8.3 The STARD checklist (reproduced)

Title/abstract/keywords
1. Identify the article as a study of diagnostic accuracy (recommend MeSH heading "sensitivity and specificity")

Introduction
2. State the research question or study aims, such as estimating diagnostic accuracy or comparing accuracy between tests or across participant groups

Methods – describe:
Participants
3. Study population: the inclusion and exclusion criteria, setting and locations where the data were collected
4. Participant recruitment: was recruitment based on presenting symptoms, results from previous tests, or the fact that the participants had received the index tests or the reference standard?
5. Participant sampling: was the study population a consecutive series of participants defined by the selection criteria in items 3 and 4? If not, specify how participants were further selected
6. Data collection: was data collection planned before the index test and reference standard were performed (prospective study) or after (retrospective study)?

Test methods
7. The reference standard and its rationale
8. Technical specifications of material and methods involved, including how and when measurements were taken, and/or cite references for index tests and reference standard
9. Definition and rationale for the units, cut-offs and/or categories of the results of the index tests and the reference standard
10. The number, training and expertise of the people executing and reading the index tests and the reference standard
11. Whether or not the readers of the index tests and reference standard were blind (masked) to the results of the other test, and describe any other clinical information available to the readers

Statistical methods
12. Methods for calculating or comparing measures of diagnostic accuracy, and the statistical measures used to quantify uncertainty (e.g. 95% confidence intervals)
13. Methods for calculating test reproducibility, if done

Results – report:
Participants
14. When the study was done, including beginning and end dates of recruitment
15. Clinical and demographic characteristics of the study population (e.g. age, sex, spectrum of presenting symptoms, co-morbidity, current treatments, recruitment centres)
16. The number of participants satisfying the criteria for inclusion that did or did not undergo the index tests and/or reference standard; describe why participants failed to receive either test (flow chart recommended)
17. Time interval from the index tests to the reference standard, and any treatment administered in between

Test results
18. Distribution of severity of disease (define criteria) in those with the target condition; other diagnoses in those without the target condition
19. A cross tabulation of the results of the index test (including indeterminate and missing results) by the results of the reference standard; for continuous results, the distribution of the test results by the results of the reference standard
20. Any adverse events from performing the index tests or the reference standard

Estimates
21. Estimates of diagnostic accuracy and measures of statistical uncertainty (e.g. 95% confidence intervals)
22. How indeterminate results, missing responses and outliers of the index tests were handled
23. Estimates of variability of diagnostic accuracy between subgroups of participants, readers or centres, if done
24. Estimates of test reproducibility, if done

Discussion
25. Discuss the clinical applicability of the study findings

Table 8.4 QUADAS tool for diagnostic meta-analysis quality assessment (reproduced); each item is answered "yes", "no" or "unclear"
1. Was the spectrum of patients representative of the patients who will receive the test in practice?
2. Were selection criteria clearly described?
3. Is the reference standard likely to correctly classify the target condition?
4. Is the time period between index test and reference standard short enough to be reasonably sure that the target condition did not change between the two tests?
5. Did the whole sample, or a random selection of the sample, receive verification using a reference standard of diagnosis?
6. Did patients receive the same reference standard regardless of the index test result?
7. Was the reference standard independent of the index test (i.e. the index test did not form part of the reference standard)?
8. Was the execution of the index test described in sufficient detail to permit replication of the test?
9. Was the execution of the reference standard described in sufficient detail to permit its replication?
10. Were the index test results interpreted without knowledge of the results of the reference standard?
11. Were the reference standard results interpreted without knowledge of the results of the index test?
12. Were the same clinical data available when test results were interpreted as would be available when the test is used in practice?
13. Were uninterpretable/intermediate test results reported?
14. Were withdrawals from the study explained?
8.4 Estimates of Diagnostic Accuracy

8.4.1 True Disease States

For each patient, the reference test is assumed to indicate true disease status. The index test result is classified as true positive (TP), true negative (TN), false positive (FP) or false negative (FN), depending on the correlation with the reference standard. True disease status is a binary outcome: the patient either has the disease or does not.

The disease prevalence within the population will also impact the overall test accuracy. For example, if the prevalence of the disease is very high, a test which always gives a positive result will be correct in the majority of cases, but is clearly useless as a diagnostic test. If a spectrum of disease is being evaluated, a threshold for positivity is selected. For example, carotid artery stenosis may be considered significant at 70% luminal narrowing, with milder degrees of narrowing classified as a negative disease status.

TP, TN, FP and FN outcomes are classified according to Fig. 8.1. "True" outcomes on the index test are those which agree with the reference result for that subject. "False" outcomes disagree with the reference result and are considered inaccurate. An ideal diagnostic test produces no false outcomes.

Fig. 8.1 Graphical representation of the correlation between index and reference test results:

                        Reference test
                        +          −
Index test   +          TP         FP         TP+FP
             −          FN         TN         FN+TN
                        TP+FN      FP+TN      TP+FN+FP+TN

Fig. 8.2 Sensitivity and specificity calculations, using the same 2 × 2 table: sensitivity = TP/(TP + FN), specificity = TN/(TN + FP).
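The classification rule above can be sketched as a simple counting routine over paired results. The paired index/reference results below are invented for illustration.

```python
# Sketch of the classification in Fig. 8.1: each index test result is
# compared with the reference ("true") result for the same patient.
# The paired results are invented for illustration; True = positive.

def two_by_two(index_results, reference_results):
    """Count TP, FP, FN and TN from paired binary test results."""
    tp = fp = fn = tn = 0
    for index, ref in zip(index_results, reference_results):
        if index and ref:
            tp += 1          # index agrees with a positive reference
        elif index and not ref:
            fp += 1          # index positive, reference negative
        elif not index and ref:
            fn += 1          # index negative, reference positive
        else:
            tn += 1          # index agrees with a negative reference
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

index = [True, True, False, False, True, False]
reference = [True, False, True, False, True, False]
counts = two_by_two(index, reference)
# counts -> {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 2}
```

An ideal test would yield FP = FN = 0, matching the statement that an ideal diagnostic test produces no false outcomes.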
8.4.2 Sensitivity, Specificity and Predictive Values

Sensitivity, specificity, negative predictive value (NPV) and positive predictive value (PPV) are commonly encountered in clinical practice. Despite their frequent usage, the terms are often difficult to conceptualise and apply. This section aims to give an understanding of, and a statistical basis for, these terms.

8.4.2.1 Sensitivity and Specificity

Sensitivity measures the ability of a test to identify diseased patients. A test with 75% sensitivity identifies 75 out of every 100 patients who have a positive result on the reference test. Sensitivity gives no indication of the test performance in healthy patients. Statistically, sensitivity is the ratio of TPs to all positive results on the reference test (TP + FN); the numbers in the left column of Fig. 8.2 are used in calculating sensitivity.

Specificity measures how well a test identifies healthy patients. It is the proportion of healthy people who have a negative test result. As with sensitivity, the "true" status is determined by the reference standard. Specificity is complementary to sensitivity, as it gives information only about healthy subjects. Statistically, specificity is the ratio of TNs to all negative results on the reference test (FP + TN) (Fig. 8.2).

Sensitivity is also known as the true positive rate (TPR), whilst (1 − specificity) is known as the false positive rate (FPR). These terms are used in receiver operating characteristic (ROC) analysis, as discussed later in the chapter. Sensitivity and specificity are useful when deciding whether to perform the test. Depending on the clinical scenario, high sensitivity or high specificity may be more important, and the best test can be chosen for the clinical situation.

8.4.2.2 Positive and Negative Predictive Values

PPV and NPV are also used to measure the usefulness of a diagnostic test. They assess the probability of true disease status once the test result is known. For example, a PPV of 80% indicates that 80% of patients with a positive test result actually have the disease. An NPV of 40% indicates that only 40% of patients testing negative are truly healthy. PPV and NPV are particularly useful in interpreting a known test result. Statistically, PPV is the ratio of TPs to all positive test results (TP + FP), and NPV is the ratio of TNs to all negative test results (TN + FN) (Fig. 8.3). The ratios are calculated horizontally across the 2 × 2 table.

Fig. 8.3 Predictive value calculations, across the rows of the 2 × 2 table in Fig. 8.1: PPV = TP/(TP + FP), NPV = TN/(TN + FN).

8.4.2.3 Comparison of Terms

Sensitivity and specificity measure the inherent accuracy of the index test compared with the reference test. They are used to compare different tests for the same disease and population, and help clinicians choose the most appropriate test. On the other hand, PPV and NPV are measures of clinical accuracy and provide the probability of a given result being correct. In practice, both types of summary measure are reported in the literature, and the difference between them should be clearly understood. There is often a need for a high sensitivity or specificity, depending on the consequences of the two types of error (FN or FP). The accuracy of a test is, therefore, usually given as a pair of (sensitivity, specificity) values. Confidence intervals for sensitivity and specificity can be calculated using binomial methods, although if sensitivity and specificity depend on additional test characteristics, other methods are preferred (described later in the chapter).

8.4.3 Likelihood Ratios

LRs quantify how much a test result changes the odds of having a disease. Many clinicians are more comfortable with probabilities than odds, but when used appropriately, LRs are useful clinical tools. The conversion of a probability into odds and back again is simple:

Odds = Probability/(1 − Probability)
Probability = Odds/(1 + Odds)

For example, a pre-test probability of 75% can be converted to odds of 0.75/0.25 = 3; the odds are 3 to 1. The odds of having the disease after the test result is known (the post-test odds) depend on both the pre-test odds and the LR. The positive likelihood ratio (LR+) indicates how the pre-test odds of the disease change when the test result is positive; the negative likelihood ratio (LR−) indicates how the odds change with a negative test result. Pre-test odds usually depend on the prevalence of the disease in that population and on individual patient characteristics, and must be estimated by the clinician before the post-test odds can be calculated. The LRs are calculated as:

LR+ = sensitivity/(1 − specificity)
LR− = (1 − sensitivity)/specificity
Post-test odds = Pre-test odds × LR
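All of the summary measures defined above follow directly from the four cells of the 2 × 2 table. A minimal sketch, with invented counts:

```python
# Sensitivity, specificity, predictive values and likelihood ratios
# computed from the four cells of the 2 x 2 table.
# The counts (TP = 90, FP = 20, FN = 10, TN = 80) are invented.

def accuracy_measures(tp, fp, fn, tn):
    sens = tp / (tp + fn)       # true positive rate (left column)
    spec = tn / (tn + fp)       # 1 - false positive rate (right column)
    ppv = tp / (tp + fp)        # P(disease | positive test), top row
    npv = tn / (tn + fn)        # P(no disease | negative test), bottom row
    lr_pos = sens / (1 - spec)  # LR+
    lr_neg = (1 - sens) / spec  # LR-
    return sens, spec, ppv, npv, lr_pos, lr_neg

sens, spec, ppv, npv, lr_pos, lr_neg = accuracy_measures(tp=90, fp=20, fn=10, tn=80)
# sens = 0.90, spec = 0.80, ppv = 90/110 ~ 0.82, npv = 80/90 ~ 0.89
# LR+ = 0.90/0.20 = 4.5, LR- = 0.10/0.80 = 0.125
```

Note that sensitivity and specificity are computed down the columns of the table, while PPV and NPV are computed across the rows, as described in the text.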
8
Diagnostic Tests and Diagnostic Accuracy in Surgery
LRs per se do not vary with disease prevalence, although they are vulnerable to the same factors that affect sensitivity and specificity. The LRs may, therefore, vary between populations. The post-test odds are calculated as pre-test odds multiplied by the relevant LR.
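A worked example of the probability-to-odds conversion and post-test odds may help; the pre-test probability and LR+ below are invented values, not taken from any study.

```python
# Worked example of the odds/probability conversions and post-test
# odds described above. The pre-test probability (0.25) and LR+ (6.0)
# are hypothetical values chosen for illustration.

def to_odds(p):
    """Convert a probability to odds: odds = p / (1 - p)."""
    return p / (1 - p)

def to_probability(odds):
    """Convert odds back to a probability: p = odds / (1 + odds)."""
    return odds / (1 + odds)

pre_test_probability = 0.25      # clinician's estimate before testing
lr_positive = 6.0                # hypothetical LR+ of the test

pre_test_odds = to_odds(pre_test_probability)           # 0.25/0.75 = 1/3
post_test_odds = pre_test_odds * lr_positive            # 1/3 * 6 = 2
post_test_probability = to_probability(post_test_odds)  # 2/3
```

A positive result on this hypothetical test raises the probability of disease from 25% to about 67%, which is the kind of bedside update LRs are designed to support.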
8.4.4 Diagnostic Odds Ratio

The DOR is another summary measure of diagnostic accuracy. It is defined as:

DOR = LR+/LR− = [sensitivity/(1 − specificity)] / [(1 − sensitivity)/specificity]
If the 2 × 2 table is used to gather data (as in Fig. 8.1), the formula for the DOR simplifies to DOR = (TP × TN)/(FP × FN). The DOR summarises the odds of a positive test result in patients with disease compared with the odds of a positive test result in those without disease. It is a measure of test discrimination, and varying the threshold makes this discriminating power favour either sensitivity or specificity. In clinical practice, the DOR is not commonly used for deciding on the best test for an individual patient; however, it is a useful measure for comparing different tests, and is often estimated in diagnostic test studies and meta-analyses.
8.4.5 Confidence Intervals and Measures of Variance Estimating confidence intervals for ratios is more complex than for proportions. Natural logarithmic transformation of the ratio is performed, and variance equations are used to calculate the confidence intervals. Once the limits for the log transformation are known, the antilogarithmic transformation is performed to get the confidence interval limits for the original ratio variable. The equations for log (LR) variances are given in detail in Pepe [1] and other texts. The variance of log DOR, which is useful in calculating confidence intervals for overall accuracy, is simplified down to:
var{log (DOR)} = 1 / TP + 1 / FP + 1 / FN + 1 / TN, where TP, FP, FN and TN are the entries in the 2 × 2 table (Fig. 8.1).
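A sketch of the resulting confidence interval calculation for the DOR, using the variance formula above (the 2 × 2 counts are invented):

```python
import math

# 95% CI for the DOR via the natural log transformation described above.
def dor_confidence_interval(tp, fp, fn, tn, z=1.96):
    dor = (tp * tn) / (fp * fn)
    log_dor = math.log(dor)
    se = math.sqrt(1/tp + 1/fp + 1/fn + 1/tn)   # sqrt of var{log(DOR)}
    lower = math.exp(log_dor - z * se)           # antilog of the limits
    upper = math.exp(log_dor + z * se)
    return dor, lower, upper

dor, lo, hi = dor_confidence_interval(tp=90, fp=20, fn=10, tn=80)
print(round(dor, 1), round(lo, 1), round(hi, 1))
```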
8.4.6 Concluding Remarks About Estimates of Diagnostic Accuracy There are multiple ways to quantify the performance of a diagnostic test. Sensitivity and specificity measure the inherent accuracy of the test, without consideration of clinical needs. Predictive values address the clinical question of how to interpret a specific test finding. LRs relate to inherent accuracy and can be used to estimate post-test probabilities of true disease status. Two-by-two tables are simple, straightforward tools for calculating these estimates as well as their confidence intervals. After the calculations have been performed a few times, they become simple, and their meaning becomes intuitive.
8.5 Receiver Operating Characteristic Analysis Some diagnostic tests do not produce a dichotomous answer (yes/no), but measure a continuous or ordinal scale. Continuous variables are commonly encountered in surgical practice. For example, serum haemoglobin, body temperature, prostate-specific antigen and creatinine are all continuous variables. The interpretation of these is usually done, either formally or informally, by reference to a preconceived threshold for normality or a threshold to change management. Ordinal variables are less common, but include histopathological determination of the grade of tumour differentiation. Ultimately, the result of a diagnostic test is transformed into a dichotomous result – the probability of disease has reached a certain threshold or it has not. The selection of the appropriate threshold is usually based on local guidelines or on thresholds derived from the medical literature. For each test, the choice of threshold will depend on the population, disease, available resources for follow up and consequences of inaccurate diagnosis.
C. M. Jones et al.
Fig. 8.4 Schematic ROC curve showing sensitivity vs. (1-specificity) plotted over the unit square. The diagonal represents the performance of the uninformative test
8.5.1 The ROC Curve The most commonly used measure to quantify the performance of such tests is the ROC curve, which measures test accuracy over the spectrum of possible results. ROC curve analysis is based on the selection of a threshold for positive results. By convention, we will assume that a larger result indicates the presence of disease, but in practice, a threshold may represent the lower limit of normality. For each threshold, a test has a given sensitivity and specificity. An ROC curve maps the possible pairs of (sensitivity, 1-specificity) that are produced as the threshold is shifted throughout its spectrum. In ROC analysis, sensitivity is also known as the true-positive rate (TPR), and (1-specificity) as the false-positive rate (FPR). Once the TPR and FPR are calculated for a given threshold, another threshold is chosen, and the calculations are performed again. Once all the (TPR, FPR) pairs for the test are calculated, TPR is plotted on the vertical axis, and FPR on the horizontal (Fig. 8.4). The range for TPR and FPR is zero to one, mapping the ROC curve over the unit square, (0,0) to (1,1). Less strict thresholds produce more positives, both true and false, leading to a high TPR and FPR. Stricter criteria for a positive result produce fewer positives, lowering TPR and FPR. The choice of ideal sensitivity and specificity will depend on the clinical situation. The ideal test threshold will discriminate diseased from nondiseased subjects all the time, and will have a TPR of 1
and an FPR of zero (corresponding to the top left corner of the ROC graph). Intuitively, a test that gives the incorrect diagnosis every time is simply the reverse of the ideal test, and while it is not clinically useful, it is actually a highly diagnostic test (once its flaws are known). The uninformative test makes a positive result equally likely for all subjects, irrespective of true disease status. In this case, the TPR equals the FPR, and the resulting ROC curve is a straight diagonal line from (0,0) to (1,1) (Fig. 8.4). As the accuracy of the test improves, the curve moves closer to the top left hand corner of the graph. In practice, the sensitivity and specificity for a test are often reported for a given diagnostic threshold. By demonstrating the accuracy over a spectrum of thresholds, the overall accuracy of the test can be summarised. This allows different tests to be compared through summary measures of the ROC curves.
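The threshold-sweeping procedure described above can be sketched as follows, assuming the convention that larger results indicate disease; the scores and disease labels are invented for demonstration:

```python
# Trace out (FPR, TPR) pairs as the positivity threshold moves through
# a set of continuous test results.

def roc_points(scores, diseased):
    """Return (fpr, tpr) pairs, one per candidate threshold."""
    points = []
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, d in zip(scores, diseased) if d and s >= threshold)
        fp = sum(1 for s, d in zip(scores, diseased) if not d and s >= threshold)
        fn = sum(1 for s, d in zip(scores, diseased) if d and s < threshold)
        tn = sum(1 for s, d in zip(scores, diseased) if not d and s < threshold)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points

scores   = [2.1, 3.5, 4.0, 5.2, 6.8, 7.1, 8.3, 9.0]
diseased = [False, False, False, True, False, True, True, True]
# The least strict threshold gives (FPR, TPR) = (1, 1); the strictest
# lowers both, exactly as described in the text.
for fpr, tpr in roc_points(scores, diseased):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```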
8.5.2 Area Under the ROC Curve The area under the ROC curve (AUC) is a summary measure of test performance which allows comparison between different tests. TPR and FPR values lie between zero and one inclusive, making the AUC for a perfect test equal to one. The test which produces a false result every time (and which, once its flaws are recognised, perfectly identifies disease status in reverse) gives a flat line along the horizontal axis, with an AUC of zero. The random test, allocating positive results half the time, has an AUC of 0.5.
AUC can be calculated for different diagnostic tests, and then compared. An AUC closer to one indicates a better test.
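A minimal sketch of the AUC calculation, using the trapezoidal rule over a set of (FPR, TPR) points (the points below are illustrative and assumed to include the corners (0, 0) and (1, 1)):

```python
# Trapezoidal estimate of the AUC from a list of (FPR, TPR) points.

def auc_trapezoid(points):
    pts = sorted(points)                      # order by FPR
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0   # trapezoid between points
    return area

# The uninformative (diagonal) test has AUC = 0.5.
print(auc_trapezoid([(0, 0), (0.5, 0.5), (1, 1)]))  # 0.5
# A better test bows towards the top left corner, giving a larger AUC.
print(auc_trapezoid([(0, 0), (0.1, 0.7), (0.3, 0.9), (1, 1)]))
```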
8.6 Combining Studies: Diagnostic Meta-Analysis This chapter has so far discussed diagnostic accuracy of a single test in a single population. Whilst this can be performed multiple times using different tests, populations or gold standards within the same published article, the process of combining these results to form robust statistical conclusions is most commonly seen in meta-analytical papers. For many diagnostic tests, there will be multiple individual studies available in the literature, with varying results. This section explains the principles of robust diagnostic meta-analytical methodology.
8.6.1 Goals and Guidelines Meta-analysis aims to identify a clinical question and use the available data to produce a comprehensive answer. The collective data provide more reliable and credible conclusions than individual studies, provided that the methodology of both the primary studies and the meta-analysis is sound. Diagnostic meta-analysis focuses on the performance of a test across the range of studies in the literature. The first steps are to identify an index test, an appropriate reference standard and the diagnostic outcome(s) to be measured. For example, it is important to choose an appropriate comparison in carotid artery stenosis measurement – the reference standard and threshold for significant stenosis may vary between studies. It is also vital for the meta-analysts to determine the clinical settings of the test, including test characteristics, clinical setting, target population and threshold for positivity. As far as is practical, it is best to include studies which are similar in design, population, test performance and threshold. As this is not feasible in many meta-analytical projects, there are guidelines available to help reduce the impact of heterogeneity arising from different studies. One method of dealing with differences between studies is to isolate a subset of papers and perform subgroup analysis for more targeted
conclusions. A more robust method is to analyse the effect of certain characteristics on the outcomes of the test and assess the effect of study quality on the final results (see Sect. 8.6.2). For example, the accuracy of cardiac stress testing will vary across studies with different target populations or reference standards. Studies comparing thallium studies to angiography are not directly comparable to those comparing thallium studies with coronary CT. More complex statistical models are capable of directly comparing multiple tests within the same analysis, but this is discussed later in the chapter.
8.6.2 Heterogeneity Assessment The variation in results across the studies included in a meta-analysis is termed heterogeneity. The reasons for different results are numerous – chance, errors in analytical methodology, and differences in study design, protocol, inclusion and exclusion criteria, threshold for calling a result positive, among many others. Compared with studies of therapeutic strategies, the insistence on high-quality reporting of study protocols for diagnostic strategies has been lax, leading to sometimes marked differences in published results for the same technique. The earlier discussion on sources of bias within diagnostic testing is useful in this context to explain sources of heterogeneity in diagnostic meta-analysis. In addition, if there are differences in the spectrum of disease found in the studies in the meta-analysis, there may be heterogeneity due to the effect of disease severity on the performance of the test. A test may well be more sensitive to a severe case of a particular disease than to a subtle one. The points listed in the STARD and QUADAS tools are particularly important sources of heterogeneity to consider.
8.6.2.1 Quality Assessment (Univariate Analysis) The tools shown in Sect. 8.3 (STARD and QUADAS) are designed to help identify aspects of primary study design that may affect the accuracy results. Whiting et al. [4] demonstrate that summary quality scores like those generated by the STARD tool are not appropriate to assess the effect of study quality on meta-analysis
results, as a summary score may mask an important source of bias. Analysing each item in the QUADAS tool separately for its effect on overall accuracy is a more robust approach. Westwood et al. [5] suggest separate univariate regression analysis of each item, with subsequent multivariate modelling of influential items (those with P < 0.10 in the univariate model) to assess their combined influence on the results. The influence of a given covariate is expressed as the relative diagnostic odds ratio (RDOR): the ratio of the DOR when the covariate is present to the DOR when it is absent. Westwood et al. [5] provide a thorough discussion and worked examples of univariate and selective multivariate analysis of QUADAS tool items, as well as carefully chosen analysis-specific covariates. The identification of design flaws, population characteristics, test characteristics and other case-specific parameters which affect the reported accuracy can explain the heterogeneity of results and sharpen the conclusions of the meta-analysis.
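As a simplified illustration of the univariate approach (real analyses typically weight studies by inverse variance and use formal meta-regression software; the study data here are invented):

```python
import math

# Regressing log(DOR) on a binary quality item across dummy studies.
# The exponentiated slope is the relative diagnostic odds ratio (RDOR).

def rdor_univariate(log_dors, covariate):
    """Unweighted least-squares slope of log(DOR) on a 0/1 covariate."""
    n = len(log_dors)
    mean_x = sum(covariate) / n
    mean_y = sum(log_dors) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, log_dors))
    sxx = sum((x - mean_x) ** 2 for x in covariate)
    slope = sxy / sxx
    return math.exp(slope)           # RDOR

# Six dummy studies: covariate = 1 if the quality item was satisfied.
log_dors  = [math.log(d) for d in (12.0, 25.0, 9.0, 30.0, 8.0, 28.0)]
covariate = [0, 1, 0, 1, 0, 1]
# An RDOR well above 1 would suggest that studies satisfying the item
# report systematically higher diagnostic odds ratios.
print(round(rdor_univariate(log_dors, covariate), 2))  # 2.9
```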
8.6.2.2 Random and Fixed Effects The variation between results across the group of studies included in the meta-analysis can be approached in two different ways. The variation in results is assumed to be partly due to the studies themselves – patient, protocol or test factors, for example. In fixed effects models, the influence of a given study characteristic on the heterogeneity of the results is assumed to be constant, or fixed, across the studies. However, the distribution of the variable across the studies is not usually known to be constant, and more conservative analysis methods use random effects models. Random effects consider the available studies as a sample of the “population” of all possible studies, and the mean result from the
sample is used to estimate the overall performance of the diagnostic test. The study characteristic proposed to contribute to heterogeneity is, therefore, not assumed to be fixed across the studies. In this way, the level of certainty is less, and the confidence intervals tend to be wider than with fixed effects methods. The variation between studies can be expressed in a variety of ways. A forest plot (Fig. 8.5) is often used to graphically demonstrate the amount of overlap in sensitivity, specificity, DOR or LRs between studies. A summary ROC curve can also be used to graphically demonstrate the variety of sensitivity and specificity pairs, as described below. Cochran's Q statistic is a chi-squared test which is used to estimate the heterogeneity of a summary estimate like DOR or LR. The smaller the value of Q, the less the variation across the studies. Analysing subgroups of studies with similar characteristics removes the contribution of factors which vary across the entire group of studies. For example, if studies including adults only are analysed separately, any heterogeneity that remains cannot be explained by adult-child differences in test accuracy. A weighting can be applied to each study's result, depending on the choice of weighting felt to be most appropriate. Inverse variance and study size are commonly used weights, whereby large studies and studies with little variation in results are given greater weight in the heterogeneity calculations.
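Cochran's Q can be sketched as follows for a set of study-level log(DOR) estimates with inverse-variance weights (the estimates and variances are dummy data):

```python
# Cochran's Q: weighted sum of squared deviations from the pooled estimate.

def cochran_q(estimates, variances):
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    return sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))

log_dors  = [2.1, 2.5, 1.8, 2.9, 2.3]
variances = [0.20, 0.15, 0.40, 0.25, 0.10]
q = cochran_q(log_dors, variances)
# Under homogeneity, Q follows a chi-squared distribution with k - 1 = 4
# degrees of freedom; a large Q suggests real between-study variation.
print(round(q, 2))
```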
8.6.3 Diagnostic Meta-Analysis Techniques Whether meta-analysis of pooled data can be conducted depends both on the number and the methodological quality of the primary studies. The simplest method for analysing pooled data from multiple studies is averaging the
Fig. 8.5 Forest plot showing heterogeneity of sensitivity results across a dummy set of data. The overall pooled sensitivity result is shown as a diamond shape representing the confidence interval
sensitivities and specificities. This is valid only when the same test criteria, population and clinical setting have been used in each study, and each study is of similar size and quality. If different criteria, or thresholds, have been used, there will be a relationship between sensitivity and specificity across the studies. As sensitivity increases, specificity will generally drop (threshold effect). In these cases, weighted averages will not reflect the overall accuracy of the test, as extremes of threshold criteria can skew the distribution. Separate calculations of overall sensitivity and specificity tend to underestimate diagnostic test accuracy, as there is always interaction between the two outcomes. DOR is the statistic of choice to measure the overall performance of a diagnostic test.
8.6.3.1 Summary Receiver Operating Characteristic (SROC) Analysis If the test results are available in binary form, 2 × 2 tables with TP, FP, TN and FN can be formed for each primary study. The (TPR, FPR) pair from each study can be plotted onto a pair of axes similar to those of a ROC curve. In a ROC curve, the data points are formed by variation of the diagnostic threshold within the same
population. In meta-analytical curves, each data point is formed by the (TPR, FPR) result from a primary study. The scatterplot of data points formed from these results is termed a summary ROC (SROC) curve (Fig. 8.6). The curve which is mapped onto the graph is calculated through regression methods. The principles are that logit(TPR) and logit(FPR) have a linear relationship, and that this relationship can be exploited with a line of best fit (regression techniques). Logit(TPR) and logit(FPR) are defined as logit(TPR) = log{TPR/(1 − TPR)} and logit(FPR) = log{FPR/(1 − FPR)}, where log represents the natural logarithm. As there is no logical reason to favour logit(TPR) or logit(FPR), Moses et al. proposed using linear combinations of the two variables as the dependent and independent variables in the regression equation. This also neatly solves the dilemma about the different "least squares" solutions which would result from choosing one or the other logit. The model is D = a + bS, where D = logit(TPR) − logit(FPR) and S = logit(TPR) + logit(FPR); a is the intercept, and b represents the dependence of test accuracy on threshold. D is equivalent to log(DOR), the diagnostic log odds ratio. S inversely measures the diagnostic threshold: high S values correspond to low diagnostic thresholds. The model maps D against S on linear axes, and the least squares line of best fit is fitted to the data. Once a and b are calculated from the intercept and slope of the D-S line, the model can be transformed back into the plane of (TPR, FPR), according to the equation
TPR = exp(a/(1 − b)) [FPR/(1 − FPR)]^((1 + b)/(1 − b)) / {1 + exp(a/(1 − b)) [FPR/(1 − FPR)]^((1 + b)/(1 − b))}.

Fig. 8.6 Example of SROC curve showing sensitivity (TPR) vs. (1-specificity) (FPR) plotted over the unit square. The diagonal represents the random test curve. The antidiagonal running from the top left corner to the bottom right corner intersects the curve at the value of Q*. The area under the curved line is AUC
The curve of TPR against FPR can now be plotted over the data points, as both a and b are known from the line of best fit in the logit plane. Calculation of the area under the curve (AUC) is performed by integration of the above equation over the range (0,1). If the range of raw FPR data points is small, it may be necessary to perform a partial AUC over the range of data. This is acceptable if the specificity of the test can be assumed
to be similar in the target population which will undergo the test in practice. Further information regarding calculations of the SROC curve can be found in Moses et al. [6] or Walter [7]. Examples of SROC analysis are provided in Jones and Athanasiou [8]. The statistic Q* is another summary measure of accuracy derived from the SROC curve. Q* is the point on the curve which intercepts the anti-diagonal and corresponds to the point where sensitivity and specificity are equal. Numerically, Q* is defined as Q* = exp(a/2)/[1 + exp(a/2)], where a is the intercept from the D-S regression. The Q* value lies between 0 and 1 and represents a point of indifference – the probability of correct diagnosis is the same for all subjects. Q* is an appropriate statistic provided that the clinical importance of sensitivity and specificity is approximately equal. After weighing up the importance of the two outcomes, it may be felt that they are not of equal importance, and Q* would no longer be an appropriate summary statistic. There are several disadvantages of the traditional SROC analysis. Firstly, it is impossible to give a summary estimate of sensitivity and specificity, as they are the independent and dependent variables in the exponential curve model. This means that AUC and Q* are used as summary values of overall accuracy, which are less well understood and need to be explained when used in the literature. It is the estimated sensitivity and/or specificity which often help the clinician choose a diagnostic test, and the clinical scenario may favour a test with high sensitivity or specificity, but not necessarily both. SROC does not facilitate calculations of overall summary estimates based on one or the other. Similarly, LRs cannot be produced by SROC. Secondly, the effect of diagnostic threshold is not modelled, so if the accuracy is dependent on threshold, the model is biased and the curve appears asymmetrical.
Additionally, threshold is not allowed to vary in the interpretation and application of the results to particular populations.
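The D-S regression and the Q* statistic described in this section can be sketched as follows; the study (TPR, FPR) pairs are invented for illustration:

```python
import math

# Sketch of the Moses model: compute D and S for each study from its
# (TPR, FPR) pair, fit D = a + bS by ordinary least squares, and derive
# Q* from the intercept.

def logit(p):
    return math.log(p / (1 - p))

def moses_fit(studies):
    d_vals = [logit(tpr) - logit(fpr) for tpr, fpr in studies]  # log DOR
    s_vals = [logit(tpr) + logit(fpr) for tpr, fpr in studies]  # threshold
    n = len(studies)
    mean_s, mean_d = sum(s_vals) / n, sum(d_vals) / n
    b = (sum((s - mean_s) * (d - mean_d) for s, d in zip(s_vals, d_vals))
         / sum((s - mean_s) ** 2 for s in s_vals))
    a = mean_d - b * mean_s
    return a, b

studies = [(0.95, 0.30), (0.90, 0.20), (0.85, 0.12), (0.75, 0.08)]
a, b = moses_fit(studies)
q_star = math.exp(a / 2) / (1 + math.exp(a / 2))
print(round(a, 2), round(b, 3), round(q_star, 2))
```

With a and b known, the fitted curve can be back-transformed onto the (TPR, FPR) plane using the equation given earlier in this section.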
8.6.3.2 Bivariate Approach to SROC The bivariate approach was applied to diagnostic tests by Reitsma et al. [9]. The bivariate model is based on the premise that the outcomes of interest may be multiple and co-dependent, as in the case of the sensitivity and
specificity of a diagnostic test. As well as keeping the two-dimensional principles of sensitivity and specificity intact, this approach allows a single variable to be factored into the models of both outcomes. Statistically, the bivariate model is a random-effects model in which each of logit(sensitivity) and logit(specificity) is assumed to follow a normal distribution across the studies included in the meta-analysis. There is also assumed to be a correlation between the two logit-transformed variables. The presence of two dependent normally-distributed variables leads to a bivariate distribution. The variance of the two outcomes is expressed as a matrix of individual variances and the covariance between the two. Unlike SROC analysis, the results of bivariate models are expressed as summary estimates of sensitivity and specificity and their confidence intervals. As the underlying distributions are assumed to be normal, a linear function can be used to perform the analysis, allowing readily available software to be used. In addition, the use of random-effects makes estimating interstudy heterogeneity in either sensitivity or specificity straightforward. The ability to examine the effect of a covariate, for example study design or inclusion criteria, on sensitivity or specificity, rather than the overall DOR, is also an advantage. The bivariate model is, however, more complex than the SROC model, and if further reading is desired, the papers by Reitsma et al. [9] and Harbord et al. [10] are recommended.
8.6.3.3 Hierarchical SROC Analysis Hierarchical SROC (HSROC) analysis performs multiple layers of modelling to account for heterogeneity both within and between primary studies by modelling both accuracy and threshold as random effects variables. At the within-study level, the number of positive results is modelled on a binomial distribution, incorporating threshold and accuracy as random effects and the interaction between them as a fixed effect. This is identical to the traditional SROC model if there is no dependence of accuracy upon threshold. A second level of modelling uses the estimates of accuracy and threshold (assuming a normal distribution) to calculate the SROC curve, as well as expected sensitivity, specificity, LRs and other desirable endpoints like AUC. Like the bivariate model, the HSROC model assumes normal distributions of the underlying variables;
however, the focus of HSROC analysis is curve generation with emphasis on the shape of the curve. The complexity of layered non-linear modelling has prevented widespread use of HSROC. A Bayesian approach using Markov Chain Monte Carlo randomisation methods required a further level of modelling just to run the analysis [11] and proved the validity of the model, without making it accessible. New software procedures like NLMIXED in SAS have enabled easier model application. The model requires individualised syntax and a grasp of both SAS and the statistical models. However, useful summary estimates are produced for sensitivity, specificity and LRs as well as the HSROC curve. The AUC and Q* estimates can be easily calculated using SAS macros. The Bayesian estimates have been shown to closely approximate the results produced with SAS [12]. The HSROC technique is able to compare multiple tests from the same studies in the same analysis, meaning that the differences between the study characteristics are present for both tests. Threshold, accuracy and SROC curve shape are estimated separately for each test, increasing the number of variables in the model. The limiting factor in HSROC analysis is often the capacity of the data to converge to a result. The SROC shape variable may be considered the same for each test type if difficulties in convergence arise. Covariates allow heterogeneity to be investigated and explained by different aspects of study design or patient spectrum. There is far more flexibility in HSROC compared to traditional SROC in terms of explaining different results. Both intra- and inter-study variables can be included as covariates. The HSROC curve is plotted on the same axes and has the same layout as the SROC curve. Curve asymmetry can be quantitatively assessed through shape variables to determine the influence of threshold on test accuracy.
8.6.3.4 Comparing the Bivariate and HSROC Models The results of HSROC and bivariate analysis are often very similar, and in the case where there are no study-level covariates, they are identical. The bivariate model allows covariates that influence sensitivity or specificity (or both), whilst the HSROC model allows covariates that influence threshold or accuracy (or both). If a covariate is assumed to affect both variables in either model,
then the two models are the same. The HSROC method, however, allows greater flexibility in dropping variables from the model as it allows greater control over the choice of included covariates compared to the fairly standard framework of the bivariate approach. The bivariate model, however, can be fitted using a variety of available software without the need to use the NLMIXED function in SAS or the cumbersome WinBUGS software. Further information is available in Harbord et al. [10].
8.7 Conclusions There are many papers in the medical literature which produce conclusions on diagnostic test accuracy. Appropriate application to clinical practice depends on the clinician’s ability to identify the strengths and weaknesses in the study and decide whether the results apply to their clinical practice. Diagnostic accuracy is not measured in the same way in all studies. Most will quote sensitivity and specificity, LRs or DOR, and these concepts should be familiar to every clinician when reading papers on diagnostic accuracy. Meta-analytical papers are becoming more stringently reviewed with the increasing acceptance of the QUADAS tool and published guidelines for the performance of diagnostic meta-analysis. Nevertheless, the clinician should remain vigilant for papers which do not meet these guidelines, and at least have the understanding (or this book as a reference tool!) to make the most appropriate decisions in clinical practice.
References
1. Pepe MS (2003) The statistical evaluation of medical tests for classification and prediction (Oxford Statistical Science Series). Oxford University Press, New York
2. Irwig L, Tosteson AN, Gatsonis C et al (1994) Guidelines for meta-analyses evaluating diagnostic tests. Ann Intern Med 120:667–676
3. Begg CB, McNeil BJ (1988) Assessment of radiologic tests: control of bias and other design considerations. Radiology 167:565–569
4. Whiting P, Rutjes AW, Reitsma JB et al (2003) The development of QUADAS: a tool for the quality assessment of studies of diagnostic accuracy included in systematic reviews. BMC Med Res Methodol 3:25
5. Westwood ME, Whiting PF, Kleijnen J (2005) How does study quality affect the results of a diagnostic meta-analysis? BMC Med Res Methodol 5:20
6. Moses LE, Shapiro D, Littenberg B (1993) Combining independent studies of a diagnostic test into a summary ROC curve: data-analytic approaches and some additional considerations. Stat Med 12:1293–1316
7. Walter SD (2002) Properties of the summary receiver operating characteristic (SROC) curve for diagnostic test data. Stat Med 21:1237–1256
8. Jones CM, Athanasiou T (2005) Summary receiver operating characteristic curve analysis techniques in the evaluation of diagnostic tests. Ann Thorac Surg 79:16–20
9. Reitsma JB, Glas AS, Rutjes AW et al (2005) Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. J Clin Epidemiol 58:982–990
10. Harbord RM, Deeks JJ, Egger M et al (2007) A unification of models for meta-analysis of diagnostic accuracy studies. Biostatistics 8:239–251
11. Rutter CM, Gatsonis CA (2001) A hierarchical regression approach to meta-analysis of diagnostic test accuracy evaluations. Stat Med 20:2865–2884
12. Macaskill P (2004) Empirical Bayes estimates generated in a hierarchical summary ROC analysis agreed closely with those of a full Bayesian analysis. J Clin Epidemiol 57:925–932

List of Useful Websites
1. Loong T (2003) Understanding sensitivity and specificity with the right side of the brain. BMJ 327:716–719. http://www.bmj.com/cgi/content/full/327/7417/716
2. Altman DG, Bland JM (1994) Statistics notes: diagnostic tests 1: sensitivity and specificity. BMJ 308:1552. http://www.bmj.com/cgi/content/full/308/6943/1552
3. Altman DG, Bland JM (1994) Statistics notes: diagnostic tests 2: predictive values. BMJ 309:102. http://www.bmj.com/cgi/content/full/309/6947/102
9
Research in Surgical Education: A Primer Adam Dubrowski, Heather Carnahan, and Richard Reznick
Contents
9.1 Introduction 99
9.2 Qualitative vs. Quantitative Research 100
9.2.1 Generating Questions 100
9.2.2 Qualitative Research 101
9.2.3 Quantitative Research 102
9.3 Research Design 103
9.3.1 Minimizing Threats to Validity 104
9.3.2 Design Construction 104
9.3.3 The Nature of Good Design 106
9.4 Measures (Experimental Research) 106
9.4.1 Developing an Instrument 107
9.4.2 Feasibility 107
9.4.3 Validity 108
9.4.4 Reliability 108
9.5 Acquisition and Analysis of Data (Experimental Research) 110
9.5.1 Data Collection 110
9.5.2 Tests of Normality 110
9.5.3 Three Normality Tests 111
9.5.4 Categories of Statistical Techniques 111
9.5.5 Nonparametric Analyses 112
9.5.6 Relationships Between Variables 112
9.5.7 Differences Between Independent Groups 112
9.5.8 Differences Between Dependent Groups 112
9.6 Funding, Dissemination, and Promotion 113
References 114
A. Dubrowski () Centre for Nursing Education Research, University of Toronto, 155 College Street, Toronto, ON, Canada M5T 1P8 e-mail:
[email protected] Abstract The field of surgical education is young, and opportunities are not on the same scale as they are in fields of fundamental biology, clinical epidemiology, or health care outcomes research; however, the trajectory is steep, and educational work is improving in its sophistication and adherence to methodological principles. There is also an excitement and desire among young academic surgeons to work in an area that has obvious and direct relevance to their “mainstream job” as surgeons. This chapter explores approaches that can bring such evidence to bear upon educational questions, since education, like any other discipline, cannot be subject to practice by anecdote. Changes must be made on the basis of methodologically rigorous and scientifically sound research.
9.1 Introduction Surgical education has come of age. The past 30 years have seen efforts aimed at a scientific understanding of education in surgery mature from the level of minor curiosity to emerge as a bona fide academic focus. Convincing evidence supports this observation, with accomplishments in the field of surgical education regularly being considered a criterion for promotion, with graduate training in education becoming more common for aspiring academic surgeons, and with an increasing number of surgery departments hiring surgeons who will principally function in the educational arena. Of course, interest in educational theory and research is not new. For over a century, the field of psychology was the "academic home" of educational research. The last three or four decades, however, have witnessed an explosion in the quantity, diversity, and quality of efforts in education. Disciplines such as cognitive science,
T. Athanasiou (eds.), Key Topics in Surgical Research and Methodology, DOI:10.1007/978-3-540-71915-1_9, © Springer-Verlag Berlin Heidelberg 2010
kinesiology, engineering, and the health professions have invested deeply in educational research as a fundamental expression of their discipline. Such activity has translated into increased opportunities for individual academics wishing to focus their work in education. In the health professions alone, approximately twenty institutions now offer graduate degrees tailored to the needs of academic professionals. Hundreds of journals focus on education, providing a forum for scientific writing in the field, and conferences concentrating on issues of medical and surgical education abound in the Americas, Europe, and Australasia.

To be sure, the field of medical and surgical education is young, and opportunities for scholarship are not on the same scale as they are in fields of fundamental biology, clinical epidemiology, or health care outcomes research; however, the trajectory is steep, and educational work is improving in its sophistication and adherence to methodological principles. There is also an excitement and desire among young academic surgeons to work in an area that has obvious and direct relevance to their “mainstream job” as surgeons.

This burgeoning interest in surgical education has paralleled the skyrocketing attention being paid to the more general field of medical education, fueled by a number of seminal issues. Of all these issues, the advent of problem-based learning [1], a curricular methodology which swept North America and Europe, has probably had the greatest impact on the delivery of surgical education in the last 30 years. The use of computer-assisted instruction, which has changed our thinking about information transfer, has also had a significant effect, and we are most likely now on the cusp of further revolutionary changes effected by the digital delivery of knowledge [2].
A further important wave of change has come in the form of a formal focus on communication skills; this focus on “noncognitive” competencies has, to a large extent, been responsible for the birth of the entire discipline of performance-based testing, with instruments such as the Objective Structured Clinical Examination (OSCE) [3]. Lastly, a transformative change is now occurring in the direction of interprofessional education. With the breaking down of traditional hierarchies, medicine of the future will be delivered by teams of health professionals; and health professional learning will, likewise, be accomplished through an interprofessional effort.
A. Dubrowski et al.
Concurrent with these more general issues in medical education, some sweeping changes have permeated the domain of surgical education. The first global change witnessed has been a decrease in the luster of the surgical specialties. Today’s medical students, voting with their choice of residencies, have declared an overwhelming interest in the so-called “controlled lifestyle” specialties, an interest that has led to a decrease in surgery as a career choice. Also, concern about overwork in the surgical workplace has escalated with the notoriety of the Libby Zion case [4] and with subsequent dramatic changes in the realm of work-hour restrictions for house staff. This issue has become entrenched in practice through mechanisms such as the ACGME 80-h work week in the United States and the European working-time directive [5–7].

Another major change in surgical education has been the increasing popularity of adjunctive environments for surgical learning, most notably the skills laboratories, which provide instruction in the technical fundamentals of surgery. For surgeons interested in education, this has been an awakening, and the introduction of serious academic research venues congruent with their skills, talents, and passions has promoted a huge amount of research activity.

The interest generated by these various issues must be viewed in conjunction with another emphasis now found in surgical inquiry, namely, that change needs to be guided by evidence. This chapter explores approaches that can bring such evidence to bear upon educational questions, since education, like any other discipline, cannot be subject to practice by anecdote. Changes must be made on the basis of methodologically rigorous and scientifically sound research.
9.2 Qualitative vs. Quantitative Research

9.2.1 Generating Questions

Framing a good research question is the first step and one of the most important factors in designing a good research project and research program – a well-formulated question that is searchable and testable will most likely lead to a successful exploration of the literature. In the question formulation stage, two pitfalls need to be avoided. One of the most common pitfalls encountered by a researcher immersed in a specific clinical context is that his or her research question often springs from anecdotal evidence and observations rather than from existing literature and evidence; the researcher can avoid this pitfall by framing the question carefully, ensuring that any personal bias and experiential contextualization are removed. The second pitfall is that the question generated may lack focus. By dividing the question into its component parts at the very beginning, the researcher can avoid this situation and will formulate a highly focused question, which not only avoids unnecessary, complicated, and time-consuming searches of irrelevant material, but also facilitates the discovery of the best evidence in the literature and the generation of a set of testable hypotheses.

Ultimately, the structuring of the question is a function of both personal style and the nature of the research to be conducted. For example, a hypothesis-driven question will be highly structured, whereas a question leading to exploratory research will be less focused to allow the researcher to explore a wider scope. Qualitative research questions should be structured to address how people “feel” or “experience” certain situations and conditions; quantitative research questions should be structured to answer questions of “how many” or “how much” [8–10].
9.2.2 Qualitative Research

Qualitative research addresses questions dealing with perceptions of situations and scenarios. The most common qualitative approaches in medical education involve grounded theory, ethnography, and phenomenology. However, since qualitative research is not the focus of this chapter, we only briefly describe the aims and methodologies of these three approaches.

The grounded theory approach aims to discover the meanings that humans assign to people and objects with which they interact, in order to link these meanings with observed behaviors [11]. Using interviews and observations to collect data, this is one of the most common approaches in medical education. Researchers with backgrounds in rhetoric, sociology, and psychology who engage in qualitative research using the grounded theory approach have made tremendous contributions to our understanding of meanings and behaviors within the
context of medical education. Lorelei Lingard [12] and her team, for instance, have studied interprofessional communication patterns within various clinical settings such as the operating room, the intensive care unit, and the pediatric medicine inpatient ward. In particular, their work explores the intersection between team communication patterns and the issues of novice socialization and patient safety. On the basis of this research, we are now beginning to understand how certain communication patterns lead to more effective team performance within a clinical setting, and it is hoped that we can structure our future educational interventions to stimulate favorable patterns of communication.

Ethnography aims to study a culture from the perspective of the people who actually live in it [13]. In the case of medical and surgical education, ethnography is a study of medical students and residents who live in the culture of medical education. This approach can be described as a very intensive, face-to-face encounter between the researcher and the study culture; during fieldwork, data are generally gathered by participation in social and cultural engagements. The third approach, phenomenology, aims to achieve a deeper understanding of the meaning of the everyday lived experience of people. This approach demands an in-depth immersion of the researcher in the world of the study population.

Asking a well-structured qualitative research question is a very difficult and demanding task. Two essential elements are necessary in the formulation of a qualitative research question: the population to be studied and the situation within which the population is being studied (see Table 9.1). The question should address the characteristics of the specific population that will be tested or assessed: Are they part of a larger group of individuals who may be different from the rest of the population? Is there a specific grouping, such as sex, or age, or level of training?
What is the specific environment in which the problem is being studied? The question should also address the situation or context in which the population will be observed: What are the circumstances, conditions, or experiences that are known to affect the population, and which of them still need to be explored?

Working example: “Do communication patterns affect the atmosphere of the operating room?” Although on the surface this question appears well structured, it does not adhere to the principles of generating the optimal qualitative research question. Specifically, this
Table 9.1  Key features of testable and searchable research questions

Key characteristic   Qualitative                                    Quantitative
Population           Who are the participants? From where?          Who are the participants? From where?
                     What are their preexperimental features?       What are their preexperimental features?
Interventions        –                                              What are the experimental interventions and
                                                                    what are the control interventions?
Outcomes             –                                              What are the expected outcomes? Who will
                                                                    assess these and how?
Situation            What circumstances, conditions, or             –
                     experiences is the researcher interested
                     in learning more about?
question does not define the participants and the situation. If we assume that the researchers are interested in the assessments of communication patterns between the surgeon, the senior resident, and the nurse, as well as their perceptions of operating room atmosphere, that the observations are made in a virtual operating room, and finally that all parties involved went through a training session on how to improve their communication patterns, a more defined question would be: “Does interprofessional training in communication patterns lead to perceptions of improved cohesion within the operating room? A simulation study.”
9.2.3 Quantitative Research

Arguably, the majority of the research conducted in the field of medical and surgical education is quantitative in nature, typically answering questions of “how many” and “how much.” For example, when comparing two different educational approaches, the researcher can ask about the difference in the amount of learning that occurred when the participants were exposed to one of the two educational interventions. The quantitative research approach allows one to generate precise and focused research questions and highly testable hypotheses.

Several approaches to formulating quantitative research studies have been proposed; they include true experimental designs, such as the randomized controlled trial, and observational studies. The randomized controlled trial consists of a random allocation of study participants into various experimental groups. The purpose of the randomization
procedure is to ensure that any confounding variables that are not anticipated or controlled by the researcher will be equally distributed among all the experimental groups. In some cases, the researcher may decide to conduct a pretest after the initial randomization. The pretest confirms that all participants show similar performance characteristics on the skill of interest; therefore, any improvements noted after the experimental intervention are likely due to the intervention rather than chance. Once participants are allocated to the various experimental groups, their performances are assessed forward in time to determine whether the educational experience has the hypothesized effect. Typically, this is assessed with an immediate posttest.

A common mistake among educational researchers is the assumption that the posttest indicates the amount of learning that has occurred because of the educational intervention. Based on the learning literature and theories, the immediate posttest should be treated as an indication of improvement in performance rather than of learning [14]. More specifically, several variables may influence performance on this immediate posttest: boredom, fatigue, and diminishing interest in learning may have a negative effect on the immediate performance, whereas excitement, recency, and group cohesion may have a positive, facilitating impact on it. To circumvent these negative and positive transient effects of practice, it has been proposed that the true measure of learning should be assessed by a delayed posttest. Introducing a retention period between the end of practice and the assessment of knowledge allows the negative and positive influences to dissipate, revealing the true amount of change in skill performance due to the educational intervention.
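As a concrete illustration of the randomization and delayed-posttest logic described above, the following sketch allocates participants to groups at random and treats the retention test, rather than the immediate posttest, as the index of learning. The group names and scores are invented for illustration only:

```python
import random
from statistics import mean

def randomize(participants, n_groups=2, seed=42):
    """Randomly allocate participants to groups so that unanticipated
    confounders are, on average, evenly distributed across groups."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    return [shuffled[i::n_groups] for i in range(n_groups)]

groups = randomize(range(12))  # two groups of six trainees

# Hypothetical skill scores (0-100) at three assessment points.
scores = {
    "pretest":   {"intervention": [41, 38, 44], "control": [40, 42, 39]},
    "posttest":  {"intervention": [71, 69, 75], "control": [48, 51, 47]},
    "retention": {"intervention": [66, 64, 70], "control": [47, 50, 46]},
}

def gain(group, test="retention"):
    """Mean change from pretest; the delayed (retention) test, not the
    immediate posttest, is taken as the index of learning."""
    return mean(scores[test][group]) - mean(scores["pretest"][group])

# Learning attributable to the intervention, over and above any
# improvement produced by repeated testing alone.
learning_effect = gain("intervention") - gain("control")
```

Note that in these invented data the immediate posttest gain exceeds the retention gain, mirroring the transient facilitating effects discussed above.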
9  Research in Surgical Education: A Primer
The strength of randomized controlled trials, when applied to medical and surgical education, is that they control for many unanticipated factors. However, this approach does not allow the researcher to study larger-scale educational interventions, such as the effectiveness of various educational curricula. Observational studies may provide better evidence than randomized controlled trials when addressing this type of question. In particular, cohort studies are a type of observational study that may be used to compare the performance of students undergoing a specific curriculum with the performance of other students undergoing a different curriculum. This comparison could be made across various educational institutions or various historical cohorts.

Another, less common, observational approach used in medical and surgical education research is the case–control study. This approach requires that the researcher identify outcomes or performance characteristics in a specific population of trainees and then track educational exposures back in time in order to find a predictor of this outcome. One possible example is the observation that a group of residents who perform well on a set of basic surgical skills were members of the surgical skills interest group in the first and second years of their undergraduate medical education, which allowed them to participate in many research studies as well as in numerous surgical skills workshops. Overall, observational studies generate weaker evidence than randomized controlled trials because of their inability to control intervening variables; sometimes, however, this is the only systematic experimental approach for answering certain questions of interest.

The generation of a quantitative research question for searching the literature and conducting research needs different structuring than does the generation of a qualitative question (refer to Table 9.1).
First, although the nature of the population that one wants to study still needs to be addressed, the additional components necessary for the generation of searchable and testable research questions are the specification of the interventions and the expected outcomes. Second, a crucial step in formulating questions addressing the intervention is the inclusion of appropriate control groups. For example, when investigating the effectiveness of a specific type of simulation-based training, one control group may receive didactic training. However, the inclusion of this control group does not address the question of the effectiveness of the particular simulation-based training, but rather the effectiveness of
hands-on practice in general. A more effective control group would practice the same skill in an alternative simulated environment. The inclusion of this control group not only addresses the question of whether hands-on practice within the simulated environment improves technical skills performance, but also assesses the effectiveness of the particular simulated approach when it is compared with other simulated approaches. Third, a key feature in formulating a quantitative research question is addressing the outcomes. Whether performing a literature search or generating a testable hypothesis, the researcher needs to know which aspects of performance should be assessed, who should make these assessments, and when they should be made.

Working example: “Does hands-on practice improve technical performance of anastomotic skills?” Although on the surface this question appears well structured, it does not adhere to the outlined principles for generating a quantitative research question. More specifically, the elements of the question do not define the population of participants, the specifics of the experimental intervention, or details about the outcomes. A question that is searchable and testable would include all three elements. If one assumes that the researchers were interested in comparing the learning of bowel anastomosis skills in junior and senior residents, that they wanted to compare the effectiveness of high-fidelity and low-fidelity training, and finally that they used expert-based rating systems to assess clinical outcomes such as leakage of the anastomosis, a more defined question would be: “Does practice on high- and low-fidelity models differentially affect the final products of junior and senior general surgery residents?”
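One lightweight way to keep the required question elements explicit is to carry them as structured fields and assemble the question from them. The helper class and wording below are hypothetical, not part of the chapter:

```python
from dataclasses import dataclass

@dataclass
class QuantitativeQuestion:
    """Components of a searchable, testable quantitative question."""
    population: str
    intervention: str
    comparison: str   # the control condition, as discussed in the text
    outcome: str

    def phrase(self):
        """Assemble the components into a single focused question."""
        return (f"Does {self.intervention}, compared with {self.comparison}, "
                f"improve {self.outcome} in {self.population}?")

q = QuantitativeQuestion(
    population="junior and senior general surgery residents",
    intervention="practice on high-fidelity bowel anastomosis models",
    comparison="practice on low-fidelity models",
    outcome="expert-rated quality of the final anastomosis",
)
print(q.phrase())
```

Making each field explicit forces the researcher to decide on the population, the control condition, and the outcome before the literature search begins.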
9.3 Research Design

Today’s educational research in medicine and surgery is devoted to examining whether specific educational programs or manipulations improve clinical performance. For example, researchers may wish to examine whether a new educational program leads to more proficient performance in the clinical setting. The existence of such a cause–effect relationship requires that two conditions be met [15]. First, the researcher must observe changes in the outcome after, rather than before, the institution of the program. Second, the researcher must ensure that the program is the only reasonable explanation for the
changes in the outcome measures observed. If there are any other alternative explanations for the observed changes in outcomes, the researcher cannot be confident that the presumed cause–effect relationship is correct. Undeniably, in most educational research, showing that no alternative explanations can be applied to the findings is very difficult. Many factors outside the researchers’ influence may explain changes in outcomes [15, 16]; examples include historical or ongoing events occurring at the same time as the program, which may have a direct or indirect impact on learning, improvement of skills, or development of a knowledge base. Collectively, these recognized or unrecognized alternative explanations for the findings are known as threats to internal validity. One possible way to eliminate or mitigate these threats is to maintain a rigorous approach to research design and methodology [15, 17].
Design is by far the most powerful method of ruling out alternative explanations; this approach, therefore, warrants a more detailed formal expansion in the following section. Still, a number of alternative explanations cannot be eliminated by implementing research design strategies. To deal with these, the researcher can use various statistical analyses performed on the collected data. For example, a recent study by Brydges et al. [15] investigated the relationship between postgraduate year of training and proficiency in knot-tying skills. It was possible that some of the senior residents spent more time in the operating room than did junior residents, which was considered an uncontrolled variable; the actual number of hours spent in the operating room by every participant, therefore, was collected from their logbooks and used as a covariate in subsequent analyses. The results showed that the year of training, not the actual number of hours spent in the operating room, explained the differences in skill proficiency.
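The covariate-adjustment idea in the Brydges et al. example can be sketched as a small ordinary least squares fit that includes logged operating-room hours alongside year of training. The data below are synthetic, generated without noise from score = 40 + 8 × year + 0.05 × hours, so the regression recovers those coefficients exactly; nothing here reproduces the actual study data:

```python
def solve(A, b):
    """Solve a small linear system A x = b by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ols(X, y):
    """Ordinary least squares via the normal equations X'X b = X'y."""
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)

# Columns: intercept, postgraduate year, hours logged in the operating room.
X = [[1, y, h] for y, h in [(1, 50), (1, 80), (2, 90), (2, 120),
                            (3, 150), (3, 110), (4, 160), (4, 200)]]
# Synthetic, noiseless scores built from known coefficients.
scores = [40 + 8 * r[1] + 0.05 * r[2] for r in X]

b0, b_year, b_hours = ols(X, scores)
```

With real data, a near-zero coefficient on hours alongside a large coefficient on year would mirror the study's conclusion that training year, not operating-room time, explained proficiency.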
9.3.1 Minimizing Threats to Validity

Minimizing threats to internal validity is an essential feature of any well-constructed experiment. In this section, we first discuss three approaches: argument, analysis, and design. These three approaches are not mutually exclusive; indeed, a good research plan should make use of multiple methods for reducing threats to validity. Ultimately, the better the research design, the less the threat to validity; therefore, we will expand on this issue with an outline of some common experimental designs, along with their strengths and weaknesses.

It is common to rationalize, or argue, why a specific threat to internal validity may not merit serious consideration, but this is the least effective way to deal with such threats. This approach should only be used when the particular threats to validity have been formally investigated in prior research. For example, if an improvement in skill proficiency due to a specific educational program is revealed by assessments made by a single expert using a standardized set of assessment instruments, one may argue that a threat to the internal validity of the findings due to a single assessor is unlikely, because previous research shows that when performance is evaluated with the same set of assessment instruments by a number of experts, the assessments tend to be very similar.
9.3.2 Design Construction

As already mentioned, proper research design is by far the most powerful method to prevent threats to internal validity. Most research designs can be conceptualized and represented graphically with four basic elements: time, intervention(s), observation(s), and groups. In design notation, time is represented horizontally, intervention is depicted with the symbol “X,” assessments and observations are depicted by the symbol “O,” and each group is indicated on a separate line. Most importantly, the manner in which groups are assigned to the conditions can be indicated by a letter: “R” represents random assignment, “N” represents nonrandom assignment (i.e., a nonequivalent group or cohort), and “C” represents an assignment based on a cutoff score. The most basic causal relationship between an educational intervention and an outcome can be measured by assessing the skill level of the particular group of trainees after the implementation of the educational intervention. Using the outlined notation, the research design would be the following:

X   O
This is the simplest design in causal research and serves as a starting point for the development of better strategies. This design does not control well for threats to the internal validity of the study. Specifically, one cannot confidently say that the observed skill level at the end of the study is any different from the skill level of the participants before the study. One also cannot rule out any historical events that may have led to changes in performance.

When it is possible to deliver the educational intervention to all participants, a number of strategies are available to control for threats to the internal validity of the study. One can include additional observations either before or after the program, add to or remove the program, or introduce different programs. For example, the researcher might add one or more preprogram measurements:

O   O   X   O
The addition of such pretests provides a “baseline,” which, for instance, helps to assess the potential for a maturation or testing threat. Specifically, the researcher can assess the differences in the amount of improvement on a skill between the first two tests and between the second pretest and the posttest. Similarly, additional posttest assessments could be added, which would be useful for determining whether an immediate program effect decays over time or whether there is a lag in time between the initiation of the program and the occurrence of an effect. However, repeated exposure to testing can itself influence the amount of learning that occurs due to the actual educational intervention; conclusions related to these assessments may therefore overestimate the educational impact of the intervention. For this reason, it is suggested that a control group be included in the study design if possible. One of the simplest designs in education research is the two-group pretest/posttest design:

N   O   X   O
N   O       O
The interpretation of this diagram is as follows. The study comprised two groups, with nonrandom assignment of participants to each of the groups. There was an initial assessment of skill level before the program implementation (indicated by the first “O”). Because the participants were not randomly assigned to the two groups, the initial test of the ability to perform the skill was crucial to ensure that any differences found due to the educational program were not present before the study. Subsequently, participants in the first group received the educational program (indicated by “X”), while participants in the second group did not. Finally, all participants’ skill levels were measured after the completion of the program.

Sometimes, the initial pretests of the ability to perform the skills in question may be viewed as a contaminating factor. That is, the exposure to the test may be enough practice for the group which did not receive the educational intervention to learn the skills. Therefore, whenever possible, one should avoid this initial pretest. The posttest-only randomized experimental design depicted below allows the researcher to assume that the randomized assignment of the participants to the two groups ensures an equal level of knowledge across the two groups; therefore, no pretests are necessary, and all the differences observed at the end of the study should be attributed only to the educational intervention:

R   X   O
R       O

Educational researchers argue that neither of the two posttest-only designs adequately describes the amount of learning that occurs during the educational intervention [14]. The immediate posttest results, they argue, may be influenced by many transient factors such as fatigue, boredom, or excitement. One strategy to circumvent any influences on the results due to these transient factors is the introduction of a delayed posttest, in which case the researcher allows participants to take a rest from the study, typically termed a retention interval (this interval may vary, but it should definitely be longer than the educational intervention):

R   X   O   O
R       O   O
While these approaches are a valid and functional first approximation of the effectiveness of educational programs, they are limited. Specifically, they assess the effectiveness of an educational intervention or program only against the absence of any alternative intervention. The much more challenging and more informative approach is to assess the effectiveness of an educational intervention when compared with a different intervention:

O   X1   O
O   X2   O
In this diagram, the two groups are assessed before and after the intervention, and each group undergoes a different type of intervention. Assuming that the performance on the skill and the amount of knowledge are the same across the two groups on the pretest, any differences found on the posttest would be a consequence of the specific intervention.

Frequently, the inclusion of additional groups in the design may be necessary in order to rule out specific threats to validity. For example, the implementation of an educational program within a single educational institution may lead to unwanted communication between the participants, or group rivalry, possibly posing threats to the validity of the causal inference. The researcher may be inclined to add a nonequivalent group from a similar institution; the use of many nonequivalent groups helps to minimize the potential of a particular selection bias affecting the results:

R   O   X   O
R   O       O
N   O       O
Cohort groups may also be used in a number of ways. For example, one could use a single measure of a cohort group to help rule out a testing threat:

N           O
R   O   X   O
R   O       O
In this design, the randomized groups might be residents in their first year of residency, while the cohort group might consist of the entire group of first-year residents from the previous academic year. This cohort group did not take the pretest and, if they are similar to the randomly selected control group, they would provide evidence for or against the notion that taking the pretest had an effect on posttest scores. Another possibility is to use pretest/posttest cohort groups:

N   O   X   O
N   O       O
N   O

Here, the treatment group consists of first-year residents, the first comparison group consists of second-year residents assessed in the same year, and the second comparison group consists of the following year’s first-year residents (i.e., fourth-year medical students at the time of the study).
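The designs discussed in this section can be written down compactly in the X/O/R/N notation. The following minimal helper (illustrative only, not from the chapter) stores each group as a row of symbols read left to right in time, using a blank where a group skips an intervention or observation:

```python
# A handful of the designs from this section, one row per group.
DESIGNS = {
    "one-group posttest only":          [["X", "O"]],
    "one-group pretest/posttest":       [["O", "O", "X", "O"]],
    "two-group pretest/posttest":       [["N", "O", "X", "O"],
                                         ["N", "O", " ", "O"]],
    "posttest-only randomized":         [["R", "X", "O"],
                                         ["R", " ", "O"]],
    "randomized with delayed posttest": [["R", "X", "O", "O"],
                                         ["R", " ", "O", "O"]],
}

def render(design):
    """Render a design as aligned rows, one line per group."""
    return "\n".join("  ".join(row) for row in design)

# Treatment group on the first row, control group on the second.
print(render(DESIGNS["posttest-only randomized"]))
```

Encoding designs this way makes it easy to check, column by column, which groups are observed at each point in time.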
9.3.3 The Nature of Good Design

In the preceding section, we have proposed several generally accepted research designs that are particularly applicable to educational research; still, we encourage researchers to be innovative in order to address their specific questions of interest. The following strategies may be used to develop good, custom-tailored designs. First, and most important, a research design should reflect the theories being investigated and should incorporate specific theoretical hypotheses. Second, a good research design should reflect the settings of the investigation. Third, a good research design should be flexible; this can be achieved by duplication of essential design features, though a balance should be maintained between redundancy and over-design.
9.4 Measures (Experimental Research)
In most examples of quantitative educational research, there are two categories of variables: dependent and independent. Dependent variables are the measures
that will vary or be impacted in some fashion according to changes or fluctuations in other parameters, or independent variables. For example, if we wanted to study the impact of medical school grades (GPA) and standardized admission tests (MCAT) on performance in surgical residency, we might set up a correlational study in which we define success in surgical residency as the score on a final examination (FINAL) at the end of training. We would then define a regression equation to evaluate the extent to which the independent variables of GPA and MCAT could predict the dependent variable of FINAL.

While the selection of independent variables is critical for effectively addressing a research question, the appropriateness of the measures used to quantify performance is just as critical. One of the challenges of educational research is to identify and develop dependent variables that adequately represent whether learning has taken place. Approaches to this involve demonstrating that an existing tool is valid or, if a suitable tool does not yet exist, embarking upon the creation and validation of such a tool.
9.4.1 Developing an Instrument
The first phase in the development of a new tool involves getting experts to pool their knowledge. One approach used to achieve this is called the Delphi technique, developed by the RAND Corporation in the late 1960s as a forecasting methodology. The Delphi technique has developed into a tool that allows a group of experts to reach consensus on factors that are subjective [18–21]. This technique involves several steps. The first step is the selection of a facilitator to coordinate the process. This is followed by the selection of a panel of experts who will develop a list of criteria; this step is called a round. There are no “correct” criteria, and input from people outside the panel is acceptable. Each member of the panel independently ranks the criteria; then, for each item on the list, a mean ranking is calculated, and the item is listed according to its ranking. The panel of experts can then discuss the rankings, and the items are anonymously reranked until stabilization of rankings occurs. If we are to use the example of the development of an evaluation tool for the performance of a skill like
Z-plasty, a group of experts would identify the most important steps in this process. This information would be collected from the experts, and then the steps would be ordered and distributed to the group for ranking. This process does not necessarily require a face-to-face meeting, but could be conducted through e-mail. Following this, the experts would rank the importance of each step in the procedure. These results would be averaged and redistributed by the facilitator for a second ranking. If stabilization of the ranking of the steps was achieved, then the list would be completed. It is argued that expert-based measures are always going to be subjective and prone to contributing to the error variance associated with the human perceptual and decision-making system. In response to this, the development of computer-based measures has been evolving. An example of this is the use of motion analysis systems, which are being used to obtain measures of hand motion efficiency (movement time and number of movements). These measurement systems have been shown capable of discriminating between expert and novice surgical performance [22–24]. Regardless of how a measure is generated, it must then be scrutinized in terms of its basic psychometric properties. It is generally thought that all competence measures need to be feasible, valid, and reliable.
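The ranking-and-stabilization loop of a Delphi round can be operationalized in several ways; one simple sketch, assuming NumPy and SciPy and using an invented two-round ranking of six Z-plasty steps by five experts, compares the mean rank orders of consecutive rounds. The "rho > 0.9" stabilization threshold is our illustrative choice, not a standard from the Delphi literature.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical Delphi rounds: five experts independently rank six steps
# of a Z-plasty (1 = most important). Rows are experts, columns steps.
round1 = np.array([[1, 2, 3, 4, 5, 6],
                   [2, 1, 3, 5, 4, 6],
                   [1, 3, 2, 4, 6, 5],
                   [2, 1, 4, 3, 5, 6],
                   [1, 2, 3, 5, 4, 6]])
round2 = np.array([[1, 2, 3, 4, 5, 6],
                   [1, 2, 3, 4, 5, 6],
                   [2, 1, 3, 4, 6, 5],
                   [1, 2, 4, 3, 5, 6],
                   [1, 2, 3, 5, 4, 6]])

# Mean rank per step in each round; the facilitator redistributes these.
mean1, mean2 = round1.mean(axis=0), round2.mean(axis=0)

# One simple stabilization check: the rank order of the mean rankings is
# essentially unchanged between consecutive rounds.
rho, _ = spearmanr(mean1, mean2)
stable = rho > 0.9
print(mean1, mean2, stable)
```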
9.4.2 Feasibility
All measures of competence or new testing modalities must be achievable in terms of logistics, costs, and manpower. This is as important in the paradigm of experimentation as it is in the elaboration of an evaluative mechanism for a program. For example, some assessment methods, such as a multiple-choice examination, are cost-efficient, easy to deliver, easily scorable, and deliverable through a variety of mechanisms, such as paper and pencil, digital delivery, or web administration. Others, such as an OSCE, can be costly, labor-intensive, and logistically challenging. There is almost always a fine balance between feasibility considerations, reliability, and validity. Generally speaking, especially in terms of validity, the more valid instruments are also among the most costly and logistically challenging.
9.4.3 Validity
The next step in the process of developing a dependent variable useful for educational research involves establishing the validity and reliability of this newly established instrument. Validity reflects the degree to which an instrument actually measures what it is intended to measure, and reliability refers to the repeatability of a measure. If an instrument is found to be unreliable, then it cannot be considered valid, though it is also possible for an instrument to be reliable and still not be valid since it is repeatedly measuring the wrong construct. There are four types of validity: logical validity, content validity, criterion validity, and construct validity, each of which requires explication. Logical validity is established when there is an obvious link between the performance being measured and the nature of the measurement instrument. For example, a measure of patient safety logically should be linked with improvements in surgical performance. While it is important to establish logical validity, a more objective method for establishing validity is required for educational research. A second type of validity is content validity, which is particularly relevant for educational research. An instrument has content validity if it samples the content of a course equally throughout the duration of the learning experience. For example, if a technical skills course covered a range of suturing techniques (surface, at depth, laparoscopic), then all of these types of suturing should be evaluated using the measurement tool. This type of validity is, however, often qualitative in nature and must be accompanied by additional evaluations of validity. Criterion validity refers to the extent to which an instrument’s measures correlate with a gold standard. The concept of criterion validity can be subdivided into two categories: concurrent validity and predictive validity.
Concurrent validity refers to the situation in which the gold standard and the newly established instrument are administered simultaneously. Currently, new computer-based measures of technical performance are being compared to more established instruments such as the OSATS. Predictive validity refers to the extent to which the measures generated from an instrument can predict future performance; an example would be the validation of an instrument that could be used to screen applicants to a surgical program and successfully predict success in surgical training. The mainstay of predictive validity
A. Dubrowski et al.
measures are correlational studies, and often multiple predictors are used in a regression equation to ascertain the relative contribution of the different predictors to the dependent variable. The final type of validity is construct validity. This concept addresses the degree to which an instrument measures the psychological trait it purports to measure. This may be fairly straightforward or relatively complex. For example, we might discover that a test we thought was measuring basic understanding of a scientific concept was really measuring the more rudimentary ability of reading comprehension. In that circumstance, we would say the test lacked construct validity because it was really not measuring what we thought it was measuring. Another way of looking at construct validity is predicated on the assumption that skill in a domain increases with experience. Therefore, if we find that scores on a test of surgical skills systematically increase with the level of a trainee, we make the inference that the test is construct valid. The comparisons across practice time or across groups could be made using a t-test or an analysis of variance (if the data have been shown to be normally distributed; otherwise an analogous nonparametric test could be used). Correlations can also be used in determining construct validity, particularly when examining relationships between constructs.
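The experience-based construct validity check described above can be sketched as a two-group comparison, assuming SciPy is available; the checklist scores below are invented for illustration only.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical checklist scores (max 20) for novices and experts on the
# same task; construct validity predicts that experts score higher.
novices = np.array([9.0, 11.0, 10.5, 8.5, 12.0, 10.0, 9.5, 11.5])
experts = np.array([16.0, 17.5, 15.5, 18.0, 16.5, 17.0, 15.0, 18.5])

# Independent-samples t-test (appropriate only if scores are roughly
# normally distributed; otherwise a nonparametric analogue is used).
t, p = ttest_ind(experts, novices)
discriminates = (p < 0.05) and (experts.mean() > novices.mean())
print(t, p, discriminates)
```

A significant difference in the expected direction supports, but does not by itself prove, construct validity; it is one line of evidence among several.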
9.4.4 Reliability
Reliability refers to the precision of a measurement. For example, a highly reliable test will be one that orders candidates on a particular dimension consistently and repeatedly. It is usually reported as an index, one common index being Cronbach’s alpha. Indices like Cronbach’s alpha employ the general principle that individuals knowledgeable or talented in a particular domain will show evidence of that talent throughout a test and across multiple raters. When we obtain a measurement of a person’s performance from a newly developed instrument, we have an observed score that is comprised of two essential elements: an individual’s real performance and the part of the score that can be attributed to measurement error. This error can be related to the participant, the testing situation, the scoring system, or instrumentation. For example, subject error variance can be influenced
by factors such as mood, fatigue, previous practice, familiarity with the test, and motivation. The testing situation can also introduce error through factors such as the clarity of the instructions, the quality of the test situation (i.e., is it quiet? is there a class going on in the same room?), or the manner in which the results will be used. The scoring system is only as good as the experts who do the scoring: all experts may not be of the same skill level, and some may be more experienced than others when judging performance. Also, in a computer-based approach to evaluation, the calibration of the equipment can lead to obvious instrumentation error; or, if a checklist or global rating score being used is not sensitive enough to discriminate between skill levels (a validity problem), this will contribute to the error variance. Reliability may be estimated through a variety of methods that fall into two types: multiple-administration (test-retest method, alternative-form method) and single-administration (split-halves method, internal consistency method) (Fig. 9.1). Thus, the score variance is a combination of true score variance and error variance. The coefficient of reliability is the ratio of true score variance to observed score variance. However, since true score variance is never known, it is estimated by subtracting error variance from observed score variance.
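As an illustration of an internal-consistency index, Cronbach's alpha can be computed directly from its variance formula. The sketch below assumes NumPy and uses simulated rating data in which the items deliberately share a common "true skill" component, so alpha should come out high.

```python
import numpy as np

# Simulated ratings: 10 trainees scored on a 5-item global rating scale.
# Each item reflects a shared true-skill component plus independent
# measurement noise. Purely illustrative data.
rng = np.random.default_rng(0)
true_skill = rng.normal(10.0, 2.0, size=(10, 1))
items = true_skill + rng.normal(0.0, 0.8, size=(10, 5))

# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of
# each trainee's total score).
k = items.shape[1]
item_vars = items.var(axis=0, ddof=1)
total_var = items.sum(axis=1).var(ddof=1)
alpha = k / (k - 1) * (1.0 - item_vars.sum() / total_var)
print(round(alpha, 3))
```

Because the error term here is small relative to the true-skill variance, alpha approaches 1; noisier items or fewer items would pull it down.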
When establishing validity, interclass correlations are used, an example of which is Pearson r, used when one is correlating two different variables (e.g., comparing a computer-based measure of surgical performance to a global rating score). However, when establishing reliability, an intraclass correlation must be used because the same variable is being correlated. When an instrument is administered twice and the first administration is correlated with the second administration, the intraclass correlation is calculated (typically using analysis of variance to obtain the reliability coefficient). One aspect of reliability addressed in educational research is stability, which is determined by establishing similar scores on separate days. A correlation coefficient is calculated for two tests separated by time. Through analysis of variance, the amount of variance on the 2 days accounted for by the separate days of testing (as well as the error variance) can be determined. Alternatively, the reliability of two raters can be established by correlating the performance of two sets of evaluation of the same performance, i.e., the technical performance of a trainee can be judged by two independent (and blinded) experts. If the instrument being used is reliable, the two judges will have a high correlation between their scores. There is no definitive rule for categorizing a correlation as high or
Fig. 9.1 Reliability may be estimated through a variety of methods that fall into two types: multiple-administration (test-retest method, alternative-form method) and single-administration (split-halves method, internal consistency method). This figure illustrates the test-retest method: test scores are plotted against re-test scores for a low-reliability and a high-reliability instrument.
low; however, the following scheme can be used as a guideline:
−1.0 to −0.7 strong negative association
−0.7 to −0.3 weak negative association
−0.3 to +0.3 little or no association
+0.3 to +0.7 weak positive association
+0.7 to +1.0 strong positive association
9.5 Acquisition and Analysis of Data (Experimental Research)

9.5.1 Data Collection
Educational research should be seen as an endeavor as important as any other type of research carried out in a Faculty of Medicine, and advocates for educational research should lobby for the necessary laboratory space and resources that are allocated to all medical scientists. It is important to create a controlled and calm data collection environment in order to avoid external events that might influence the participants’ performance (unless, of course, that is the variable of interest). For stability of the data collection session, it is critical that the trainee’s or expert’s performance be evaluated in a quiet room and not just in a corner of a large classroom while other activities are being carried out. This ensures that participants are not distracted by extraneous conversation or events in the environment. It is also critical for the ethics process that the confidentiality of the performance and participation of all participants (the trainee in particular) be protected; this is not possible if the performance takes place in a makeshift environment. Also, for consistency throughout the collection of a project, it is important that video cameras and the like be left in the same position for all participants so that there is consistency in the video tapes that experts will review later for the expert-based evaluations (i.e., check lists, global ratings, and final product analyses). It is also beneficial to have a single research assistant collecting all data, and if there is more than one group (i.e., a control group and an educational intervention group), participants should be allocated in either random or alternating fashion to ensure that any environmental changes that take place over the course of the testing will influence both groups equally. Once the data collection is completed, the process of analysis should begin. The first step in this
Fig. 9.2 This figure illustrates two different sets of data. In both cases, similar correlation and slopes will describe the relationship between variables A and B. Careful inspection of the plots reveals, however, that there is a bimodal distribution of data points in the left panel. When data are reanalyzed separately for each of the two groupings, the correlations may be nonsignificant
process should involve producing a scatterplot of the data for each participant at each level of the independent variable manipulated. A quick look at the plot will provide insight into whether any outliers are present and into the impact of error variance on the variance accounted for by the educational manipulations. It is critical to be familiar with and know one’s data prior to running a series of statistics. If the data patterns do not make sense, it is important to double-check that there have been no errors in the data management process (Fig. 9.2).
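Alongside visual inspection of the scatterplot, a simple numerical screen can flag candidate outliers for follow-up. One common sketch, assuming NumPy, uses robust z-scores based on the median and the median absolute deviation (MAD), which are less distorted by the outlier itself than mean/SD-based scores; the data and the 3.5 cutoff are illustrative choices.

```python
import numpy as np

# Hypothetical completion times (s) for one group; one entry looks like
# a data-entry error worth checking before any inference is run.
times = np.array([182.0, 175.0, 190.0, 168.0, 940.0, 171.0, 185.0, 178.0])

# Robust z-scores: 0.6745 * (x - median) / MAD, with |z| > 3.5 flagged.
med = np.median(times)
mad = np.median(np.abs(times - med))
robust_z = 0.6745 * (times - med) / mad
outliers = np.flatnonzero(np.abs(robust_z) > 3.5)
print(outliers)  # prints [4]
```

A flagged point is a prompt to re-check the raw record, not an automatic license to delete the observation.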
9.5.2 Tests of Normality
Prior to doing any inferential statistics, the normality of the data set should be evaluated (tests of skewness and kurtosis). Skewness refers to the direction of the hump of a curve: if the hump is shifted to the left, the skewness is positive, and if the hump is to the right, the skewness is negative. Kurtosis refers to the shape of the curve and describes how peaked or flat the curve is in comparison to a normally distributed curve. The researcher’s choice of statistical test will vary, depending on the characteristics of the data set. If he uses a parametric test, it is assumed that the data set he has sampled from is normally distributed. There are no assumptions of normality for nonparametric tests. However, it is preferable to use parametric tests as they have more statistical power, thereby increasing the chance of rejecting a false null hypothesis. Many data analysis methods (t-test, ANOVA, regression) depend on the assumption that data were sampled from a normal Gaussian distribution. The best way to
evaluate how far data are from Gaussian is to look at a graph and see if the distribution deviates grossly from a bell-shaped normal distribution. There are statistical tests that can be used to test for normality, but these tests do come with problems. For example, small samples almost always pass a normality test, which has little power to tell whether or not a small sample of data comes from a Gaussian distribution. With large samples, minor deviations from normality may be flagged as statistically significant, even though small deviations from a normal distribution would not affect the results of a t-test or ANOVA (both of which are rather robust to minor deviations from normality). The decision to use parametric or nonparametric tests should usually be made on the basis of an entire series of analyses. It is rarely appropriate to make the decision based on a normality test of one data set. It is usually a mistake to test every data set for normality and use the result to decide between parametric and nonparametric statistical tests. But normality tests can help the researcher understand the data, especially when similar results occur in many experiments.
9.5.3 Three Normality Tests
Most statistics packages contain tests of normality. For example, a commonly used test is the Kolmogorov–Smirnov test, which compares the cumulative distribution of the data with the expected cumulative normal distribution and bases its P value on the largest discrepancy. Other available tests are the Shapiro–Wilk normality test and the D’Agostino–Pearson omnibus test. All three procedures test the same null hypothesis – that the data are sampled from a normal distribution. The P value answers the question “If the null hypothesis were true, what is the chance of randomly sampling data that deviate as much (or more) from Gaussian as the data we actually collected?” The three tests differ in how they quantify the deviation of the actual distribution from a normal distribution.
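All three tests named above are available in SciPy, so the comparison can be sketched directly; the simulated exam scores are illustrative. Note one caveat in the sketch: fitting the normal's mean and SD from the sample makes the classic Kolmogorov–Smirnov P value approximate (Lilliefors' correction addresses this).

```python
import numpy as np
from scipy import stats

# Simulated exam scores drawn from a genuinely normal distribution.
rng = np.random.default_rng(1)
sample = rng.normal(loc=50.0, scale=10.0, size=80)

# Kolmogorov-Smirnov against a normal with the sample's own mean/SD.
ks = stats.kstest(sample, "norm", args=(sample.mean(), sample.std(ddof=1)))
sw = stats.shapiro(sample)        # Shapiro-Wilk
dp = stats.normaltest(sample)     # D'Agostino-Pearson omnibus

print(ks.pvalue, sw.pvalue, dp.pvalue)
```

With truly Gaussian data, all three P values will usually (though not always) exceed 0.05; as the text notes, a small P value from a large sample may reflect a deviation too minor to matter for a t-test or ANOVA.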
9.5.4 Categories of Statistical Techniques
There are two main categories of statistical techniques: those used to test relationships between variables within a single group of participants (e.g., correlation
or regression), and those used to evaluate differences between or among groups of participants (e.g., t-test or ANOVA). The purpose of correlation is to determine the relationship between two or more variables. The correlation coefficient is the quantitative value of this relationship; it can range from −1.0 to +1.0, with ±1.0 representing a perfect correlation. When there is one criterion (or dependent) variable and one predictor (or independent) variable, a Pearson product moment coefficient is calculated. However, when there is one criterion and two or more predictor variables, then multiple regression is used. While we often use a simple correlation, in reality, the prediction of educational success typically involves multiple variables. There are various methods for introducing the variables into a multiple regression analysis, including forward selection, backward selection, maximum R-squared, and stepwise. Each of these methods differs in terms of the order in which the variables are added and how the overlapping variance between variables is treated. In experimental research, the levels of the independent variables are established by the experimenter. For example, an educational researcher might want to evaluate the effects of two types of simulator training on surgical performance. The purpose of the statistical test is to evaluate the null hypothesis (H0) at a specific level of probability (P < 0.05). That is, do the two levels of treatment differ to an extent that the difference would be attributable to chance no more than 5 times in 100? The statistical test is always of the null hypothesis. Statistics can only reject or fail to reject the null hypothesis; they cannot accept the research hypothesis, i.e., statistics can only determine if the groups are different, not why they are different. Only appropriate theorizing can do that.
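The null-hypothesis logic described above can be sketched with SciPy for the simulator-training example: a two-sample t-test for two treatments, an omnibus one-way ANOVA for three, followed by a conservative posthoc only when the omnibus test is significant. The completion times are invented, and `scipy.stats.tukey_hsd` assumes SciPy 1.8 or later.

```python
import numpy as np
from scipy import stats

# Hypothetical post-training completion times (s) for three simulator
# groups of six trainees each; group 1 was trained most effectively.
g1 = np.array([110.0, 105.0, 98.0, 112.0, 101.0, 107.0])
g2 = np.array([150.0, 142.0, 155.0, 147.0, 151.0, 146.0])
g3 = np.array([148.0, 153.0, 145.0, 150.0, 156.0, 143.0])

# Two treatments: two-sample t-test of the null hypothesis at P < 0.05.
t, p_t = stats.ttest_ind(g1, g2)

# Three treatments: omnibus one-way ANOVA, then a conservative posthoc
# (Tukey HSD) only if the omnibus test is significant.
f, p_f = stats.f_oneway(g1, g2, g3)
if p_f < 0.05:
    posthoc = stats.tukey_hsd(g1, g2, g3)   # pairwise P values in .pvalue
    print(p_t, p_f, posthoc.pvalue)
```

Here the posthoc matrix would show group 1 differing from groups 2 and 3, while groups 2 and 3 do not differ from each other.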
One method of making these comparisons when evaluating two treatments is the t-test. An ANOVA is an extension of the t-test that allows for the evaluation of the null hypothesis among two or more groups, as long as the groups are levels of the same independent variable. For example, the effects of three types of simulation training on surgical performance can be compared. For both t-tests and ANOVA, the comparisons can be either across groups or within a group. An example of a within-group or repeated-measures comparison would be multiple samples as a function of time. For a t-test, this could be a pretest/posttest comparison within the same group; for an ANOVA, there could be multiple samples as a function of practice time. Or, the comparisons
can be across groups, i.e., performance of groups that practice under different conditions can be compared. If there is more than one manipulation of an independent variable, a factorial ANOVA is used – for instance, when three training groups are compared both before and after training. To evaluate this design, a two-factor ANOVA would be used that allows for the testing of the group by test interaction. It would be predicted that, for all groups, there would be no differences between the training groups in the pretest since no intervention would have yet been introduced. However, for the posttest, the effects of the various simulators would be apparent (Fig. 9.3). When there are more than two levels to an ANOVA, any statistically significant effects require that a posthoc test be applied to determine the level at which the statistical difference exists. That is, if there is a main effect or interaction involving three or more means, a test needs to be conducted to find which means differ. A wide range of tests can be used to posthoc a significant ANOVA effect, ranging from very conservative to very liberal tests. In the field of medical education, often a more conservative posthoc is used, such as the Tukey HSD. Posthoc tests can only be used when the “omnibus,” or overall, ANOVA found a significant effect. If the ANOVA is nonsignificant, one cannot go further with the posthoc analysis.

Fig. 9.3 This figure illustrates hypothetical results of an experiment in which three groups of trainees participated in three different educational interventions; each panel plots the time taken to complete the task before (pre) and after (post) training for groups 1–3. (a) Shows a main effect for test, where the analyses suggest that, overall, the trainees performed better (in less time) after training. (b) Shows a main effect for group, meaning that at least one of the three groups performed differently than did the other two. Posthoc analyses revealed that group 1 performed in a shorter time than did groups 2 and 3. (c) Shows an interaction between group and test. Specifically, posthoc analyses demonstrate that group 1 was the only one that benefited from the intervention; groups 2 and 3 did not. Therefore, interpreting the main effects without considering the significant interaction may have led to wrong conclusions.

9.5.5 Nonparametric Analyses
If the data are not normally distributed, an analogous series of nonparametric tests, equivalent to their parametric partners, can be used to establish correlational relationships and comparisons between groups.
9.5.6 Relationships Between Variables
To determine a relationship between two variables, a correlation coefficient is calculated. Nonparametric equivalents to the standard correlation coefficient are Spearman R, Kendall Tau, and the coefficient Gamma (see Nonparametric correlations). If the two variables of interest are categorical in nature (e.g., “passed” vs. “failed” by “male” vs. “female”), appropriate nonparametric statistics for testing the relationship between the two variables are the Chi-square test, the Phi coefficient, and the Fisher exact test. In addition, a simultaneous test for relationships between multiple cases is available, the Kendall coefficient of concordance, which is often used for expressing interrater agreement among independent judges rating (ranking) the same stimuli.
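Several of the association tests above can be sketched with SciPy; the paired ratings and the 2×2 pass/fail table are invented for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical paired ratings of ten performances by two measures.
a = np.array([3, 1, 4, 2, 6, 5, 8, 7, 10, 9])
b = np.array([2, 1, 5, 3, 6, 4, 7, 8, 9, 10])

rho, p_rho = stats.spearmanr(a, b)    # Spearman R
tau, p_tau = stats.kendalltau(a, b)   # Kendall Tau

# 2x2 table of two categorical variables, e.g. passed/failed by group.
table = np.array([[12, 5],
                  [4, 11]])
chi2, p_chi, _, _ = stats.chi2_contingency(table)
odds, p_fisher = stats.fisher_exact(table)
print(rho, tau, p_chi, p_fisher)
```

The Fisher exact test is preferred over Chi-square when expected cell counts are small.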
9.5.7 Differences Between Independent Groups
The nonparametric equivalents of an independent t-test are the Wald-Wolfowitz runs test, the Mann-Whitney U test, and the Kolmogorov–Smirnov two-sample test. If there are multiple groups, instead of using an ANOVA, use the nonparametric equivalents of this method, the Kruskal-Wallis analysis of ranks and the Median test.
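Two of these independent-groups tests can be sketched with SciPy; the skewed performance scores are invented for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical performance scores for two independent training groups.
control = np.array([4, 5, 3, 6, 4, 5, 2, 4])
trained = np.array([7, 9, 6, 8, 10, 7, 9, 8])

# Two independent groups: Mann-Whitney U test.
u, p_u = stats.mannwhitneyu(trained, control)

# Three or more independent groups: Kruskal-Wallis analysis of ranks.
third = np.array([5, 6, 4, 7, 5, 6, 3, 5])
h, p_h = stats.kruskal(control, trained, third)
print(p_u, p_h)
```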
9.5.8 Differences Between Dependent Groups
Tests analogous to the t-test for dependent samples are the Sign test and Wilcoxon’s matched pairs test. If the variables of interest are dichotomous in nature
(i.e., “pass” vs. “no pass”), McNemar’s Chi-square test is appropriate. If there are more than two repeated measurements, instead of using a repeated-measures ANOVA, use the nonparametric equivalents: Friedman’s two-way analysis of variance, or the Cochran Q test (if the variable was measured in terms of categories, e.g., “passed” vs. “failed”).
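The dependent-groups tests can be sketched with SciPy; the repeated checklist scores for the same ten trainees are invented for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical pre/post checklist scores for the same ten trainees.
pre  = np.array([10, 12, 9, 11, 13, 10, 8, 12, 11, 9])
post = np.array([14, 15, 13, 14, 16, 13, 12, 15, 15, 12])

# Two repeated measurements: Wilcoxon matched-pairs test.
w, p_w = stats.wilcoxon(pre, post)

# Three or more repeated measurements: Friedman two-way analysis of
# variance by ranks (a mid-training measurement added for illustration).
mid = np.array([12, 13, 11, 12, 15, 11, 10, 14, 13, 10])
chi2, p_f = stats.friedmanchisquare(pre, mid, post)
print(p_w, p_f)
```

McNemar's test for paired dichotomous outcomes is not in SciPy itself; it is available in the statsmodels package.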
9.6 Funding, Dissemination, and Promotion
Educational research will be judged alongside other forms of scholarly inquiry. As such, one of the essential components of “good educational research” is the capture of peer-reviewed funding. This element, perhaps more than any other, will get a young surgical researcher on track with his or her quest to focus on the science of surgical education. Procuring a grant has many obvious benefits. First and foremost, grant monies bring added value to an institution. Second, and equally important, they pay for the work, obviating the need for education to be a “poor second cousin” relying on discretionary handouts to fund its activities. Third, a well-constructed grant will contain the basics of a sound approach to answering the question at hand. It will place new work to be done in the context of existing knowledge and will define a series of methodological steps necessary to affirm or deny an educational hypothesis. Finally, and most importantly, the majority of grants are awarded through a peer review process. This element, more than any other, is a testament to the reputation of the research team, their previous track record, the integrity of the research proposal, and the impact of the work. Once the work is performed, it is essential that it get into print. “Getting published” in journals with a credible impact factor is an essential measurable product that will be valued by surgical chairs and promotion committees alike. A frequently asked question concerns the best place to publish work done in surgical education. Generally, there are five categories of journals that will accept articles focusing on surgical education. These include journals that focus on specific issues in medical education (for example, Applied Measurement in Education); journals that focus on general issues in medical education, such as Teaching and Learning in Medicine; journals that include a
specific emphasis on issues in surgical education, such as the Journal of the American College of Surgeons; journals that are “disease specific,” such as the Journal of Surgical Oncology; and finally, journals that focus on broad issues in medical science, such as the New England Journal of Medicine. There is no obvious, optimal target for research work in surgical education. From one perspective, the broader the readership and the higher the impact factor, the better. However, there is merit in remaining very focused and becoming a “real expert” in a narrow field. Whatever the choice, the old adage that work not published is not really credible work is a fundamental part of our academic fabric. Allied to the published work, of course, is the presentation of novel work at academic societies, an activity which is often tied to or proceeds from academic publication. It has often been said that education is the orphan child of our academic tripartite mission. And to a certain extent this has, in the past, been true. Largely, this attitude is the product of individuals’ choice of surgical education as an academic focus by default, rather than by design. Hence, the “surgical educators” of a generation ago were often individuals whose laboratory work had run into difficulty, who transitioned into education during the final stages of their careers, or who branded themselves as clinical surgeons with an “interest in teaching.” This is contrasted today by a cadre of young surgeons with a manifest commitment to surgical education from the inception of their careers, who seek out specific training in surgical education – often including graduate-level education – who have protected time for their research, and who work in an environment with an infrastructure supportive of educational science. In this modern context, the orphan child has been adopted. Another common question is whether work done in education “counts” in promotion and tenure decisions.
The answer to this is an unequivocal “yes,” if that work is scholarly. Being a good teacher, holding administrative posts in education, mentoring surgical students, and participating in non-peer-reviewed education conferences will have some, but limited, value in the promotion process. By contrast, efforts in education that include the capture of a peer-reviewed grant, that conform to basic accepted experimental principles, and that result in communication in published form will absolutely count for promotion in the overwhelming majority of universities.
References 1. Reznick RK, Blackmore DE, Dauphinee WD et al (1997) An OSCE for licensure: the Canadian experience. In: Scherpbier AJJA, van der Vleuten CPM, Rethans JJ, van der Steeg AFW (eds) Advances in medical education. Kluwer, Dordrecht, The Netherlands, pp 458–461 2. Barrows HS, Tamblyn RM (1977) The portable patient problem pack: a problem-based learning unit. J Med Educ 52:1002–1004 3. Friedman CP, France CL, Drossman DD (1991) A randomized comparison of alternative formats for clinical simulations. Med Decis Making 11:265–272 4. Stern Z (2003) Opening Pandora’s box: residents’ work hours. Int J Qual Health Care 15:103–105 5. Carlin AM, Gasevic E, Shepard AD (2007) Effect of the 80-hour work week on resident operative experience in general surgery. Am J Surg 193:326–329; discussion 329–330 6. Leach DC (2004) A model for GME: shifting from process to outcomes. A progress report from the Accreditation Council for Graduate Medical Education. Med Educ 38: 12–14 7. Pickersgill T (2001) The European working time directive for doctors in training. BMJ 323:1266 8. Giacomini MK, Cook DJ (2000) Users’ guides to the medical literature: XXIII. Qualitative research in health care A. Are the results of the study valid? Evidence-Based Medicine Working Group. JAMA 284:357–362 9. Giacomini MK, Cook DJ (2000) Users’ guides to the medical literature: XXIII. Qualitative research in health care B. What are the results and how do they help me care for my patients? Evidence-Based Medicine Working Group. JAMA 284:478–482 10. Thorne S (1997) The art (and science) of critiquing qualitative research. In: Morse JM (ed) Completing a qualitative project: details and dialogue. Sage, Thousand Oaks, CA, pp 117–132 11. Mays N, Pope C (1995) Rigour and qualitative research. BMJ 311:109–112
12. Lingard L (2007) The rhetorical ‘turn’ in medical education: what have we learned and where are we going? Adv Health Sci Educ Theory Pract 12:121–133 13. Mays N, Pope C (2000) Qualitative research in health care. Assessing quality in qualitative research. BMJ 320:50–52 14. Dubrowski A (2005) Performance vs. learning curves: what is motor learning and how is it measured? Surg Endosc 19:1290 15. Brydges R, Kurahashi A, Brummer V et al (2008) Developing criteria for proficiency-based training of surgical technical skills using simulation: changes in performances as a function of training year. J Am Coll Surg 206:205–211 16. Campbell DT, Fiske DW (1959) Convergent and discriminant validation by the multitrait-multimethod matrix. Psychol Bull 56:81–105 17. Campbell DT, Stanley JC (1963) Experimental and quasiexperimental designs for research. Rand McNally, Chicago, IL 18. Cook TD, Campbell DT (1979) Quasi-experimentation: design and analysis issues for field settings. Rand McNally, Chicago, IL 19. Hyde WD (1986) How small groups solve problems. Ind Eng 18:42–49 20. Judd CM, Kenny DA (1981) Estimating the effects of social intervention. Cambridge University Press, Cambridge, MA 21. Madu CN, Kuei C-H, Madu AN (1991) Setting priorities for the IT industry in Taiwan – a Delphi study. Long Range Plann 24:105–118 22. Datta V, Chang A, Mackay S et al (2002) The relationship between motion analysis and surgical technical assessments. Am J Surg 184:70–73 23. Kyle Leming J, Dorman K, Brydges R et al (2007) Tensiometry as a measure of improvement in knot quality in undergraduate medical students. Adv Health Sci Educ Theory Pract 12:331–344 24. Martin JA, Regehr G, Reznick R et al (1997) Objective structured assessment of technical skill (OSATS) for surgical residents. Br J Surg 84:273–278
10 Measurement of Surgical Performance for Delivery of a Competency-Based Training Curriculum
Raj Aggarwal and Lord Ara Darzi
Contents

10.1 Introduction 115
10.2 Surgical Competence, Proficiency and Certification 116
10.3 Five Steps from Novice to Expert 116
10.4 Taxonomy for Surgical Performance 117
10.5 Assessment of Technical Skills in Surgical Disciplines 118
10.6 Dexterity Analysis in Surgery 118
10.7 Video-Based Assessment in Surgery 119
10.8 Virtual Reality Simulators as Assessment Devices 120
10.9 Comparison of Assessment Tools 122
10.10 Beyond Technical Skill 122
10.11 A Systems Approach to Surgical Safety 122
10.12 The Simulated Operating Theatre 122
10.13 Curriculum Development 124
10.14 Innovative Research for Surgical Skills Assessment 125
10.14.1 Eye-Tracking Technologies 125
10.14.2 Functional Neuro-Imaging Technologies 125
10.15 Conclusions 125
References 126
R. Aggarwal ()
Department of Biosurgery & Surgical Technology, Imperial College London, 10th Floor, QEQM Building, St. Mary's Hospital, Praed Street, London W2 1NY, UK
e-mail: [email protected]

Abstract In order to provide a high-quality health care service to the public, it is essential to employ proficient practitioners who use their tools to the highest of their abilities. Surgery being a craft speciality, the focus is on assessment of technical skill within the operating theatre. However, objective and reliable methods to measure technical skill within the operating theatre have historically been lacking. Numerous research groups, including our own, have reported on the objectivity, validity and reliability of technical skills assessments in surgical disciplines. The development and application of objective and reliable technical skills assessments of surgeons performing innovative procedures within the operating theatre can inform judgements regarding the credentialing of surgeons to integrate novel procedures into their clinical practice.
10.1 Introduction

Within the past decade, the training of a surgical specialist has become a subject of broad public concern. Almost daily articles in the mass media about doctors failing their patients, together with the widespread growth of the internet, have produced a public that is better informed about its choice of medical specialist. Even today, anyone undergoing an operative procedure asks their medical friends "who is the best?", a question traditionally answered in a subjective manner. Though there have been drives to publish mortality rates of individual doctors, departmental units and hospitals for key procedures, these figures can be misleading if case-mix is not taken into account. The desire to strive for surgical excellence requires an appropriate measurement tool before inferences can be made regarding the competence of individuals
T. Athanasiou (eds.), Key Topics in Surgical Research and Methodology, DOI:10.1007/978-3-540-71915-1_10, © Springer-Verlag Berlin Heidelberg 2010
within a profession [1]. Traditional measures of capacity to practise as an independent health care specialist have focused upon written examinations, log book review and interview by senior members of the profession. However, these have been found to suffer from subjectivity, unreliability and bias. Within the surgical specialties in particular, an objective measure of technical skill is essential. With the concomitant development of a competency-based surgical training system, whereby it is the acquisition of skill rather than time spent in training that leads to progression through the curriculum, it is imperative to define feasible, valid and reliable measures of surgical competence [2]. These can not only define benchmark levels of skill to be achieved, but also ensure delivery of the curriculum in a standardised manner, making it possible to compare trainees, training centres and regions in terms of both the skills accrued and the costs entailed in achieving them.
10.2 Surgical Competence, Proficiency and Certification

In 1999, the Institute of Medicine report, "To Err Is Human", raised awareness of the significant number of medical errors committed, together with deficiencies in the evaluation of performance and competence within the medical profession [3]. The report suggested that an infrastructure to support the assessment of competence needed to be developed. Effective since July 2002, the Accreditation Council for Graduate Medical Education (ACGME) has listed six categories of competence, defined as the ACGME Outcomes Project (Table 10.1) [4]. Similarly, the American Board of Medical Specialties (ABMS) developed a set of criteria that defines competence in medicine. The description of general competency involves six components: patient care, medical knowledge, practice-based learning, interpersonal and communication skills, professionalism and systems-based practice. Based, to some extent, on these criteria, the ABMS and ACGME issued a joint statement in 2001 on surgical competences and, furthermore, on the need for maintenance of certification. The need for a standardised definition of these terms is paramount, and led in July 2001 to an international consensus conference to establish the definitions of the terms to be used when assessing technical skills. A great deal of the discussion was based on the definitions of the terms competence, proficiency and expertise described by Dreyfus and Dreyfus in 1986 [5].

Table 10.1 Maintenance of certification
Evidence of professional standing
Evidence of lifelong learning
Evidence of cognitive expertise
Evidence of practice performance

10.3 Five Steps from Novice to Expert

It is generally accepted that most teaching is concerned with bringing an individual up to a level of "competence" in their discipline. However, it is clear that some people go beyond this level to achieve expertise. How is expertise defined and, more importantly, can it be taught at all? Dreyfus and Dreyfus (1986), drawing on their different perspectives as computer scientist and philosopher, developed a five-stage theory of skill acquisition from novice through to expert. This was based upon acquiring skill through instruction and experience, with changes in task perception and mode of decision-making as skills improve. The five stages were defined as novice, advanced beginner, competent, proficient and expert (Table 10.2).

Table 10.2 Five stages of skill acquisition

Skill level         Components                     Perspective   Decision
Novice              Context-free                   None          Analytical
Advanced beginner   Context-free and situational   None          Analytical
Competent           Context-free and situational   Chosen        Analytical
Proficient          Context-free and situational   Experienced   Analytical
Expert              Context-free and situational   Experienced   Intuitive

During the first stage of skill acquisition, the novice learns to recognise facts and features relevant to the skill and acquires rules for determining actions based upon those facts and features. Relevant elements of the situation are
clearly identifiable without reference to the overall situation, i.e. context-free. For example, the novice laparoscopic camera holder is told to keep the surgeon’s working instrument in the middle of the picture at all times. This rule ignores the context of the operative procedure, and the beginner is not taught that in certain situations, it may be appropriate to violate that rule. The novice camera holder wishes to please the laparoscopic surgeon, and judges his performance by the number of times he is told to change the camera position. However, there is no coherent sense of the overall task. Having acquired a few more rules, performance of the task requires extensive concentration, with inability to talk or listen to advice from the senior surgeon. The rules enable safe acquisition of experience, though they must then be discarded. Performance improves when the surgeon has acquired considerable practical experience in coping with real-life situations. The advanced beginner can begin to recognise meaningful elements when they are present because of a perceived similarity with prior examples. These new elements are referred to as “situational” rather than “context-free”, though they are difficult to define per se. For example, the laparoscopic trainee can be taught to halt bleeding with diathermy to the blood vessel. This depends on the size and location of the vessel, though diathermy may cause more harm than good in some cases. It is, thus, the experiences that are important, rather than the presence or absence of concrete rules. With greater experience, the number of recognisable context-free and situational elements present in real-world circumstances eventually becomes overwhelming. A sense of what is important is missing. It is, thus, necessary to acquire a hierarchical sense of decision-making by choosing a plan to organise the situation, and then by examining specific factors within that plan, be able to attend to them as appropriate. 
Thus, competence is based upon having a goal in mind and attending to the most important facts to achieve that goal. The importance of certain facts may be dependent on the presence or absence of other facts. The competent surgeon may no longer dissect the colon in a pre-defined manner, but rather move from one side of the organ to the other as appropriate, to ensure safe and steady progress of the operative procedure. In order to perform at the competent level, the surgeon must choose an organising plan which is related to the environmental conditions, i.e. deliberate planning. Up to this level, the learner has made conscious choices of both goals and decisions after reflecting
upon various alternatives. With experience comes proficiency, enabling an individual to base future actions on past similar situations, with anticipation of the eventual outcomes. Intuition or know-how is based upon seeing similarities with previous experiences, though the proficient performer still thinks analytically about what to do. For example, a surgeon will notice on a ward round that a post-operative patient looks unwell and queries whether the bowel anastomosis has leaked. With the help of a series of tests together with intuition, or knowhow, the surgeon can decide upon whether to perform a re-operation and repair of the anastomotic leak. Expertise is based upon mature and practiced understanding, enabling the individual to know what to do. The expert does not need to follow rules or deconstruct the situation into individual facts, but instead “sees” the whole picture at first glance. For instance, an expert surgeon will very quickly decide to convert a laparoscopic to an open procedure due to anticipated difficulties with the case. A more junior surgeon will tend to struggle with the laparoscopic approach, with a greater possibility of causing injury, lengthening operative time and, overall, leading to a poorer operative outcome. It may be said that the expert surgeon has a “vision” of what is possible, and perhaps more importantly, what is not possible. Whilst most expert performance is ongoing and non-reflective, there are situations when time permits and outcomes are crucial, during which an expert will deliberate before acting. This may occur in a discussion with other experts, during a novel or unforeseen event, or when the environmental conditions are altered. 
Overall, there is a progression through these five stages of skills acquisition from the analytical behaviour of the detached subject, consciously decomposing the environment into recognisable elements and following abstract rules, to involved skill behaviour based on an accumulation of concrete experiences and the unconscious recognition of new situations as similar to whole remembered ones.
10.4 Taxonomy for Surgical Performance In order to define and measure the development of surgical expertise, it is necessary to develop a structured framework upon which this can be based. In July 2001,
Satava et al. convened an international workshop to enable standardisation of definitions, measurements and criteria relevant to the objective assessment of surgical skill [6]. A hierarchical approach to surgical practice was proposed:

Ability: the natural state or condition of being capable; aptitude
Skill: a developed proficiency or dexterity in some art, craft or the like
Task: a piece of work to be done; a difficult or tedious undertaking
Procedure: a series of steps taken to accomplish an end

Using a surgical example, psychomotor ability, or aptitude, is defined as one's natural performance when operating on a two-dimensional screen whilst interacting with a three-dimensional space, i.e. laparoscopic surgery. With training, it is possible to build on these abilities to develop skills such as instrument handling, suturing and knot-tying. A task is considered to be part of a procedure, for example, performing a sutured anastomosis; this is not procedure-specific. Finally, a procedure is an operation to be carried out, for example, a laparoscopic adrenalectomy.
10.5 Assessment of Technical Skills in Surgical Disciplines

In 1991, the Society of American Gastrointestinal and Endoscopic Surgeons (SAGES) required surgeons to demonstrate competency before performing a laparoscopic procedure [7]. Competency was based on the number of procedures performed and the time taken, or on evaluation of the trainee by senior surgeons. Such criteria are known to be crude and indirect measures of technical skill, or to suffer from the influence of subjectivity and bias. Professional organisations have more recently recognised the need to assess surgical performance objectively. For any method of skill assessment to be used with confidence, it must be feasible, valid and reliable [2]. Feasibility is difficult to define, as it depends upon the tool to be used, its cost, size, space requirements, transportability, availability, need for maintenance and acceptability to subjects and credentialing committees. Validity is defined as "the property of being true, correct and conforming with reality", with reference to
the concept of whether a test measures what it purports to measure. Face validity refers to whether the model resembles the task it is based upon, and content validity considers the extent to which the model measures surgical skill and not simply anatomical knowledge. Construct validity is a test of whether the model can differentiate between different levels of experience. Concurrent validity compares the test to the current “gold standard”, and predictive validity determines whether the test corresponds to actual performance in the operating theatre. Reliability is a measure of the precision of a test and supposes that results for a test repeated on two separate occasions, with no learning between the two tests, will be identical. It is measured as a ratio from 0 to 1.0, a test with reliability of 0–0.5 being of little use, 0.5–0.8 being moderately reliable, and over 0.8 being the most useful. This is known as the test–retest reliability, though the term inter-rater reliability is also important. This is a measure of the extent of agreement between two or more observers when rating the performance of an individual, for example, during video observation of a surgical procedure. Current measures for objective assessment of technical skill consist of dexterity analysis and video-based assessment. These measures can also aid structured progression during training, together with identification of trainees who require remedial action.
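The reliability bands quoted above can be made concrete with a small sketch. This is an illustrative assumption, not the chapter's method: the rater scores below are invented, and a simple Pearson correlation stands in for the coefficient (formal studies more often report intraclass correlation or Cronbach's alpha).

```python
# Illustrative sketch: a reliability coefficient between two paired sets of
# scores (two raters, or the same test on two occasions), classified using
# the bands quoted in the text (<0.5 of little use, 0.5-0.8 moderately
# reliable, >0.8 most useful). All scores are invented.
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two paired score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def interpret_reliability(r):
    """Apply the usefulness bands described in the text."""
    if r > 0.8:
        return "most useful"
    if r >= 0.5:
        return "moderately reliable"
    return "of little use"

# Hypothetical global ratings from two raters for six trainees
rater_a = [12, 18, 25, 30, 22, 28]
rater_b = [14, 17, 27, 31, 20, 29]
r = pearson_r(rater_a, rater_b)
```

A coefficient near 1.0 here would indicate strong inter-rater agreement; the same function applied to one rater's scores on two occasions gives a test-retest estimate.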
10.6 Dexterity Analysis in Surgery

Laparoscopic surgery lends itself particularly well to motion analysis, as hand movements are confined to the limited movements of the instruments [8]. Smith et al. connected laparoscopic forceps to sensors to map their position in space, and relayed movements of the instruments to a personal computer. This enabled calculation of the instrument's total path length, which was compared with the minimum path length required to complete the task. The Imperial College Surgical Assessment Device (ICSAD) has sensors placed on the back of the surgeon's hands (Fig. 10.1). A commercially available device (Isotrack II™; Polhemus, VT) emits electromagnetic waves to track the position of the sensors in the x, y and z axes 20 times per second. The device runs from a standard laptop computer, and the data are analysed in terms of time taken, distance travelled and total number
Fig. 10.1 The Imperial College Surgical Assessment Device (ICSAD)
of movements for each hand. Previous studies have confirmed the construct validity of the ICSAD as a surgical assessment device for open and laparoscopic procedures, both for simple tasks and for real procedures such as laparoscopic cholecystectomy [9]. Experienced laparoscopic surgeons made significantly fewer movements than occasional laparoscopists, who in turn were better than novices in the field. The ICSAD has also been shown to assess objectively the acquisition of psychomotor skill by trainees attending laparoscopic training courses. The Advanced Dundee Endoscopic Psychomotor Tester (ADEPT) is another computer-controlled device, consisting of a static dome enclosing a defined workspace, with two standard laparoscopic graspers mounted on a gimbal mechanism [10]. Within the dome is a target plate containing innate tasks, overlaid by a spring-mounted perspex sheet with apertures of varying shapes and sizes. A standard laparoscope relays the image to a video monitor. Each task involves manipulation of the top plate with one instrument, enabling the other instrument to negotiate the task on the back plate through the access hole. The system registers time taken, successful task completion, angular path length and instrument error score (a measure of instrument contact with the sides of the front plate holes). Experienced surgeons exhibit significantly lower instrument error rates than trainees on the ADEPT system. Comparison of performance on ADEPT also correlated well with a blinded assessment of clinical competence, a measure of concurrent validity. Test-retest reliability of the system produced positive correlations for all variables when performance across two consecutive test sessions was compared.
These three methods of assessing dexterity enable objective assessment of surgical technical skill, but only the ICSAD device can be used to assess real operations. However, in this case it is important to know whether the movements made are purposeful. For example, the common bile duct may be injured during a laparoscopic cholecystectomy, and dexterity analysis alone cannot record this potentially disastrous error. To confirm surgical proficiency, it is necessary to analyse the context in which these movements are made.
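The summary metrics described in this section (time taken, distance travelled and number of movements derived from 20 Hz position samples) can be sketched as follows. This is an illustrative assumption, not ICSAD's published algorithm: in particular, the speed threshold used to count a "movement" is a hypothetical choice.

```python
# Illustrative sketch: derive dexterity metrics from (x, y, z) positions
# for one hand sampled at 20 Hz. The speed threshold defining a discrete
# "movement" is hypothetical; the real device's definition may differ.
from math import dist  # Euclidean distance between points (Python 3.8+)

SAMPLE_RATE_HZ = 20.0

def motion_metrics(positions, speed_threshold=0.05):
    """positions: list of (x, y, z) tuples sampled at 20 Hz."""
    steps = [dist(a, b) for a, b in zip(positions, positions[1:])]
    time_taken = len(positions) / SAMPLE_RATE_HZ   # seconds
    path_length = sum(steps)                        # same units as input
    # Count one movement each time the per-sample speed rises above threshold.
    movements, moving = 0, False
    for step in steps:
        speed = step * SAMPLE_RATE_HZ
        if speed > speed_threshold and not moving:
            movements += 1
        moving = speed > speed_threshold
    return {"time_s": time_taken,
            "path_length": path_length,
            "movements": movements}
```

Comparing a trainee's path length against the minimum required for the task, as in Smith et al.'s approach, is then a simple ratio of two such path lengths.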
10.7 Video-Based Assessment in Surgery

During the introduction of laparoscopic cholecystectomy, SAGES and the European Association for Endoscopic Surgery (EAES) advocated proctoring of beginners by senior surgeons before awarding privileges in laparoscopic surgery. A single assessment is open to subjectivity and bias, although additional criteria can improve reliability and validity. An example of this is the Objective Structured Clinical Examination (OSCE), a method of assessing the clinical skills of history taking, physical examination and patient-doctor communication. Martin et al. developed a similar approach to the assessment of operative skill, the Objective Structured Assessment of Technical Skill (OSATS) [11] (Table 10.3). This involves six tasks in a bench format, with direct observation and assessment on a task-specific checklist, a seven-item global rating score and a pass/fail judgement. Twenty surgeons in training of varying experience performed
Table 10.3 The Objective Structured Assessment of Technical Skill (OSATS) global rating scale (descriptors anchor scores 1, 3 and 5 on each five-point item)

Respect for tissue
  1 – Frequently used unnecessary force on tissue or caused damage by inappropriate use of instruments
  3 – Careful handling of tissue, but occasionally caused inadvertent damage
  5 – Consistently handled tissues appropriately with minimal damage

Time and motion
  1 – Many unnecessary moves
  3 – Efficient time/motion but some unnecessary moves
  5 – Economy of movement and maximum efficiency

Instrument handling
  1 – Repeatedly makes tentative or awkward moves with instruments
  3 – Competent use of instruments although occasionally appeared stiff or awkward
  5 – Fluid moves with instruments and no awkwardness

Knowledge of instruments
  1 – Frequently asked for the wrong instrument or used an inappropriate instrument
  3 – Knew the names of most instruments and used the appropriate instrument for the task
  5 – Obviously familiar with the instruments required and their names

Use of assistants
  1 – Consistently placed assistants poorly or failed to use assistants
  3 – Good use of assistants most of the time
  5 – Strategically used assistants to the best advantage at all times

Flow of operation and forward planning
  1 – Frequently stopped operating or needed to discuss next move
  3 – Demonstrated ability for forward planning with steady progression of operative procedure
  5 – Obviously planned course of operation with effortless flow from one move to the next

Knowledge of specific procedure
  1 – Deficient knowledge; needed specific instruction at most operative steps
  3 – Knew all important aspects of the operation
  5 – Demonstrated familiarity with all aspects of the operation
equivalent open surgical tasks on the bench format and on live anaesthetised animals. There was excellent correlation between assessment on the bench and live models, although test–retest and inter-rater reliabilities were higher for global scores, making them a more reliable and valid measurement tool. However, a global rating scale is generic and may ignore important steps of a particular operation. Eubanks et al. developed a procedure-specific scale for laparoscopic cholecystectomy with scores weighted for completion of tasks and occurrence of errors [12]. For example, liver injury with bleeding scored 5, whereas common bile duct injury scored 100. Three observers rated 30 laparoscopic cholecystectomies performed by trainees and consultant surgeons. Correlation between observers for final scores was good, although correlation between final score and years of experience was only moderate. A similar approach identified errors made by eight surgical registrars undertaking a total of 20 laparoscopic cholecystectomies. The procedure was broken down into ten steps such as “dissect and expose cystic structures” and “detach gallbladder from liver bed”. Errors were scored in two categories: inter-step (procedural) errors involved omission or rearrangement of correctly
undertaken steps, and intra-step (execution) errors involved failure to execute an individual step correctly. There was a total of 189 separate errors, of which 73 (38.6%) were inter-step and 116 (61.4%) intra-step. However, only 9% of the inter-step errors required corrective action, compared with 28% of intra-step errors. All of the above rating scales are complex and time-consuming; for example, the assessment of 20 surgical trainees on the OSATS required 48 examiners for 3 hours each. Furthermore, the scales are open to human error and are not entirely free of subjectivity. To achieve instant, objective feedback on a surgeon's technical skills, virtual reality simulation may be more useful.
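The procedure-specific weighted error scoring described above can be illustrated with a small sketch. Only the two weights quoted in the text (liver injury with bleeding scoring 5, common bile duct injury scoring 100) come from Eubanks et al.'s scale; the remaining event names and weights below are invented for illustration.

```python
# Illustrative sketch of a procedure-specific weighted error score: each
# observed error event carries a weight, with heavier weights marking more
# serious errors. Two weights are quoted in the text; the rest are invented.
ERROR_WEIGHTS = {
    "liver_injury_with_bleeding": 5,    # weight quoted in the text
    "common_bile_duct_injury": 100,     # weight quoted in the text
    "dropped_clip": 2,                  # hypothetical weight
    "gallbladder_perforation": 10,      # hypothetical weight
}

def error_score(observed_events):
    """Sum the weights of all observed error events for one procedure."""
    return sum(ERROR_WEIGHTS.get(event, 0) for event in observed_events)

# One hypothetical video review of a laparoscopic cholecystectomy:
events = ["dropped_clip", "liver_injury_with_bleeding", "dropped_clip"]
score = error_score(events)
```

The design choice is that a single catastrophic error (such as a bile duct injury) dominates the total, so two procedures with the same score are not necessarily equivalent; reviewers would still inspect the event list itself.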
10.8 Virtual Reality Simulators as Assessment Devices

The term virtual reality refers to "a computer-generated representation of an environment that allows sensory interaction, thereby giving the impression of actually being present".
Fig. 10.2 The minimally invasive surgical trainer – virtual reality (MIST-VR)
The MIST-VR laparoscopic simulator (Mentice, Gothenburg, Sweden) comprises two standard laparoscopic instruments held together on a frame with position-sensing gimbals (Fig. 10.2). These are linked to a Pentium personal computer (Intel, Santa Clara, CA) and movements of the instruments are relayed in real time to a computer monitor. Targets appear randomly on the screen and are “grasped” or “manipulated”, with performance measured by time, error rate and economy of movement for each hand. The LapSim (Surgical Science, Gothenburg, Sweden) laparoscopic trainer has tasks that are more realistic than those of the MIST-VR, involving structures that are deformable and may bleed. The Xitact LS500 (Xitact, Morges, Switzerland) laparoscopy simulator comprises tasks such as dissection, clip application and tissue separation, the integration of which can produce a procedural trainer. It differs from the MIST-VR and LapSim in that it incorporates a physical object, the “virtual abdomen”, with force feedback. Other newer simulators include the ProMIS Surgical Simulator (Haptica, Dublin, Ireland) and LapMentorTM (Simbionix, Cleveland, OH). The MIST-VR simulator has tasks that are abstract in nature, enabling the acquisition of psychomotor skill rather than cognitive knowledge. This enables the simulator to be used in a multi-disciplinary manner to teach the basic skills required for all forms of minimally invasive surgery. However, newer simulators
have augmented their basic skills programmes to incorporate parts of real procedures, allowing trainees to learn techniques they would use in the operating theatre. For example, the LapSim has a module for dissection of Calot’s triangle, and the most recently launched LapMentor simulator enables the trainee to perform a complete laparoscopic cholecystectomy with the benefit of force feedback. Although the task-based simulators are more advanced in terms of software, they are bulkier and more expensive. Using these simulators, trainees can practise standardised laparoscopic tasks repeatedly, with instant objective feedback of performance. The simulators are portable, use standard computer equipment and are available commercially. With graded exercises at different skill levels, they can be used as the basis for a structured training programme. The feedback obtained also enables comparisons to be made between training sessions and trainees. Studies to confirm the role of virtual reality simulators as assessment devices have concentrated on the demonstration of construct validity, with experienced surgeons completing the tasks on the MIST-VR significantly faster, with lower error rates and greater economy of movement scores. A direct comparison of performance is possible as all surgeons complete exactly the same task, without the effects of patient variability or disease severity. The tasks can be carried out at any time and further processing is not required
to produce a test score. This can lead to the development of criterion scores that have to be achieved before operating on real patients. At the American College of Surgeons' meeting in 2001, Gallagher et al. described the performance of 210 experienced laparoscopic surgeons on two trials of the MIST-VR [13]. The aim was to benchmark the performance of these surgeons to confirm the simulator's future use as an assessment tool. The results revealed marked variability in the scores obtained, together with a significant learning effect between trials. For such data to be used in high-stakes assessments, a pool of expert scores from all centres currently using virtual reality simulation might lead to the development of an international benchmark for trainee surgeons. Furthermore, as some trainees take longer than others to achieve pre-defined levels of proficiency, this may enable particularly gifted trainees to be fast-tracked into an advanced laparoscopic programme, and the true development of a competency-based, rather than a time-based, curriculum.
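The idea of pooling expert scores into a proficiency benchmark can be sketched as follows. The "expert mean plus two standard deviations" criterion and the requirement for consecutive passing trials are conventions assumed here for illustration, not criteria taken from this chapter, and the expert times are invented.

```python
# Illustrative sketch (invented data): derive a proficiency criterion from
# pooled expert simulator scores, then check whether a trainee has met it.
# Lower scores (e.g. seconds, or error counts) are assumed to be better.
from statistics import mean, stdev

def proficiency_criterion(expert_scores, k=2):
    """Benchmark = expert mean + k sample standard deviations."""
    return mean(expert_scores) + k * stdev(expert_scores)

def has_reached_proficiency(trainee_scores, criterion, consecutive=2):
    """Require `consecutive` successive trials at or below the criterion."""
    run = 0
    for score in trainee_scores:
        run = run + 1 if score <= criterion else 0
        if run >= consecutive:
            return True
    return False

expert_times = [61, 58, 65, 70, 62]   # hypothetical expert task times (s)
cutoff = proficiency_criterion(expert_times)
```

Under such a scheme, trainees train to the benchmark rather than for a fixed number of sessions, which is the essence of the competency-based (rather than time-based) curriculum the text describes.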
10.9 Comparison of Assessment Tools

Currently there is no consensus regarding the optimal assessment tool for laparoscopic procedures, and perhaps video-based and dexterity systems should be used in conjunction. The authors' department has recently developed new software to enable the ICSAD trace to be viewed together with a video of the procedure, leading to a dexterity-based video analysis system. This still requires an investment of time to assess the procedure on a rating scale, but it may be possible to identify areas of poor dexterity and to concentrate on video-based assessment of these areas alone.

10.10 Beyond Technical Skill

Traditionally, measures of performance in the operating theatre have concentrated on assessing the skill of the surgeon alone, and more specifically, technical proficiency. This has been done by assessment of time taken to complete the operation and, more recently, with the use of rating scales and motion analysis systems. However, technical ability is only one of the skills required to perform a successful operation, with teamwork, communication, judgement and leadership underpinning the development of surgical competence [14]. High reliability organisations such as aviation, the military and the nuclear industry have noted the importance of a wide variety of factors in the development of a favourable outcome. These include ergonomic factors, such as the quality of interface design, team coordination and leadership, organisational culture and quality of decision making. In a surgical context, the application of a systems approach can lead to the identification of possible sources of error which are not immediately apparent. These may include the use of inappropriately designed instruments, an untrained team member, repeated interruptions by ward staff or a tired surgeon. The development of a human factors approach has led to safer performance in industry, and it is important to address these issues in the operating theatre.

10.11 A Systems Approach to Surgical Safety

The systems approach to understanding the surgical process and its outcomes has important implications for error reduction. This approach accepts that humans are fallible and that errors are to be expected, even in the best organisations. Countermeasures are based upon building defences to trap errors, and mitigating their effects should one occur. This consists of altering the attitudes between different individuals and modifying the behavioural norms that have become established in these work settings. An example is the specification of theatre lists for training junior surgeons, ensuring that fewer cases are booked, thereby reducing the pressure on both the surgeon and the rest of the team to complete all procedures in the allocated time.
10.12 The Simulated Operating Theatre

At our centre, we have developed a simulated operating theatre to pilot comprehensive training and assessment for surgical specialists (Fig. 10.3). This consists of a replicated operating theatre environment and an adjacent control room, separated by one-way viewing glass. In
Fig. 10.3 The simulated operating theatre
the operating theatre is a standard operating table, diathermy and suction machines, trolleys containing suture equipment and surgical instruments, and operating room lights. A moderate-fidelity anaesthetic simulator (SimMan, Laerdal) consists of a mannequin which lies on the operating table and is controlled by a desktop computer in the control room. This enables the creation of a number of scenarios such as laryngospasm, hypoxia, and cardiac arrhythmias. A further trolley is available, containing standard anaesthetic equipment, tubes and drugs. The complete surgical team is present, consisting of an anaesthetist, anaesthetic nurse, primary surgeon, surgeon's assistant, scrub nurse and circulating nurse. Interactions between these individuals are recorded using four ceiling-mounted cameras and unobtrusively placed microphones. The multiple streams of audio and video data, together with the trace on the anaesthetic monitor, are fed into a clinical data recording (CDR) device. This enables those present in the control room to view the data in real time and for recordings to be made for debriefing sessions. In 1975, Spencer remarked that a skilfully performed operation is 75% decision making and only 25% dexterity. Decision-making and other non-technical skills are not formally taught in the surgical curriculum, but are acquired over time. In an analogous manner, it should be possible to use the simulated operating theatre environment to train and assess the performance of surgeons at skills such as team interaction and communication. This situation will also allow surgeons to benefit from feedback, by understanding the nature and effect of their mistakes, and to learn from them. In a preliminary study, 25 surgeons of varying grades completed part of a standard varicose vein
operation on a synthetic model (Limbs & Things, Bristol), which was placed over the right groin of the anaesthetic simulator [15]. The complete surgical team was present, the mannequin draped as for a real procedure, and standard surgical instruments available to the operating surgeon. Video-based, blinded assessment of technical skills discriminated between surgeons according to experience, though their team skills, measured by two expert observers on a global rating scale, failed to show any similar differences. Many subjects did not achieve competency levels for pre-procedure preparation (90%), vigilance (56%), team interaction (27%) and communication (24%). Furthermore, only two trainees positioned the patient pre-operatively, and none waited for a swab/instrument check prior to closure. Feedback responses from the participants were good, with 90% of them agreeing that the simulation was a realistic representation of an operating theatre, and 88% advocating this as a good environment for training in team skills. The greatest benefit of simulator training in aviation and anaesthetics has been for training during crisis scenarios. The varicose vein procedure was subsequently modified to include a bleeding scenario – a 5 mm incision was made in the "femoral vein" of the model and connected to a tube which was, in turn, connected to a drip bag containing simulated blood. This was controlled with a three-way tap. A further group of 10 junior and 10 senior surgical trainees were recruited to the study [16]. The simulation was run as before except that at a standardised point, the tap was opened. The trainees' technical ability to control the bleeding, together with their team skills, was assessed in a blinded manner by three surgeons and one human factors expert. Once again, seniors scored higher than juniors for technical skills, though there were no differences in human factors skills such as the time taken to inform the team of the crisis.
A majority of the participants found the model, simulated operating theatre and bleeding scenario to be realistic, with over 80% of them considering the crisis to be suitable for assessment and training of both technical and team skills. These studies have demonstrated the face validity of a novel crisis simulation in the simulated operating theatre, and describe how they can be used to assess the technical and non-technical performance of surgical trainees. Recent work has also introduced the notion of a ceiling effect in technical skills performance, with there
being little difference between the performance of senior trainees and consultants on bench-top models. This may be due to the limited sensitivity of the tools used to assess technical skill, or to the fact that most senior trainees have acquired the necessary technical skills required to operate competently, such that progression to expert performance is then dependent upon non-technical skills such as decision making, knowledge and judgement.
10.13 Curriculum Development

The aim of a surgical residency programme is to produce competent professionals, displaying the cognitive, technical and personal skills required to meet the needs of society. However, many surgeons are concerned that this will not be possible with the limitations placed upon work hours and the potential reduction in training opportunities by up to 50%. Furthermore, surgeons are also faced with increased public and political pressures to achieve defined levels of competence prior to independent practice. Solutions to the work hours and competency issues require the formalisation of training programmes, whereby training occurs within a pre-defined curriculum to teach the skills required. Assessment of skill is performed at regular intervals using reliable and valid measurement tools. Successful completion of the assessment enables the trainee to progress onto the next stage of the programme, though failure will necessitate repetition of the completed block of training. This describes the development of a competency-based curriculum, enabling surgeons to acquire skill in a logical, stepwise approach. Training at each step of the curriculum must begin in the skills laboratory, utilizing tools such as synthetic models, animal tissue and virtual reality simulation. Sessions should be geared to the level of the trainees, with adequate faculty available as trainers and mentors. Occurrence of these sessions should not be related solely to the schedules of senior surgeons, but must be aligned with educational theories of learning. Furthermore, these sessions must be an integral and compulsory part of the trainees' timetable. Junior trainees should probably attend once a week, though more senior trainees may require fewer, more focused sessions. This should be coupled with adequate training time in the operating theatre, preferably with the same
R. Aggarwal and L. A. Darzi
mentor as in the skills laboratory. Division of operating schedules primarily into service or training purposes can achieve a balance between the needs of the surgeon and the health authority. Built into the development of a staged curriculum is the objective assessment of competence, and present methods have concentrated purely upon technical skill. Video-based and dexterity measurement systems have been extensively validated in the surgical literature, although none are currently used as a routine assessment of surgical trainees. This is primarily because of time and cost limitations of the assessment tool, though virtual reality simulation has been shown to be a promising tool for instant, objective assessment of laparoscopic performance. This simulation has further benefits as assessment occurs on standardised models in a quiet and non-pressured environment. The development of a simulated operating theatre environment, analogous to those used by the military for training in conflict situations, can for the first time assess and train surgeons in a realistic environment. The presence of the entire surgical team can enable assessment of interpersonal skill, and the introduction of crisis scenarios leads to an evaluation of the surgeon's knowledge and judgement. This should be integrated into the end of each training period, culminating in a crisis assessment to confirm the development of competence prior to progressing onto the next stage of the programme. This stepwise, competence-based curriculum can lead to the development of technically skilled surgeons, and it acknowledges the importance of lifelong learning. It refocuses the emphasis on what is learned as opposed to how many hours are spent in the hospital environment. It can also provide trainees with the choice of leaving the programme at an earlier stage in order to concentrate upon a generalist surgical practice, and also provide a purely service-orientated commitment to the health service.
Surgeons are also able to take a career break, and on re-entry, must achieve competence prior to progressing further along their training pathway. The organisation of these surgical training programmes should occur as part of a national project, with agreed definitions of training schedules and competency assessments. It is imperative that the programmes are evidence-based and are regularly audited to ensure maintenance of standards of excellence. This will make the programmes accountable and strengthen the case for health care providers to resource them.
10.14 Innovative Research for Surgical Skills Assessment

In order to move beyond measurement of the technical skills of the surgeon, two promising technologies for elucidating higher-level processes of surgical performance are described here.
10.14.1 Eye-Tracking Technologies

To get closer to the thought processes of the surgeon, eye-tracking technologies have also been exploited in order to measure surgical skill [17]. Initial studies used remote eye-trackers to follow radiologists reading a CT scan of the thorax – the differences in scan patterns between junior and senior radiologists highlighted the possible application of such a tool during minimally invasive procedures. A comparison of tool and eye-tracking during a laparoscopic task revealed the need for inexperienced subjects to fixate on the tip of the laparoscopic instrument, whereas experienced subjects fixated upon the target. The inference is that experienced subjects possess the hand–eye coordination to know the position of their tool without having to fixate their eyes on it, and can, thus, perform the task in a more efficient manner. A pilot study of eye-tracking during real laparoscopic cases provides further insight into surgical performance. During a laparoscopic cholecystectomy, fixation points were most stable for the principal surgical steps of dissection of Calot's triangle and clip/cut of the cystic structures. A signature pattern is produced which can be analysed and used to serve as both a teaching and an assessment tool for all surgeons. The intention is to develop signature patterns during surgical manoeuvres which can enable the diagnosis of whether an individual is competent to perform a particular task or procedure.
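Fixation-based analyses of this kind rest on first segmenting raw gaze samples into fixations. As a purely illustrative sketch (not the algorithm used in the studies cited above), a dispersion-threshold detector can be written in a few lines; the pixel thresholds and the (x, y) sample format are assumptions for the example:

```python
# A minimal dispersion-threshold (I-DT) fixation detector, sketched as an
# illustration of how raw gaze samples might be reduced to fixations; the
# thresholds and (x, y) pixel gaze format are assumptions, not the
# protocol of the studies described above.

def detect_fixations(gaze, max_dispersion=25.0, min_samples=5):
    """Group consecutive gaze samples into fixations.

    gaze           : list of (x, y) screen coordinates, one per sample
    max_dispersion : max (x-range + y-range) within one fixation, in pixels
    min_samples    : minimum number of samples to count as a fixation
    Returns a list of fixation centroids (x, y).
    """
    def dispersion(window):
        xs = [p[0] for p in window]
        ys = [p[1] for p in window]
        return (max(xs) - min(xs)) + (max(ys) - min(ys))

    fixations, i = [], 0
    while i < len(gaze):
        j = i + min_samples
        if j <= len(gaze) and dispersion(gaze[i:j]) <= max_dispersion:
            # Grow the window while the points stay tightly clustered.
            while j < len(gaze) and dispersion(gaze[i:j + 1]) <= max_dispersion:
                j += 1
            window = gaze[i:j]
            cx = sum(p[0] for p in window) / len(window)
            cy = sum(p[1] for p in window) / len(window)
            fixations.append((cx, cy))
            i = j
        else:
            i += 1
    return fixations
```

On such a reduction, one could then compare how long fixations dwell on an instrument tip versus the target, the contrast reported between novices and experts.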
10.14.2 Functional Neuro-Imaging Technologies

Though eye- and motion-tracking technologies are promising tools for surgical skills assessment, it is believed that further information may be gleaned from within the processing capabilities of the surgeon – within their brains.
Functional neuro-imaging techniques have recently been used to enhance the understanding of motor skills acquisition for tasks such as piano playing, apple-coring and non-surgical knot tying. Studies have demonstrated that the areas of the brain that are activated when performing a motor task vary according to the level of expertise. Furthermore, it appears that brain activation is dynamic and varies according to the phase of motor skills acquisition, the so-called "neuroplasticity". For example, several studies have demonstrated that the prefrontal cortex plays a crucial role in early learning, since activation in this region wanes as performance becomes more automated. This has prompted investment in a NIRS (near infra-red spectroscopy) functional neuro-imaging machine for the purposes of surgical skills training and assessment [18]. The system works by shining near infra-red light onto the scalp; the light is absorbed and scattered by the tissues inside the head, and the emerging signal is recorded at a set of different scalp sites using low-light-sensitive photo-detectors. The differential absorption spectra of oxygenated and de-oxygenated haemoglobin provide an indirect measure of cortical perfusion and can be related in real time to a surgical task. Preliminary studies to date have reproduced the results seen in piano players, whereby experienced subjects exhibited decreased frontal cortical activation when compared with novices.
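The conversion from differential light absorption to haemoglobin changes described above is commonly done with the modified Beer–Lambert law. The following sketch is illustrative only: the extinction coefficients, wavelengths, source–detector separation and differential pathlength factor are rough assumed values, not parameters of the system described in the text:

```python
import numpy as np

# Illustrative extinction coefficients (1/(mM*cm)) for oxygenated (HbO2)
# and de-oxygenated (HHb) haemoglobin at two near-infrared wavelengths;
# these are rough assumed figures, not calibrated instrument values.
EXTINCTION = np.array([
    [0.95, 1.80],   # 760 nm: [eps_HbO2, eps_HHb]
    [2.40, 1.60],   # 850 nm
])

def mbll_concentration_changes(delta_od, path_cm=3.0, dpf=6.0):
    """Modified Beer-Lambert law: recover changes in oxygenated and
    de-oxygenated haemoglobin (mM) from optical-density changes
    measured at two wavelengths.

    delta_od : sequence of length 2 -- Delta OD at [760 nm, 850 nm]
    path_cm  : source-detector separation on the scalp (cm)
    dpf      : differential pathlength factor (dimensionless)
    """
    # Delta OD = eps * Delta c * d * DPF (per wavelength), so solve the
    # 2x2 linear system for [Delta HbO2, Delta HHb].
    return np.linalg.solve(EXTINCTION * path_cm * dpf, np.asarray(delta_od))

# Example: a rise in OD at 850 nm with a smaller change at 760 nm
# resolves mainly into an increase in oxygenated haemoglobin.
d_hbo2, d_hhb = mbll_concentration_changes([0.01, 0.02])
```

This two-wavelength inversion is what lets the differential absorption spectra serve as an indirect, real-time index of cortical perfusion during a task.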
10.15 Conclusions

The current climate of medical education and health care delivery is being transformed at a rapid rate. Gone are the days of residency programmes focused on unlimited hours, patient service and learning through osmosis. Dissatisfaction and drop-out rates of residents are also on the increase, together with profound changes in the demographics of student admissions to medical schools. There is a shift in the philosophy of medical education not only to incorporate lifestyles, but to deliver a competency-based system of achievement and progression. In August 2005, the Foundation Programme was introduced in the UK for all newly qualified doctors. The aim was to develop specific and focused learning objectives, with built-in demonstration of clinical competence before progression onto specialist or general practice training. Each trainee completes a number of assessments,
leading to the development of a portfolio of clinical performance [19]. But what then of the implications for workforce needs and planning for the health service? Current trends in general surgery workforce data support a future deficit of surgeons that can be explained by an ageing population with increased surgical needs, a growing number of outpatient surgical procedures and increasing sub-specialisation within the field of general surgery. Other interventional specialties share similar experiences. A competency-based training curriculum has the potential to cause significant disruption to managing workforce aspects of health care delivery. Workforce numbers depend on credentialing for key procedures, and a failure to achieve milestones leads to repetition of the training period. A shortfall of clinicians may ensue, and strategies to manage this potential deficit have not been identified. The medical profession is at a crossroads over how to apply competency-based training and subsequent practice. Within the airline industry, regular testing of individuals is carried out to ensure competence. Failure to achieve performance goals results in a period of retraining and subsequent revalidation of skills. This concept does not exist within health care; in the UK, individuals who are found to under-perform are suspended from practice and investigated by the General Medical Council (GMC). This process involves a two-phased approach of data collection from the clinician's workplace through audit of outcomes and interview of colleagues, followed by a 1-day structured evaluation of performance. Failure leads to termination of one's registration with the GMC. It is not inconceivable that individuals could be retrained, and the analogy of a "boot camp" to redress the problem is an attractive idea.
Though it is important to consider how to manage this shift in medical practice to deliver competent doctors, the capture and delivery of financial resources are an integral concern in ensuring the sustainability of competency-based training and practice. Training budgets within the UK have been reduced, and service delivery has taken centre stage. It is also necessary to frame this debate in terms of accountability: investment into a competency-based programme may lead to improvements in patient safety through the maintenance and delivery of the highest standards of care. Doctors failing to meet these standards would no longer continue to practise regardless, perhaps with a reduction in morbidity and mortality and a concomitant financial gain.
Nursing, public health and family medicine have sought to define competency-based models and their application. A German study found that implementation of a competency-based graduate training programme within a neurology department during a 1-year pilot resulted in the motivation of learners and trainers, as long as adequate resources were provided, together with a training system for the trainers. Though funding is important, it is the last point that is crucial to delivery, i.e. training schemes, workshops or courses to "train the trainers". Not only does this lead to competency-based delivery of residency programmes, but it also fosters the development of incentives for lifelong learning and career growth. It is necessary for policy makers, medical educators, clinicians and credentialing bodies to develop a consensus on credentialing for the future of the medical profession. It must also be recognised that there is a knowledge gap, and research into tools that can not only objectively and accurately define performance but perhaps also predict future performance must continue to be funded. The resources required are substantial both financially and in human terms, but now is the time to seize the opportunity to ensure that a competency-based delivery of health care will become both an achievable and a successful endeavour [8].
References

1. Moorthy K, Munz Y, Sarker SK et al (2003) Objective assessment of technical skills in surgery. BMJ 327:1032–1037
2. Gallagher AG, Ritter EM, Satava RM (2003) Fundamental principles of validation, and reliability: rigorous science for the assessment of surgical education and training. Surg Endosc 17:1525–1529
3. Kohn LT, Corrigan JM, Donaldson MS (2000) To err is human: building a safer health system. National Academy Press, Washington
4. Pellegrini CA (2002) Invited commentary: the ACGME "Outcomes Project". American Council for Graduate Medical Education. Surgery 131:214–215
5. Dreyfus HL, Dreyfus SE (1986) Mind over machine. Free Press, New York
6. Satava RM, Cuschieri A, Hamdorf J (2003) Metrics for objective assessment. Surg Endosc 17:220–226
7. Society of American Gastrointestinal Surgeons (SAGES) (1991) Granting of privileges for laparoscopic general surgery. Am J Surg 161:324–325
8. Aggarwal R, Hance J, Darzi A (2004) Surgical education and training in the new millennium. Surg Endosc 18:1409–1410
9. Aggarwal R, Grantcharov T, Moorthy K et al (2007) An evaluation of the feasibility, validity, and reliability of laparoscopic skills assessment in the operating room. Ann Surg 245:992–999
10. Francis NK, Hanna GB, Cuschieri A (2002) The performance of master surgeons on the advanced Dundee endoscopic psychomotor tester: contrast validity study. Arch Surg 137:841–844
11. Martin JA, Regehr G, Reznick R et al (1997) Objective structured assessment of technical skill (OSATS) for surgical residents. Br J Surg 84:273–278
12. Eubanks TR, Clements RH, Pohl D et al (1999) An objective scoring system for laparoscopic cholecystectomy. J Am Coll Surg 189:566–574
13. Gallagher AG, Smith CD, Bowers SP et al (2003) Psychomotor skills assessment in practicing surgeons experienced in performing advanced laparoscopic procedures. J Am Coll Surg 197:479–488
14. Yule S, Flin R, Paterson-Brown S et al (2006) Non-technical skills for surgeons in the operating room: a review of the literature. Surgery 139:140–149
15. Moorthy K, Munz Y, Adams S et al (2005) A human factors analysis of technical and team skills among surgical trainees during procedural simulations in a simulated operating theatre. Ann Surg 242:631–639
16. Moorthy K, Munz Y, Forrest D et al (2006) Surgical crisis management skills training and assessment: a simulation-based approach to enhancing operating room performance. Ann Surg 244:139–147
17. Dempere-Marco L, Hu X-P, Yang G-Z (2003) Visual search in chest radiology: definition of reference anatomy for analysing visual search patterns. In: Proceedings of the Fourth Annual IEEE-EMBS Information Technology – Applications in Biomedicine 2003, ITAB 2003, Birmingham, UK
18. Leff DR, Aggarwal R, Deligani F et al (2006) Optical mapping of the frontal cortex during a surgical knot-tying task, a feasibility study. Lect Notes Comput Sci 4091:140–147
19. Poole A (2003) The implications of Modernising Medical Careers for specialist registrars. BMJ 326:s194
11 Health-Related Quality of Life and its Measurement in Surgery – Concepts and Methods

Jane M. Blazeby
Contents
11.1 Introduction
11.2 What is Quality of Life?
11.3 The Purpose of Measuring HRQL
11.4 How to Measure HRQL
11.4.1 Types of Instruments
11.4.2 Developing and Validating HRQL Instruments
11.5 Reporting Standards of HRQL in Randomized Controlled Trials and Other Research Settings
11.5.1 Choosing a HRQL Instrument for the Research
11.5.2 Determining the HRQL Sample Size
11.5.3 The Timing of HRQL Assessments
11.5.4 Missing Data
11.5.5 Dealing with Missing Data
11.5.6 Analyses of HRQL Data
11.6 The Future Role of HRQL in Evaluating Surgery
References

Abbreviations
HRQL  Health-related quality of life
QOL   Quality of life
PRO   Patient reported outcome
Abstract  Over the past decade, assessment of health-related quality of life (HRQL) has become a recognized end-point in randomized surgical trials and in other research settings. It is an important endpoint because HRQL captures the patients' perspective of outcome and can be used to complement clinical outcomes, to influence decision making and to inform consent for surgery. This chapter will consider the definition of HRQL, methods for developing HRQL tools, and the use of HRQL in outcomes research, and it will review the current and future research and clinical applications of HRQL assessment.
11.1 Introduction
J. M. Blazeby
University Hospitals Bristol, NHS Foundation Trust, Level 7, Bristol Royal Infirmary, Marlborough Street, Bristol BS2 8HW, UK
e-mail: [email protected]

Over the past decade, assessment of health-related quality of life (HRQL) has become a recognized outcome in randomized surgical trials and in other research settings. It is an important endpoint because HRQL assessment captures the patients' perspective of outcome. Information about the patients' perception of the benefits or harms of surgery can be used alongside traditional surgical outcomes to provide a comprehensive treatment evaluation. This information can potentially be used during surgical decision making and informed consent. It is essential that HRQL is measured and reported accurately, and an enormous amount of effort has been invested in developing and validating HRQL instruments. More recently,
T. Athanasiou (eds.), Key Topics in Surgical Research and Methodology, DOI:10.1007/978-3-540-71915-1_11, © Springer-Verlag Berlin Heidelberg 2010
research into the interpretation of HRQL outcomes has been published, and the influence of HRQL in clinical decision making is beginning to be understood. This chapter will consider the definition of HRQL, methods for developing HRQL tools, and using HRQL in outcomes research, and it will review the current and future research and clinical applications of HRQL assessment.
11.2 What is Quality of Life?

There is currently no internationally agreed definition for the construct of quality of life (QOL). In layman's terms, the phrase carries an abstract notion of happiness or satisfaction that may be related to current or past events and the individual's personal views of life. Within a scientific context, however, an assessment of QOL refers to the patients' perception of their health, where health is defined as a multidimensional construct, with physical, social, and emotional components. The phrase "quality of life," therefore, has the potential for confusion between a colloquial and a scientific definition, and more recently, the term "health-related quality of life" (HRQL) has become more widely used. HRQL, however, is also ill defined, but some fundamental evidence is accumulating to suggest that HRQL is a multidimensional construct that can include two or more domains of health. It is also generally accepted that a measure of HRQL is an individual's perception of how the illness or treatment impacts more than one of these key domains. The focus on the importance of patients themselves reporting outcomes has now led to a new terminology. "Patient reported outcomes" (PROs) are outcomes that assess any aspect of a patient's health and that come directly from the patient, without the interpretation of the patient's responses by a physician or anyone else [1]. Sometimes PRO measures may be confused with measures of QOL or HRQL. Although both refer to measurements made by the patients themselves, the broad definition of QOL means that it is multidimensional, whereas a PRO can focus much more narrowly, for example, on a single symptom such as pain. The key factor is to ask the patients themselves to rate the problem or symptom, because observers are poor judges of patients' views. Many studies have shown that observations made by doctors, nurses, carers, or research staff of patients' QOL or HRQL differ from those reported by patients themselves. Observer assessments may over- or underestimate the scores given by patients, and there are some trends in these assessments for particular conditions. Because of the potential for confusion with the terminology for QOL, HRQL, and PROs, it is recommended that surgical research that includes an assessment of QOL provides a clear definition of the domains of interest. In this chapter, HRQL refers to a multidimensional construct that is reported by the patient, and the phrase HRQL will be used throughout (Box 1 summarizes key definitions).

Box 1 Key definitions
Quality of life (QOL) – any aspect of health, life satisfaction, or happiness
Health-related quality of life (HRQL) – a multidimensional assessment of health
Patient reported outcome (PRO) – a self-report of one or more aspects of health
Health – physical, social, and emotional well-being
Item – a question that addresses a single domain of HRQL/QOL/PRO
Scale – two or more items that together assess a single domain of HRQL/QOL/PRO
Global item – a single item that assesses QOL/HRQL
11.3 The Purpose of Measuring HRQL

Accurate study of the effectiveness and cost effectiveness of surgical interventions is essential to evaluate new procedures. Outcomes are of the greatest concern for patients who require information about operative mortality, morbidity, survival, and relief of symptoms. Systematic outcome assessment of surgical procedures may also be used to inform purchasers and providers of health care. It is, therefore, critical to use common data sets to prospectively audit key outcomes. Most routine health services provide only aggregate information regarding the frequency of outcome events with no information linked to risk, the use of services, or patient-reported outcomes. Routine aggregate data are of some benefit, but may be misleading when coexisting diseases contribute to the outcome, and they are also limited in their perspective, focusing solely upon biomedical or economic endpoints. In addition to these standard outcomes, the importance of measuring patients' views has recently been recognized because they capture details that may have been overlooked and take a holistic view of the impact of treatment on psychosocial health as well as physical well-being. Although there has been a growing interest in assessing HRQL in clinical trials, there has
also been some evidence to suggest that the information does not influence clinical decision making. Two systematic reviews have analyzed how randomized trials in breast and prostate cancer include HRQL outcomes when reaching the conclusion of the trial [6, 8]. Overall, in only a small proportion of trials did the HRQL outcomes influence treatment recommendations. Another systematic review of trials in surgical oncology, however, demonstrated that HRQL outcomes influenced treatment recommendations, and the authors of the surgical trials stated how important HRQL outcomes were for fully informed consent [5]. It is possible that HRQL is a particularly relevant outcome in surgical decision making because surgery has an immediate impact on HRQL that is irreversible. Surgery also has long-term impacts on generic and disease-specific aspects of HRQL. Providing accurate information for patients undergoing surgery about the expected impact on HRQL, as well as morbidity, mortality, and functional outcomes, is critical for fully informed consent. This is an area where surgeons require training to ensure that the information is communicated, and currently, best methods for doing this have not been established. Box 2 summarizes key reasons for the evaluation of surgical procedures with patient-reported outcomes alongside traditional clinical endpoints.

Box 2 Key reasons for evaluating surgical procedures with measures of health-related quality of life to complement standard outcomes
1. A detailed assessment of the disease course or treatment side effects that is unavailable from the review of other endpoints
2. An assessment of the patients' perception of outcome, which frequently differs from health professionals' judgments of the patients' opinion
3. Patient-reported outcomes may influence surgical decision making where the benefits of surgery are marginal and the possible risks and negative impact on HRQL severe
4. Patient-reported outcomes may influence surgical decision making where there are nonsurgical treatments of equivalent clinical value, but with a different impact profile on HRQL
5. Baseline self-reported health data may have prognostic value
6. Patient self-reported health data may inform economic analyses
11.4 How to Measure HRQL

11.4.1 Types of Instruments

Over the past few decades, many instruments for measuring HRQL have been developed. These are mostly paper-based because of the prohibitive costs of undertaking individual patient interviews with detailed qualitative analyses. Questionnaires are composed of items and scales. Items are single questions that address a single aspect of HRQL, whereas scales are two or more items that together address a single domain of HRQL (e.g., a pain scale). Although single items may address a specific symptom or problem, some single items address global concepts, for example, the single global question, "Overall, what would you say your QOL has been like during the past week?" Global questions are interesting because their simplicity is attractive, but many aspects of life, not just health issues, influence individuals' views of QOL. This means that within the context of evaluating health care, global items can be difficult to interpret because of the different and often competing aspects of life that impact overall QOL. For clinical practice, therefore, where specific domains of health are treated and are deliberately being investigated, it is recommended to use a multidimensional assessment of HRQL that addresses key components of physical and psychosocial health. Questionnaires that are used to assess HRQL require careful development and full clinical and psychometric validation testing to be certain that they are fit for the purpose for which they are designed. Although many questionnaires exist, the quality of their supporting evidence is variable, and this needs to be considered when choosing an instrument. The traditional methods for developing questionnaires are still progressing, and modern techniques including computer adaptive testing and item response theory are likely to provide more precise measures of HRQL [7, 9].
This section will describe standard types of HRQL measures and consider how modern test theory will impact HRQL assessment.
11.4.1.1 Generic Instruments

Generic measures of HRQL are intended for general use within healthy populations or any general disease group. They all include an assessment of physical
health and a selection of other key domains of HRQL. The Medical Outcomes Study 36-Item Short Form (SF-36) is probably the most widely used generic measure of HRQL, and the EuroQol is also influential because of its suitability for cost-utility analyses [4]. Within surgical studies, these measures are often unable to detect specific symptoms or functional consequences of an intervention because they do not contain scales and items addressing these problems. If used alone to evaluate surgery, the results may therefore be misleading, because such measures are not sufficiently sensitive to small beneficial or detrimental consequences of surgery. They are, however, very useful for making comparisons between disease groups.
J. M. Blazeby
11.4.1.2 Disease-Specific Instruments

Disease-specific measures focus upon HRQL problems that are relevant to specific disease groups or disease sites. For example, disease-specific measures for patients with cancer have been developed. Some of these are designed with a modular approach, with a core disease-specific tool that is relevant to all patients within that disease group (e.g., cancer), and site- or treatment-specific add-on modules that supplement the core measure (e.g., breast cancer specific). Site-specific questionnaires address particular symptoms or morbidity of treatment. In surgical oncology, where HRQL assessment is an increasingly important outcome, several disease-specific tools are widely available. The EORTC QLQ-C30 is a cancer-specific 30-item questionnaire. It incorporates five functional scales (physical, emotional, social, role, and cognitive), a global health scale, and nine single items assessing symptoms that commonly occur in patients with cancer [2]. All items in the questionnaire have four response categories, except for the global health scale, which uses a seven-point response ranging from very poor to excellent. High scores represent higher response levels, with high functional scale scores representing better function and higher symptom scores representing more or worse symptoms. The questionnaire is available in over 50 languages and is widely validated and used in international clinical trials in oncology. The functional assessment of chronic illness therapy (FACIT) measurement system is a similar modular disease-specific system with a core tool for patients with chronic illness and supplementary modules [3]. Although widely used in cancer, noncancer-specific FACIT questionnaires are also available. The FACT-G, originally designed for patients with cancer, has 27 items addressing four HRQL domains: physical, social, emotional, and functional well-being. Its response categories are similar to those of the EORTC QLQ-C30, with an additional "somewhat" response. Some items are phrased positively and some negatively. Scoring may be performed for the four scales, or an overall summary score may be derived by summing item responses. The items refer to how patients have been feeling in the past week. This questionnaire is supported by very high quality clinical and psychometric data. Specific modules to improve the sensitivity and specificity of the EORTC QLQ-C30 or FACIT systems have been developed and tested for most cancer sites.
11.4.2 Developing and Validating HRQL Instruments

Developing measures to assess HRQL requires attention to detail, multidisciplinary expertise, and resources. Investing time during the early phases of questionnaire development will help to avoid problems during psychometric testing and ensure that the tool is comprehensive and able to assess HRQL in the population for which it is intended. Careful documentation of the development of a new measure will also provide the evidence needed to demonstrate that the process is robust and based upon patients' views and opinions, rather than those of health professionals. There are several well-described phases of questionnaire development, and it is essential to follow them sequentially.
11.4.2.1 Literature Search

Before creating a new HRQL measure, it is important to have a working definition of HRQL for the intended research question. This will identify the dimensions of HRQL to be included in the new tool. It is necessary to consider whether the new tool will assess generic aspects of HRQL or particular problems related to the disease
11
Health-Related Quality of Life and its Measurement in Surgery – Concepts and Methods
and treatment. A detailed literature search in relevant medical and psychosocial databases will identify existing instruments or scales that address relevant HRQL issues. Following the literature search, it will be possible to generate a list of potential HRQL issues to be considered for inclusion in the instrument.
11.4.2.2 Selection of HRQL Domains in the Instrument

After compilation of the initial list of HRQL issues, expert review is required to check for completeness and face validity. It is essential that a multiprofessional group does this, including specialists, generalists, doctors, nurses, and other health professionals. The HRQL issues need to be described as succinctly as possible, with minimum overlap in content. Rare or unusual HRQL issues that are sometimes experienced by patients should be retained at this stage. Patients themselves should also review the content of the list of HRQL issues. Although reviews by patients were traditionally informal and not documented in detail, there is an increasing need to undertake formal, in-depth, tape-recorded qualitative interviews with patients to consider questionnaire content, because of the increasing pressure to document that HRQL measures are patient generated. A broad range of patients should be interviewed: patients with different categories of disease severity and patients who have experienced the range of potential treatments for the disease of interest, as well as a purposeful sample of patients of mixed sociodemographic background. Following this comprehensive assessment of the literature, health professionals, and specific patients, the list of HRQL issues to be included in the questionnaire will be complete. These issues then need to be transformed into specific items (questions) using standard questionnaire guidance.
11.4.2.3 Writing Items and Scales

Questions should use comprehensible language (brief and simple), and each question should assess just one HRQL issue. It is a common error to attempt to assess two dimensions of HRQL in one item, e.g., "Have you had nausea and vomiting?" This question may confuse respondents who suffer severe nausea but do not actually vomit. It is also recommended to avoid double
negatives because of their potential to mislead respondents. At this phase of questionnaire development, the response format for the questionnaire is agreed upon. It is generally recommended to avoid binary Yes/No responses and instead to use an ordinal scale on which the patient can respond along a range from absent to severe (e.g., from "not at all" to "very much"). Most questionnaires assign integers to each response category so that the question can be scored. The layout and format of the questionnaire, using large font, underlining, and bold text, will help draw patients' attention to particular questions or instructions. It is essential that colleagues and collaborators with experience of designing and developing questionnaires review the provisional tool, as well as the multiprofessional group involved in selecting the HRQL issues.
11.4.2.4 Pretesting the Provisional HRQL Instrument

Pretesting the provisional HRQL questionnaire is undertaken to ensure that the target population understands the newly created questions. This essential phase of questionnaire development also checks that the wording and formatting of the tool are straightforward and not confusing or offensive. It involves approximately 10-20 patients who represent a range of the target population. If problems with ambiguous questions or difficult phrasing are identified, the revised items require repeat pretesting. In this phase, patients complete the whole tool and are subsequently interviewed to consider each item separately. This phase also allows some general questions about the whole questionnaire to be asked, such as whether any questions relevant to the patient are missing and how much time or assistance was required to complete the questionnaire.
11.4.2.5 Clinical and Psychometric Validation of HRQL Instruments

It is essential that an instrument to measure HRQL has good measurement properties. These include validity and reliability, with data to demonstrate that the instrument produces results that are reproducible, sensitive, and responsive to changes in clinically relevant aspects of
HRQL. It is traditional to validate a new measurement tool by comparing its output with that produced by a gold standard instrument. There is, however, no gold standard measure of HRQL. Validity is, therefore, inferred for each tool by compiling several pieces of evidence. Three main areas need to be considered: content, criterion, and construct validity. Content validity concerns the extent to which the HRQL items address the intended HRQL areas of interest. Evidence that the instrument fulfils this aspect of validity is collected during the early phases of questionnaire development described above. Criterion validity considers whether the new questionnaire shows anticipated associations with external criteria assessed with already validated tools (e.g., an HRQL scale addressing pain is related to other pain measures or to the requirement for pain relief). Construct validity examines the theoretical associations of the items in the questionnaire with each other and with the hypothesized scales. This is examined by investigating both expected convergent and divergent associations. For example, a scale assessing fatigue would be expected to have convergent associations with physical function, but little correlation with some other aspects of health (e.g., taste). The reliability of a measurement tool concerns the random variability associated with each measurement. Where the HRQL of a patient is stable between two time points, it is expected that HRQL scores will be similar on both occasions (not subject to random error). Becoming familiar with the reliability of a tool is essential for its use in everyday clinical practice. A measurement of hemoglobin may be 12 or 13 g/dL on two separate occasions 1 week apart. Provided the patient does not show any evidence of blood loss, these two measures would not be of concern, because it is accepted that the reliability of a full blood count is within these boundaries.
This type of reliability for HRQL questionnaires is formally tested with test-retest methodology. Patients complete the HRQL measure on two separate occasions when their health is stable. The correlation between the two measures is examined and is expected to be high. Interrater reliability examines the agreement between two observers' assessments of HRQL. Since patients themselves are usually regarded as the best assessors of their own health, interrater reliability may only need to be examined in situations where patients are unable to self-report, and it is essential to
use a proxy to assess HRQL (e.g., severe neurological conditions). A sensitive measure of HRQL will be able to detect HRQL differences between groups of patients expected to differ clinically; testing this is often referred to as known-groups comparison. For example, patients with advanced metastatic cancer are expected to report worse HRQL than patients with localized disease. Responsiveness is similar, but relates to the ability of an instrument to detect improvement or deterioration within an individual. All these aspects of clinical and psychometric validation are important, and the process of demonstrating these features of HRQL tools is continuous. There are currently no internationally agreed standards for the minimum amount of evidence required to establish that a tool is valid and reliable; rather, it is a cumulative process. Indeed, after the initial validation and publication of a new tool, it is very important for independent groups to further test its measurement properties and to produce data that further support or refute its validity and reliability. Internal consistency is the other commonly used method to test the reliability of a tool. This refers to the extent to which the items in a scale are related to one another. Cronbach's alpha coefficient is the most widely used statistic for this purpose.
11.5 Reporting Standards of HRQL in Randomized Controlled Trials and Other Research Settings

While the use of HRQL instruments in clinical research has continued to increase steadily over recent years, many studies are still characterized by inadequate reporting. This probably reflects inadequate study design; both robust design and high quality reporting are required. The following section provides guidance on the issues that need consideration when including an assessment of HRQL in a clinical trial or another type of research study. Robust HRQL study design and detailed reporting will allow peer reviewers and subsequent readers of the research to assess the validity of the results and to reproduce the methods if desired. This process is summarized in Box 3.
Box 3 Key issues to consider when assessing HRQL in a randomized trial, longitudinal, or cross-sectional study

Study objective
- Is HRQL a primary or secondary endpoint?
- Which aspects of HRQL are of particular interest?
- At which time points will HRQL change?

HRQL questionnaire
- Does it have relevant clinical and treatment-related HRQL domains?
- Are the response categories appropriate to the study question?
- Does it have documented validity and reliability?
- Is it sensitive to expected HRQL changes?
- Has it been used before in this patient population? (did it work?)
- How long will it take to complete?
- Availability of translations (if applicable)
- Is it completed by the patients themselves?

Study population
- Socio-demographic details (level of education, gender, age, native language, and cultural variation)
- Clinical details (performance status, level of anxiety, disability)

Time points
- Select timing of assessments to capture relevant HRQL changes
- Review the time frame of the questionnaire
- Minimize frequency of assessments to reduce respondent burden
- Minimize frequency of assessments to simplify data analyses

Data analyses
- Define HRQL hypotheses
- How is the questionnaire scored?
- How are changes in HRQL interpreted (minimally important difference, clinical relevance)?
- Analyses plan (e.g., to account for multiple assessments)
- Dealing with missing data for random or nonrandom reasons

Data collection
- Mode of administration: self-completion, clinician-completion, face-to-face interview, or telephone
- Clear practice to follow in the event of missing assessments
- Clear practice to follow in the event of missing items/pages
- Document reasons for missing data

Practical issues
- Cost (license, printing, postage, training of personnel)
- Check the latest version of the questionnaire
- If using a battery of measures, consider the order of the questionnaires
- Check compliance and plan measures to improve it, if necessary
11.5.1 Choosing a HRQL Instrument for the Research

The appropriate and adequate assessment of HRQL within a trial or another research setting depends on clearly defining the objectives and endpoints of the project. It is essential to initially decide whether HRQL
is the primary or secondary endpoint of the study. This will help select a suitable instrument to assess HRQL, in order to match the specific characteristics of the questionnaire to the objectives of the trial. The choice of instrument will also depend upon the size and sociodemographic features of the study population. Level of education, gender, culture, native language, and age range are important factors that will determine whether
questionnaires are completed, and completed accurately. The specific nature of the anticipated effects, and the expected time for them to be exerted, will influence the type of HRQL measure selected and the timing and frequency of its administration. A measure should be selected that is sensitive enough to capture both the positive and negative effects of the intervention, as the side effects of treatment (such as postsurgical pain or fatigue) may outweigh the potential benefits to HRQL (such as symptom alleviation) perceived by patients. An understanding of whether these effects are transient or long-lasting will further influence the timing of the HRQL assessments and how assessments from different time points should be interpreted. Consideration of the complexity and nature of a questionnaire's scoring system as part of instrument selection will help decide whether the HRQL data can be used to address the trial objectives. Overall summary scores may be obtained with some HRQL instruments; these have the attraction of producing a single score that can be used to compare different populations and patient groups. Overall scores, however, may fail to identify where interventions lead to improvement in one aspect of HRQL but deterioration in another. It is, therefore, recommended that multidimensional questionnaires with relevant symptom and functional scales are used to provide HRQL data to inform treatment decisions. Finally, there are practical issues to consider during instrument selection: the size, timing, length, and frequency of measurement; the mode of administration; the availability of translations; the cost of the questionnaire; and other resource considerations (postal survey vs. clinical setting, self- vs. interviewer-completion). As considered above, standardization of the administration process is critical to ensure unbiased data collection.
Many HRQL questionnaires have not been validated outside their country of origin, and ensuring that high quality translations are available in the target languages is, therefore, essential for an international study.
11.5.2 Determining the HRQL Sample Size

Determining the number of participants required is an integral part of any clinical trial, and this provides evidence to justify the size of the study and to confirm that it is powerful enough to answer the questions it is
designed to address. It is wasteful to recruit more participants than necessary, but underrecruitment, a more common occurrence in clinical trials, is worse practice: the effort invested in the study is wasted if it is insufficiently powerful to demonstrate a clinically important effect and to answer the questions it was designed to address. All these issues also apply when incorporating HRQL in a trial. If HRQL is the primary trial endpoint, it is essential to adhere to the above standards. In trials where HRQL is a secondary endpoint, however, it is uncommon for sample size calculations to be performed for the HRQL outcomes, and the size of the trial is instead dictated by the primary endpoint. One difficulty in undertaking an HRQL sample size calculation is that there is often little HRQL evidence on which to precisely estimate the effect size; it is thus common practice to consider a range of expected benefits and possible sample sizes and to assess their adequacy to power the study. Different sample size methods may lead to different estimates, but the final approach should be chosen based on relevance to the trial objectives, including the type and number of HRQL endpoints, the proposed analyses, and available information on the underlying assumptions of expected benefit. Sample size calculations must also take into account possible nonresponse and loss to follow-up, as well as any planned control for confounding and examination of subgroup effects. While HRQL trials may include numerous HRQL outcomes, it is recommended that a maximum of four or five be included in formal analyses, taking into account the effect of multiple significance testing.
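To make these trade-offs concrete, the standard two-sample normal-approximation formula for comparing mean HRQL scores between two arms can be sketched as follows; the effect size, standard deviation, and dropout figures in the example are illustrative assumptions, not recommendations:

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(delta, sd, alpha=0.05, power=0.80, dropout=0.0):
    """Participants needed per arm to detect a difference in mean HRQL
    score of `delta` (in questionnaire units) with standard deviation
    `sd`, two-sided significance level `alpha` and the stated power,
    inflated for an anticipated `dropout` proportion."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # e.g., 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # e.g., 0.84 for 80% power
    n = 2 * ((z_alpha + z_beta) * sd / delta) ** 2
    return ceil(n / (1 - dropout))                  # inflate for loss to follow-up

# e.g., a 10-point difference on a 0-100 scale, SD 25, 20% dropout:
# n_per_arm(10, 25, dropout=0.2)
```

The inflation step reflects the point made above: nonresponse and loss to follow-up must be anticipated at the design stage, not discovered at analysis.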
11.5.3 The Timing of HRQL Assessments

Exactly when HRQL should be measured is a crucial part of the trial design, in terms of gathering data at relevant time points to enable the objectives of the trial to be adequately addressed. A mandatory baseline assessment prior to randomization and the start of treatment is recommended for all randomized trials assessing HRQL, in order that changes due to treatment and disease status can be measured and to check whether there is equivalence of baseline characteristics in the treatment and comparison groups. There is also evidence to indicate that baseline HRQL may be a valuable prognostic marker for clinical outcomes, such
as survival or response to treatment. In study designs evaluating surgical procedures, a pretreatment assessment of HRQL is also recommended to provide baseline data. Capturing HRQL before treatment can be difficult, but it is essential, because it is extremely difficult for patients who have undergone surgery to look back and reflect on their symptoms and functional ability before the operation. In a cross-sectional study, it is not possible to capture baseline (before start of treatment) HRQL data. Choosing the time points for follow-up of HRQL will depend upon the research question, the resources, the natural history of the disease, and the likely side effects of treatment. These assessments can be (i) time-based, involving administration a specific number of days/weeks after randomization, regardless of the treatment schedules, or (ii) event-based, dependent on specific treatment cycles or on serious or acute effects. A combination of both approaches may be suitable in some cases and allows treatment delays to be taken into account. The relevance of timings should be carefully considered, as different timings can lead to different results. Other issues to consider are the time scale of the questionnaire being used (e.g., current status versus recall of symptoms and HRQL in the past week), the frequency of assessments, and the accessibility of patients at various time points (e.g., assessments during clinic visits will be limited to the timings of the appointments). Posttreatment assessments should be timed according to the research hypotheses and whether or not HRQL was specified as a primary endpoint, although these should be undertaken at equal times in each arm with respect to randomization, rather than the end of treatment, to avoid bias. The practicalities of obtaining HRQL assessments up until death and, if necessary, at relapse should also be considered, particularly if patients are to be withdrawn from the study at relapse.
Although it is theoretically tempting to obtain as many HRQL assessments as possible, they should be kept to a minimum to avoid overburdening patients (and thus to increase compliance) and to simplify data collection and analyses.
11.5.4 Missing Data

Difficulties with questionnaire compliance are commonplace in many studies, partly because investigations are undertaken in patients with a poor survival prognosis
or illness. Compliance with questionnaire completion should be as high as possible (ideally over 80%); this is critical for answering the research question and avoiding response bias. A number of factors can be addressed to improve compliance and reduce missing data. Missing data may take two main forms: (i) item nonresponse, where data are missing for one or more individual items, or (ii) unit nonresponse, where the whole questionnaire is missing for a participant, because of missing forms or the participant dropping out of the study or entering it late. Causes of missing data include poor study design (unclear timing of assessment and poor choice of questionnaire), administrative factors (failure of research staff to administer questionnaires), and patient-based factors (deteriorating health and refusal to complete questions). Consideration of the feasibility of the trial design, attention to protocol development, training of staff, and provision of adequate resources may address organizational problems. Uniformity of the protocol and HRQL assessment across participating centers, and regular communication between the study coordinator, local investigators, the data manager, and those responsible for administration of the questionnaires, are also essential. A pilot study may be beneficial in improving organization and preparing for unforeseen difficulties. Patient-based sources of missing data may be tackled by choosing an appropriate instrument and by attending to participants' motivation: providing a clear explanation of why, when, and how HRQL assessments will be made, whom participants may contact for help, and what will happen to the data they provide (an information sheet detailing confidentiality and data dissemination is typically required by most ethics committees). Proxy completion by a carer, clinician, or significant other may be considered in the event of missing data due to participants' inability to complete the questionnaire.
This should be considered prior to the start of the trial if it is anticipated that missing data might be a significant problem. It should be noted, however, that there is evidence of differing levels of concordance between patients and proxies according to the dimension being assessed, which may introduce an element of bias into the HRQL assessments. While it is possible to prevent much missing data, a certain amount should be expected within any study. The protocol should contain clear instructions on the procedures to follow in the event of a missing questionnaire or missing data for individual items, such as whether the participant should be contacted. In all
cases, it is good practice to maintain a record of the reasons for missing data, in order to ascertain the extent to which this was related to the patient's HRQL and to inform the analyses and interpretation of the data. The relative amount of missing data, the assumed cause, the sample size, and the type of intended analyses will determine the degree to which missing data are a problem and will critically inform the interpretation of the results of the study.
11.5.5 Dealing with Missing Data

Poor compliance resulting in missing data can have a significant detrimental impact on the analyses and interpretation of HRQL data. Firstly, fewer observations may compromise the power of the study to detect an effect. Secondly, missing data may introduce selection bias into a trial and thus compromise its validity, particularly if low compliance is associated with less well patients who have a poorer HRQL, as many studies have shown to be the case. It is, therefore, important that the impact of missing data is carefully addressed and that the potential cause of the missing data is understood, as the most appropriate method for dealing with it will depend largely on the assumed mechanism by which it is missing. If the reasons for missing data are completely unrelated to the respondent's HRQL, the data are classified as missing completely at random. If the likelihood of missing data depends only on previous scores, but not on current or future scores, the data are considered missing at random. Data are considered not missing at random only if the "missingness" is actually related to the value of the missing observation. The relevant approach for dealing with each type of missing data varies depending on whether potential bias due to data not missing at random needs to be addressed. Methods are available to test the assumption of whether or not data are missing at random; having a record of the reasons for missing data in a trial is particularly important here. Missing values for individual items may be imputed (filled in) to complete the data so that a full analysis may be undertaken. Most commonly, the mean of the answered questions is imputed, provided that at least half the items are completed, although this may not be suitable for scales whose items are ordered hierarchically according to difficulty. Other less common
approaches include imputation based on the responses of other respondents or regression of missing items on other scale items. Some instruments come with instructions on how to score a scale when items are missing. Methods for dealing with missing whole questionnaires/assessments in longitudinal (repeated measures) studies, when missing data are assumed to be ignorable, include complete case analysis, which involves removing patients with incomplete data, and available case analysis, which uses all available data on a specific variable and is consequently more desirable. Alternatively, data can be filled in using imputation-based methods such as last observation carried forward, single mean imputation, predicted value imputation, or hot-deck imputation. Statistical techniques such as likelihood-based methods have been developed for nonignorable missing data, but they are complex and controversial and should, therefore, be applied with caution. More sophisticated models have been developed specifically for nonignorable missing data, which are likely in HRQL trials, but they are limited by their complexity and lack of availability and interpretability. They are often accompanied by sensitivity analyses, which are used to compare the appropriateness of employing a given strategy.
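The "half rule" mean imputation for individual missing items, mentioned above, can be sketched as follows; this is a minimal illustration of the general technique, not the scoring rule of any particular instrument:

```python
import numpy as np

def half_rule_impute(scores):
    """Impute missing item scores within one scale with the mean of the
    respondent's answered items, provided at least half the items were
    answered (the common 'half rule'); otherwise treat the scale as
    missing and return None. NaN marks a missing item."""
    scores = np.asarray(scores, dtype=float)
    answered = ~np.isnan(scores)
    if answered.sum() * 2 < scores.size:
        return None                               # too few items answered
    filled = scores.copy()
    filled[~answered] = scores[answered].mean()   # substitute the respondent's own mean
    return filled
```

Note that this imputes within a respondent, not across respondents, so it preserves each patient's overall response level; as the text cautions, it is unsuitable for hierarchically ordered (Guttman-style) scales.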
11.5.6 Analyses of HRQL Data

The analyses of HRQL data require appropriate expertise, which may be difficult to find because of the specific complexities of dealing with multiple HRQL assessments and missing data, as described above. Many of these issues can be overcome by carefully identifying a priori one or two HRQL outcomes of principal interest and by working with a statistician from the outset of the study. If the research question and study design are clearly stated in the protocol and the HRQL hypothesis is also established, analysis of HRQL data should be straightforward, and over-exploration and over-interpretation of the data avoided. Using descriptive statistics to illustrate the impact of surgery on HRQL has an important role in the clinical interpretation of HRQL scores. At present, there is still a lack of general understanding about the impact of most surgical treatments on HRQL, and using graphical methods to illustrate these changes will aid patients and surgeons. It is not within the scope
of this article to describe the details of all the issues involved in reporting HRQL results, but a multidisciplinary approach to writing the protocol and to analyzing and interpreting the results with surgeons, statisticians, and social scientists should achieve outputs that are accurate and understood by the relevant audience.
11.6 The Future Role of HRQL in Evaluating Surgery

Over the past two decades, methods for the provision of accurate patient-centered outcome data have become established for most surgical settings. Standards for using these tools in randomized trials, longitudinal series, and cross-sectional studies are also being established. The potential of evaluating surgery with a combination of traditional clinical and biomedical outcomes and patient-reported outcome data is, however, only just being realized, and many clinicians remain unfamiliar with standard instruments to assess HRQL, questionnaire scoring systems, and the clinical interpretation of the results. It is also uncertain how to communicate HRQL data to patients themselves and whether this type of information will influence surgical decision making. Further work is needed in each of these areas; ensuring an organic collaboration between surgeons, statisticians, social scientists, and patients themselves will mean that patient-centered outcomes are appropriately
139
incorporated into surgical research and their role in everyday clinical practice will become clear.
Surgical Performance Under Stress: Conceptual and Methodological Issues
12
Sonal Arora and Nick Sevdalis
Contents

12.1 Introduction ............................................................ 141
12.2 Part 1: Models and Theories of Stress .................. 142
12.2.1 Systemic Stress: Selye’s Theory .............................. 142
12.2.2 Psychological Stress: The Lazarus Theory .............. 142
12.2.3 The Yerkes–Dodson Law ......................................... 143
12.3 Part 2: Measures of Stress in Surgery .................. 143
12.3.1 Objective Measures of Stress ................................... 143
12.3.2 Subjective Measures of Stress .................................. 144
12.3.3 Combined Subjective and Objective Measures of Stress .................................................... 144
12.4 Part 3: Measures of Performance in Surgery ...... 144
12.4.1 Measures of Technical Performance ........................ 145
12.4.2 Measures of Non-Technical Performance ................ 145
12.5 Part 4: Impact of Stress on Surgical Performance ....................................... 146
12.6 Part 5: Discussion .................................................. 146
12.7 Implications for Surgical Training ....................... 147
12.8 Future Research Agenda ....................................... 148
References ........................................................................... 149

Abstract This chapter provides an overview of surgical performance under stressful conditions, often present in the operating room (OR). We begin with an overview of models and theories of stress and their relationship to human performance. We then present the current state of the art in the measurement of stress in the context of surgery and measures of surgical performance. Finally, we summarise evidence on the impact of stress on performance in the OR. We conclude with a discussion of the implications of the existing evidence base on surgical stress for the training of junior surgeons and propose directions for future empirical research.
S. Arora ()
Department of Biosurgery and Surgical Technology, Imperial College London, 10th floor QEQM, St. Mary’s Hospital, South Wharf Road, London W2 1NY, UK
e-mail: [email protected]

12.1 Introduction

Stress has become a common denominator in today’s fast-paced, complex society. Anecdotal evidence suggests that people experience stress in relation to all aspects of their lives – including personal and family relationships, social encounters and, perhaps most importantly, professional life. Surgery is no exception. Acute or chronic, stress is present in all facets of surgery – an inherently risky occupation in which performing in the OR is itself a considerable source of pressure. Organisational issues, new technologies, and relationships with colleagues and other health care staff can also compound the stress that a surgeon experiences daily – as can family and personal problems [1]. Despite the obvious prevalence of stress in surgery, little empirical research has been carried out that addresses either the sources of stress or its impact on surgeons’ performance. More recently, however, developments in the delivery of surgical services and in surgical training have led to stress receiving significant attention from surgical researchers and trainers alike.
T. Athanasiou (eds.), Key Topics in Surgical Research and Methodology, DOI:10.1007/978-3-540-71915-1_12, © Springer-Verlag Berlin Heidelberg 2010
These developments include the following:

• Changes in the way surgical care is delivered: working hours are being reduced (for instance, through the European Working Time Directive [2]), and thus shift-working is increasing.
• Changes in surgical training: high-fidelity simulators are becoming increasingly available. In addition, there is an increasing realisation that the apprenticeship model of learning (“see one, do one, teach one”) is not the most appropriate for the development of surgeons in terms of technical [3] and non-technical skills (e.g. leadership and decision-making) [4].
• Increased prominence of patient safety concerns: the safety of medical and surgical patients is becoming an increasingly prominent feature of their care through well-publicised reports from the Institute of Medicine [5], the UK’s Department of Health [2] and peer-reviewed evidence [6, 7].

The present chapter provides an overview of surgical performance under stressful conditions, often present in the OR. We begin with an overview of models and theories of stress and their relationship to human performance (Part 1). We then present the current state of the art in the measurement of stress in the context of surgery (Part 2) and measures of surgical performance (Part 3). Finally, we summarise evidence on the impact of stress on performance in the OR (Part 4). We conclude with a discussion of the implications of the existing evidence base on surgical stress for the training of junior surgeons. We also propose directions for future empirical research (Part 5).
12.2 Part 1: Models and Theories of Stress

Stress is typically defined as the bodily processes that result from circumstances placing physical or psychological demands on an individual [8]. The external forces that impinge on the body are “stressors”. A stressor is defined as any real or imagined event, condition, situation or stimulus that instigates the onset of the human stress response within an individual [9]. Stress is only present when demands outweigh perceived resources [10]. There are two main categories of theories that focus on the specific relationship between external demands (stressors) and bodily processes (stress). The first approach is rooted in physiology and psychobiology and is known as the “systemic stress approach” [11]. The second approach has been developed within the field of cognitive psychology and is known as the “psychological stress approach” [12, 13].
12.2.1 Systemic Stress: Selye’s Theory

This approach stems largely from the work of the endocrinologist Hans Selye, who observed in animal studies that a variety of stimulus events (e.g. heat, cold and toxic agents) are capable of producing common effects that are not specific to any one stimulus. These non-specifically caused common effects constitute the stereotypical, i.e. specific, response pattern of systemic stress. According to Selye [11], stress is defined as “a state manifested by a syndrome which consists of all the non-specifically induced changes in a biologic system”. This state has been termed the “General Adaptation Syndrome”. Key criticisms of this approach include its failure to address the psychological aspects of stress and the question of whether human responses to stressors mimic those of animals.
12.2.2 Psychological Stress: The Lazarus Theory

In this approach, stress is regarded as a relational concept, i.e. the relationship (“transaction”) between individuals and their environment [12]. This transactional, or interactional, approach focuses on the thoughts and awareness that determine individuals’ stress responses. Psychological stress refers to a relationship with the environment that the person appraises as significant for his or her well-being and in which the demands tax or exceed available coping resources [14]. In this approach, two key mediators operate within the person–environment transaction: (i) cognitive appraisal – individuals’ evaluation of the significance of what is happening for their well-being – and (ii) coping – individuals’ efforts in thought and action to manage specific demands [15]. The concept of appraisal is based on the idea that emotional processes (including stress) depend on individual expectancies of the significance and outcome of an event. This helps to explain individual differences in the quality, intensity and duration of an elicited emotion in environments that appear very similar. “Primary appraisal” is the individual’s evaluation of an event as a potential hazard to well-being. “Secondary appraisal” is the individual’s evaluation of his or her ability to handle the event. As a function of these appraisals, levels of experienced stress depend on the subjective interpretation of whether an event poses a threat to the individual (primary appraisal) and whether there are resources to cope with the stressor (secondary appraisal). Coping behaviours follow these appraisals. According to Lazarus and Folkman [13], coping is defined as “the cognitive and behavioural efforts made to master, tolerate, or reduce external and internal demands and conflicts among them”. Coping includes attempts to reduce the perceived discrepancy between situational demands and personal resources [16].
12.2.3 The Yerkes–Dodson Law

One of the few themes that spans virtually all existing evidence on stress and human performance is that performance peaks when the subject is in some optimal stress or arousal state, above or below which efficiency of performance decreases – an idea known as the Yerkes–Dodson law [17] (Fig. 12.1). The optimal stress or arousal state decreases with increasing task difficulty.
Fig. 12.1 The Yerkes–Dodson law of the relationship between stress/arousal and performance
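The inverted-U relationship sketched in Fig. 12.1 can be illustrated with a toy model (our illustration, not from the chapter): performance is modelled as a Gaussian function of arousal whose optimum shifts to lower arousal as task difficulty increases. The specific functional form and parameter values below are assumptions chosen purely for illustration.

```python
import numpy as np

def performance(arousal, difficulty):
    """Toy inverted-U model of the Yerkes-Dodson law (illustrative only).

    Performance is a Gaussian in arousal; the optimum shifts to lower
    arousal as task difficulty rises, echoing the observation that the
    optimal arousal level decreases for harder tasks. All constants here
    are hypothetical.
    """
    optimum = 0.7 - 0.3 * difficulty   # harder task -> lower optimal arousal
    width = 0.25                       # spread of the inverted-U curve
    return np.exp(-((arousal - optimum) ** 2) / (2 * width ** 2))

arousal = np.linspace(0.0, 1.0, 201)
easy = performance(arousal, difficulty=0.0)   # peak near arousal = 0.7
hard = performance(arousal, difficulty=1.0)   # peak near arousal = 0.4
```

Under this sketch, both curves peak at an interior arousal level (too little or too much arousal degrades performance), and the peak for the hard task sits at lower arousal than for the easy task.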
12.3 Part 2: Measures of Stress in Surgery

Stress in humans involves a physiological response and a cognitive-behavioural component. In assessing stress, therefore, both components should be measured systematically. In this section, we present objective measures of stress, which capture the former, and subjective measures of stress, which capture the latter, in the context of surgery.
12.3.1 Objective Measures of Stress

Heart rate (HR): The normal physiological response to stress (“fight or flight”) results in an endogenous catecholamine release leading to increased cardiac activity, which can be detected by measuring the heart rate. Studies [18–23] have used HR as a proxy measure for stress, finding the mean HR to be elevated in surgeons during an operation [18, 21], but also that experience moderated the effect of stress, with seniors exhibiting less change in HR than juniors [20].

Heart rate variability (HRV): HRV is the oscillation in the interval between consecutive heart beats and between consecutive instantaneous HRs [24–26]. It is an indicator of the sympathovagal balance during surgery [27], i.e. an index of autonomic function. As stress has been linked to increased sympathetic and parasympathetic activity [28], changes in mental stress that alter autonomic activity can affect HRV, and HRV has been used as an indirect measure of stress/mental strain in several studies [19, 25, 26]. Power spectral analysis of HRV allows assessment of the sympathovagal activities regulating the HR by quantitatively evaluating beat-to-beat cardiac control. Spectral components include a low-frequency component (LF), which rises with increased sympathetic activity [29], and a high-frequency component (HF) [25], which rises with increased vagal activity [24]. The LF/HF ratio therefore gives an overall picture of the autonomic nervous system (ANS) [24, 27]: the higher the ratio, the greater the stress. Operating has been found to affect HRV [26]. However, as with HR, the effects of physical activity and mental stress on HRV cannot be separated, and the measure is subject to individual differences.

Skin conductance: Skin conductance level is known to rise with increased stress [30] and has been used as an objective measure of the activity of the sympathetic nervous system [31–35].
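The LF/HF computation described above can be sketched as follows. This is a minimal sketch, not from the chapter: the LF (0.04–0.15 Hz) and HF (0.15–0.4 Hz) band limits and the 4 Hz resampling rate are conventional choices that we assume here.

```python
import numpy as np
from scipy.signal import welch
from scipy.integrate import trapezoid

def lf_hf_ratio(rr_s, fs=4.0):
    """Estimate the LF/HF ratio from a series of RR intervals (seconds).

    Sketch only: resample the irregular RR series onto an even time grid,
    estimate its power spectrum with Welch's method, then integrate the
    assumed-standard LF (0.04-0.15 Hz) and HF (0.15-0.4 Hz) bands.
    """
    rr_s = np.asarray(rr_s, dtype=float)
    t = np.cumsum(rr_s)                       # beat times (s)
    grid = np.arange(t[0], t[-1], 1.0 / fs)   # even time grid at fs Hz
    tachogram = np.interp(grid, t, rr_s)      # evenly resampled RR series
    tachogram -= tachogram.mean()             # remove the DC component
    f, pxx = welch(tachogram, fs=fs, nperseg=min(256, len(tachogram)))
    lf_band = (f >= 0.04) & (f < 0.15)
    hf_band = (f >= 0.15) & (f < 0.40)
    lf = trapezoid(pxx[lf_band], f[lf_band])  # low-frequency power
    hf = trapezoid(pxx[hf_band], f[hf_band])  # high-frequency power
    return lf / hf
```

On simulated RR data, modulating the intervals at roughly 0.1 Hz (the sympathetic range) pushes the ratio above 1, while modulating at roughly 0.3 Hz (the respiratory/vagal range) pulls it below 1.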
Table 12.1 Aggregated movement of stress indicators across cases (Arora et al., 2009)

STAI (self-report)   HR elevated,        HR elevated,        HR normal,          HR normal,          Total
                     cortisol elevated   cortisol dropped    cortisol elevated   cortisol dropped
Elevated                    15                  6                   1                   1              23
Dropped                      1                  4                   1                  17              23
Did not change               2                  2                   0                   4               8
Total                       18                 12                   2                  22              54
The electrooculogram utilises an ergonomics workstation to collect data from which the number of eye blinks can be counted. The number of eye blinks increases as stress levels rise [31, 33]. Salivary cortisol is an adrenocortical hormone which rises as a result of the neuroendocrine response to a stressful situation and is widely used in nonsurgical studies. In surgery, Jezova et al. [36] confirmed that cortisol levels were higher during a work day for both junior and senior surgeons compared to a nonwork day, suggesting higher stress levels.
12.3.2 Subjective Measures of Stress

These involve the subject’s self-report of his or her stress levels, typically assessed through a questionnaire. In surgery, only a few studies have so far used a self-reported assessment of stress. Of those, some have asked subjects to report their stress using questionnaires [20, 27, 31, 37], whereas others have relied on interviews [1, 38]. The low penetration of such self-report tools in surgery is a significant problem in this literature, since such tools capture subjects’ subjective experience of stress – which, as discussed above, is a key component of stress in humans. A self-report measure of particular relevance to surgery is Spielberger’s State-Trait Anxiety Inventory (STAI), which also exists in a short, six-item form [39–41].
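As an illustration of how such questionnaire scores are derived, the six-item short-form STAI is commonly scored by summing item ratings of 1–4 (with the anxiety-absent items reverse-scored) and prorating the sum by 20/6 so that scores span the 20–80 range of the full 20-item scale. The item positions marked for reverse-scoring below are an assumption for illustration, not taken from the chapter.

```python
def stai6_score(responses, reverse_items=(0, 3, 4)):
    """Prorated short-form STAI score (range 20-80).

    Assumed convention (not from the chapter): six items rated 1-4,
    anxiety-absent items reverse-scored, and the sum prorated by 20/6
    so scores are comparable with the full 20-item STAI.
    """
    assert len(responses) == 6 and all(1 <= r <= 4 for r in responses)
    adjusted = [5 - r if i in reverse_items else r   # reverse-score calm items
                for i, r in enumerate(responses)]
    return sum(adjusted) * 20.0 / 6.0
```

With this convention, the least anxious response pattern scores 20.0 and the most anxious scores 80.0, matching the endpoints of the full-scale range.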
12.3.3 Combined Subjective and Objective Measures of Stress Very few surgical studies to date [31, 33, 34] have used both objective and subjective measures of stress. In a
recent study conducted by our research group (Arora et al., 2009), a multimodal stress assessment tool was developed and validated using 55 real cases. In this study, stress was assessed subjectively using the STAI scale (pre- and post-operatively), and objectively via salivary cortisol (pre- and post-operatively). In addition, participating surgeons were asked to wear a Polar HR monitor throughout the study. In the observed case set, 23/55 cases were deemed stressful – as defined by an increase in the STAI between pre- and post-operative administration. Movements of HR and cortisol against stress self-reported by surgeons can be seen in Table 12.1. Perfect agreement between subjective and objective indicators was obtained in 15/23 stressful cases and 17/23 non-stressful cases. In these cases, elevated STAI scores were associated with elevated HR and cortisol levels (stressful cases) and decreased STAI scores were associated with normal HR and decreased cortisol levels (non-stressful cases). The study demonstrated that in 70% of cases, the raised STAI was mirrored by an increase in both objective parameters. Thus, the rise in HR and cortisol found is likely due to subjects’ mental stress (rather than, for instance, physical demands of carrying out the procedure). Further analyses revealed that HR was more sensitive (91%) and cortisol more specific (91%) in picking up mental stress. Using objective and subjective assessments of stress appears to be feasible and informative.
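The sensitivity and specificity figures quoted above can be reproduced directly from the counts in Table 12.1. This is a worked check; the variable names are ours.

```python
# Counts transcribed from Table 12.1; keys are (heart rate, cortisol) movement.
stressful = {        # STAI elevated pre- to post-operatively (23 cases)
    ("hr_up", "cort_up"): 15, ("hr_up", "cort_down"): 6,
    ("hr_norm", "cort_up"): 1, ("hr_norm", "cort_down"): 1,
}
non_stressful = {    # STAI dropped pre- to post-operatively (23 cases)
    ("hr_up", "cort_up"): 1, ("hr_up", "cort_down"): 4,
    ("hr_norm", "cort_up"): 1, ("hr_norm", "cort_down"): 17,
}

# HR sensitivity: proportion of stressful cases in which HR was elevated.
hr_sensitivity = sum(v for (hr, _), v in stressful.items()
                     if hr == "hr_up") / sum(stressful.values())
# Cortisol specificity: proportion of non-stressful cases with dropped cortisol.
cort_specificity = sum(v for (_, c), v in non_stressful.items()
                       if c == "cort_down") / sum(non_stressful.values())

print(round(hr_sensitivity, 2), round(cort_specificity, 2))  # 0.91 0.91
```

Both quantities come out at 21/23 ≈ 91%, matching the figures in the text; likewise, perfect agreement of all three indicators holds in 15/23 stressful and 17/23 non-stressful cases, as reported.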
12.4 Part 3: Measures of Performance in Surgery

Surgical performance encompasses technical and non-technical aspects. The former cover the traditionally measured and assessed psychomotor skills of surgeons. The latter cover a set of behavioural skills (teamworking, communication and leadership) and cognitive skills (decision-making and situation awareness) that have been proposed as potential co-determinants (alongside technical skills) of surgical performance. In this section, we discuss direct and surrogate measures of technical performance, followed by assessment tools for non-technical skills.
12.4.1 Measures of Technical Performance

Measures of technical performance consist of dexterity parameters (e.g. economy of motion and time taken [42]) and indicators of quality of performance (e.g. OSATS-based global rating scales [43]), as well as task-specific measures (e.g. accuracy of stent placement). Studies that have examined surgical stressors have used a range of performance measures/markers. Time taken to complete a task [32, 34, 44, 45] and average operative time [25, 26] have both been used. In laparoscopic surgery, measures of performance include number of knots tied [31, 33], economy of motion [35, 44] and motion analysis using ICSAD [46]. Simulator-derived measures, especially error scores, have also been used: errors include inaccurate placement of an object [34, 46], dropping an object [34, 44, 46], blood loss [35, 45], vessels ripped [35] and tissue damage [44].

12.4.2 Measures of Non-Technical Performance

Table 12.2 summarises three of the most well-known measures of non-technical performance currently available for surgery. NOn-TECHnical Skill (NOTECHS) and non-technical skills for surgeons (NOTSS) assess performance at the level of the individual surgeon, whereas OTAS assesses performance within the surgical team (e.g. primary surgeon and assistant/camera holder). Of the available scales, only NOTECHS has been used in studies that have investigated surgical stress [45, 46, 55]. Communication and utterance frequency have also been used as a surrogate marker of non-technical performance [46].

Table 12.2 Non-technical performance assessment tools

Observational Teamwork Assessment for Surgery (OTAS)©
  Developer: Imperial [48–51]
  Development history: Developed for OR teams
  Skills assessed: Communication; cooperation/back-up behaviour; coordination; leadership; monitoring behaviour
  Clinical speciality: Surgical, anaesthetic, nursing
  Individual vs. team focus: Surgical, anaesthetic and nursing sub-teams

NOn-TECHnical Skill (NOTECHS)
  Developer: UoA [52]; Imperial [53, 54]
  Development history: Developed for aviation; revised for the OR
  Skills assessed: Communication/interaction; situation awareness; cooperation/team skills; leadership/managerial skills; decision-making
  Clinical speciality: Surgical, anaesthetic, nursing
  Individual vs. team focus: Individual team members

Non-technical skills for surgeons (NOTSS)©
  Developer: UoA [4, 55]
  Development history: Developed for surgeons
  Skills assessed: Communication/teamwork; leadership; situation awareness; decision-making
  Clinical speciality: Surgical
  Individual vs. team focus: Individual team members

Imperial, Imperial College London; UoA, University of Aberdeen
12.5 Part 4: Impact of Stress on Surgical Performance

There is conflicting evidence regarding stress levels in robotic vs. laparoscopic surgery. Studies that used skin conductance levels to measure stress found that stress was lower for robotic than for laparoscopic surgery [32, 34]. However, a study that used self-reported stress [56] found no difference between the two techniques in terms of mental stress (interestingly, both studies found that performance was worse for robotic than for laparoscopic surgery). In contrast, the evidence is consistent regarding stress in open vs. laparoscopic surgery: the former is less stressful than the latter [25, 31, 33]. In these studies, stress was assessed via self-report, skin conductance and eye blinks. Regarding performance, fewer knots were tied and operative time was longer [25] with laparoscopic surgery – thereby suggesting that increased stress affected performance negatively.

Expertise is also related to stress. Relevant studies suggest that experienced subjects have lower stress levels (as measured by HR [20, 23], HRV [25] and skin conductance, self-report and eye blinks [31]). These studies also show that experienced surgeons performed technically better than their less experienced counterparts. In another study, Moorthy et al. examined the effect of bleeding on the technical and non-technical performance of experienced and inexperienced surgeons. Although the researchers did not obtain direct measures of stress, existing evidence suggests that bleeding is a key surgical stressor [38]. This study found that experienced surgeons were significantly better at controlling the bleeding. Taken together, these studies suggest that with expertise, stress levels become lower and technical performance improves.
Studies that have investigated the effects of distractions/interruptions on technical performance [35, 44, 46] have shown that increased distractions correlate with poorer performance (increased task time, number of errors and poorer economy of motion) on difficult laparoscopic tasks. Moorthy et al. [46] also found that certain distractions, like noise in the OR and time pressure, were associated with significantly impaired dexterity and increased errors when compared with quiet conditions. Finally, some studies have examined the impact of multiple stressors on surgical performance [35, 45, 55]. In the study by Moorthy et al. [46] discussed above, multiple stressors (e.g. bleeding, time pressure and distractions) led to worse performance. Undre et al. [55] used bleeding as a stressor for surgeons and other stressors for the anaesthetic (e.g. difficult intubation) and nursing sub-teams (e.g. equipment missing from the tray and an inexperienced circulating nurse). Schuetz et al. [35] exposed their surgeons to multiple distractions and subsequently split them into three groups: those who experienced stress (measured via skin conductance) but did not recover, those who experienced stress but did recover, and those who did not experience stress. The researchers found that the group who experienced stress but recovered demonstrated better dexterity than either the group with stress and no recovery or the group with no stress.

Few studies have addressed issues relating to stress and non-technical skills in the OR. Moorthy et al. [45] used bleeding as a stressor and a revised version of NOTECHS for surgeons, and found no differences on the non-technical skill scales between their expert and novice subjects. Undre et al. [55] used different stressors for different members of the OR team and used NOTECHS to assess non-technical performance in surgeons, anaesthetists and nurses. These researchers found that leadership and decision-making were scored lower than other skills. Finally, two studies have used interviews to elicit surgeons’ own views about the effects of stress on their performance [1, 38]. Both studies showed that excessive stress leads to impaired communication, team working, judgement and decision-making.
12.6 Part 5: Discussion

This chapter aimed to provide an overview of surgical performance under stress. We explored the concept and measures of stress as applied to the domain of surgery, and discussed a range of tools to assess stress in the OR. Moreover, we discussed the tools that have been used to assess surgical performance, technical and non-technical, in studies that have put surgeons under stress so as to investigate its impact on their performance. Furthermore, we summarised some of the existing evidence on the impact of stressful conditions on surgeons in the OR and in surgical simulators.

The overall picture that emerges from these studies is that surgeons are exposed to a range of performance-debilitating conditions. Technical issues (e.g. bleeding, technically demanding procedures in laparoscopic surgery) are a key stressor – but not the only one. Distractions and disruptions also trigger significant levels of stress, as does lack of expertise. Importantly, whereas technical issues cannot always be predicted and expertise can only be acquired relatively slowly over a number of years in training, distractions are, at least in part, avoidable. In many ORs, levels of distraction and interruption to surgical work are not negligible, and some existing evidence suggests that they both cause stress and have a negative impact on surgical performance [47, 50]. Redesigning the work environment so that distraction is minimised is an option that could be considered in new hospitals. Alternatively, existing work processes could be assessed for their functionality and redesigned so that levels of disruption to the OR team through, for instance, external requests are kept to an unavoidable minimum.

More generally, the current review of the stressful conditions under which surgeons are often asked to operate [57] raises a number of issues in relation to surgical training. In addition, in the light of the evidence that we have reviewed, recommendations can be made in terms of the further research necessary to elucidate a number of issues. In what follows, we address these implications for training and research.
12.7 Implications for Surgical Training

Current surgical training programmes provide very little opportunity for surgeons to recognise and respond to stress before it becomes deleterious to their practice. Surgeons are typically taught to perform a procedure under routine circumstances despite the evidence that stressors do occur in the OR – a situation that is less than ideal for inexperienced surgeons. Senior surgeons have learnt from anecdotal experience how to deal with potentially stressful situations, but junior surgeons do not have this benefit. From a training perspective, this situation is neither safe nor acceptable. Preconditioning to stress (i.e. experiencing stress before facing it in real OR circumstances) “confers a well-documented influence on the cardiovascular response and alters a subject’s approach to psychological challenges” [22, 37]. In fact, it has been suggested that senior surgeons respond better to stressful conditions in the OR because they have been preconditioned through their experience [22].
Given the reduced training time for junior surgeons, other routes to stress preconditioning should be sought. Simulation-based training offers one potential route. Simulations provide a safe training environment, in which errors can occur safely and systematic feedback on technical and non-technical performance can be provided to trainees. Simulation-based crisis management modules have been widely used in the aviation industry as part of crew resource management training [58], and more recently they have been introduced into the training of junior anaesthetists. Lessons learnt in the context of the aviation industry and anaesthesia could be applied to surgery. Current evidence suggests that such modules are well received by trainers and trainees alike [45, 55]. How should a simulation-based stress training module be designed? Arora et al. [1] report a systematic investigation of what such a module should be based on. In detailed semi-structured interviews with expert surgeons, Arora et al. explored the key coping strategies used by surgeons in stressful situations and their requirements of an intervention to enhance their performance under stress. The most common coping strategy employed was internal to the surgeon (i.e. cognitive control of one’s own stress). A key element of the surgeons’ response was to stop what they were doing, so as to be able to stand back and re-assess the situation. According to one interviewee, a stressful situation “can lead to a cascade of further errors from that point. If you remain stressed, you start panicking and blaming others. Instead of trying to control and contain the situation, it starts cascading out of control which leads to further errors and ultimately a serious incident …” (s13).
Most of the interviewed surgeons (10/15) also mentioned that they used some form of pre-operative planning to minimise potential for intra-operative stress: “if you’re not prepared before you’re going to run, you’re guaranteed to run into troubles during the operation” (s6). This includes mental rehearsal (“I sort of play the operation each step backwards and forwards in my mind” (s6) ), contingency planning (“before you start, get one or two things ready in case serious things go wrong…then you have a get out clause, to help you out of trouble” (s7) ), and practicing crisis scenarios beforehand (“you have a game plan to deal with those scenarios” (C12) ). Working as a team was also highlighted as a crucial response: “create a rapport with theatre staff at all levels”
(s13); “if you’re out of your comfort zone, communicate this to others” (s7). This is important because, if the surgeon runs into difficulties, others “wake up and pay extra attention too” (c13). Timely acknowledgement of stress was very well put by one surgeon: “during a life threatening situation, I actually stop what I am doing and turn to talk to the scrub nurse. I look into her eyes and say this is a life threatening situation now, we’re in trouble here, we need as much help as possible … this is serious, I will be asking for a lot of things and may say them in a haphazard manner” (s1). Arora et al.’s study also examined in detail the components of a stress training module that surgeons would find useful. These are summarised in Table 12.3. Two main categories of such a module were identified: first, components that would reduce the stress and, second, components that would improve surgeons’
ability to manage stress. Factual information about stress, cognitive training, team training and individualised feedback should be provided as part of stress training. Realistic simulation-based training, with individualised feedback on technical and non-technical skills, was favoured by most participating surgeons. Finally, most participants’ view was that the training should be targeted at Higher Surgical Trainees/Residents.
Table 12.3 Components of a stress training intervention (number of surgeons who mentioned each component, n = 15; components are listed from most to least important)

Cognitive training (11)
  Raising awareness of own coping method
  Technique to improve coping
  Learning from expert
Technical training (11)
  Provide experience
  Use simulation
  Specific skills, i.e. when to call for help, control bleeding
Measuring outcomes (12)
  Objective marker of stress
  Effect of stress on performance
  Performance before and after intervention
Feedback (13)
  Feedback and debrief after scenario
  Videos
  Opportunity for reflection
Team training (8)
  Dealing with difficult colleague
  Briefing
  Improving communication
  Managing others and the environment when under stress

12.8 Future Research Agenda

The empirical evidence on stress and surgical performance is rather sparse, and the studies are heterogeneous in their stress-inducing conditions and the measurement instruments used. Further research should systematically and quantitatively investigate the consequences of excessive levels of stress in the OR (as seen in a crisis situation) on surgical performance – thereby contributing to the delineation of effective stress management strategies and training needs at different levels of expertise. This research should also address whether increasing experience in a procedure reduces stress and, subsequently, enhances performance (i.e. quantitative assessment of the relevant learning curves for performance and stress). Moreover, instruments and tools need to be developed and validated if reliable evidence is to be collected on the human factors side of stress. To assess stress, both subjective (i.e. self-report) and objective (i.e. physiological) measures are necessary, as these capture the subjective experience of and bodily response to stressful OR conditions. Comprehensiveness and robustness in assessment should be balanced against simplicity and ease of use in the real OR context to facilitate real-world application. Furthermore, the impact of stress on non-technical skills remains largely unexplored. These skills are likely to deteriorate as a result of excessive stress. Conversely, adequate training in non-technical skills (such as effective team work and mental readiness) may prove an effective coping strategy. Both these questions should be addressed empirically in further research. In addition, although excessive stress can compromise performance, a small amount of it can aid concentration and alertness, as is evident in the Yerkes–Dodson law [17]. Research should seek to determine where this optimal level of stress lies for surgery and investigate how well surgeons actually cope with stress (in addition to just assessing stress levels across procedures, levels of expertise, etc.).
Determining optimal levels of stress for novices and experts can form the basis for training programmes designed to reduce the deleterious effects of stress on surgical practice, ultimately enhancing the quality and safety of patient care.

Acknowledgements
This chapter is based on an ongoing research programme on the safety implications of surgical stressors that is being carried out by our research group. Dr. Roger L. Kneebone has had an instrumental role in the shaping and development of this work over a number of years. The authors would like to thank the BUPA Foundation and the Economic and Social Research Council (ESRC) Centre for Economic Learning and Social Evolution for providing funding for the work reported in this chapter.
References
1. Arora S, Sevdalis N, Nestel D et al (2009) Managing intraoperative stress: what do surgeons want from a crisis training programme? Am J Surg 197:537–543
2. Department of Health (2008) A high quality workforce: NHS next stage review. Department of Health, London
3. Aggarwal R (2004) Surgical education and training in the new millennium. Surg Endosc 18:1409
4. Yule S, Flin R, Paterson-Brown S et al (2006) Non-technical skills for surgeons in the operating room: a review of the literature. Surgery 139:140–149
5. Kohn LT, Corrigan JM, Donaldson MS (eds) (2000) To err is human: building a safer health system. National Academy Press, Washington, DC
6. Vincent C, Neale G, Woloshynowych M (2001) Adverse events in British hospitals: preliminary retrospective record review. BMJ 322:517–519
7. Vincent C, Moorthy K, Sarker SK et al (2004) Systems approaches to surgical quality and safety: from concept to measurement. Ann Surg 239:475–482
8. Selye H (1973) The evolution of the stress concept. Am Sci 61(6):692–699
9. Everly GS Jr, Lating JM (2002) A clinical guide to the treatment of the human stress response, 2nd edn. Kluwer Academic/Plenum Publishers, New York
10. Lazarus RS (1966) Psychological stress and the coping process. McGraw-Hill, New York
11. Selye H (1978) The stress of life, revised edn. McGraw-Hill, Oxford
12. Lazarus RS (1991) Emotion and adaptation. Oxford University Press, New York
13. Lazarus RS, Folkman S (1984) Stress, appraisal and coping. Springer, New York
14. Lazarus RS, Folkman S (1986) Cognitive theories of stress and the issue of circularity. In: Dynamics of stress: physiological, psychological, and social perspectives. Plenum Press, New York, pp 63–80
15. Lazarus RS (1993) From psychological stress to the emotions: a history of changing outlooks. Annu Rev Psychol 44:1–21
16. Lazarus RS (1993) Coping theory and research: past, present, and future. Psychosom Med 55:234–247
17. Yerkes RM, Dodson JD (1908) The relation of strength of stimulus to rapidity of habit formation. J Comp Neurol Psychol 18:459–482
18. Becker W, Ellis H, Goldsmith R et al (1983) Heart rates of surgeons in theatre. Ergonomics 26:803–807
19. Czyzewska E, Kiczka K, Czarnecki A et al (1983) The surgeon's mental load during decision making at various stages of operations. Eur J Appl Physiol Occup Physiol 51:441–446
20. Kikuchi K, Okuyama K, Yamamoto A et al (1995) Intraoperative stress for surgeons and assistants. J Ophthalmic Nurs Technol 14:68–70
21. Payne R, Rick J (1986) Heart rate as an indicator of stress in surgeons and anaesthetists. J Psychosom Res 30:411–420
22. Tendulkar AP, Victorino GP, Chong TJ et al (2005) Quantification of surgical resident stress "on call". J Am Coll Surg 201(4):560–564
23. Yamamoto A, Hara T, Kikuchi K et al (1999) Intraoperative stress experienced by surgeons and assistants. Ophthalmic Surg Lasers 30:27–30
24. Task Force of the European Society of Cardiology and the North American Society of Pacing and Electrophysiology (1996) Heart rate variability: standards of measurement, physiological interpretation and clinical use. Circulation 93:1043–1065
25. Bohm B, Rotting N, Schwenk W et al (2001) A prospective randomized trial on heart rate variability of the surgical team during laparoscopic and conventional sigmoid resection. Arch Surg 136:305–310
26. Demirtas Y, Tulmac M, Yavuzer R et al (2004) Plastic surgeon's life: marvelous for mind, exhausting for body. Plast Reconstr Surg 114:923–931; discussion 932–933
27. Bootsma M, Swenne CA, Van Bolhuis HH et al (1994) Heart rate and heart rate variability as indexes of sympathovagal balance. Am J Physiol 266:H1565–H1571
28. Pagani M, Furlan R, Pizzinelli P et al (1989) Spectral analysis of R-R and arterial pressure variabilities to assess sympatho-vagal interaction during mental stress in humans. J Hypertens Suppl 7:S14–S15
29. Yamamoto Y, Hughson RL, Peterson JC (1991) Autonomic control of heart rate during exercise studied by heart rate variability spectral analysis. J Appl Physiol 71:1136–1142
30. Boucsein W (1992) Electrodermal activity. Plenum Press, New York
31. Berguer R, Smith WD, Chung YH (2001) Performing laparoscopic surgery is significantly more stressful for the surgeon than open surgery. Surg Endosc 15:1204–1207
32. Berguer R, Smith W (2006) An ergonomic comparison of robotic and laparoscopic technique: the influence of surgeon experience and task complexity. J Surg Res 134:87–92
33. Schuetz M, Gockel I, Beardi J et al (2007) Three different types of surgeon-specific stress reactions identified by laparoscopic simulation in a virtual scenario. Surg Endosc
34. Smith WD, Chung YH, Berguer R (2000) A virtual instrument ergonomics workstation for measuring the mental workload of performing video-endoscopic surgery. Stud Health Technol Inform 70:309–315
35. Smith WD, Berguer R, Rosser JC Jr (2003) Wireless virtual instrument measurement of surgeons' physical and mental workloads for robotic versus manual minimally invasive surgery. Stud Health Technol Inform 94:318–324
36. Jezova D, Slezak V, Alexandrova M et al (1992) Professional stress in surgeons and artists as assessed by salivary cortisol. Gordon & Breach Science Publishers, Philadelphia
37. Kelsey RM, Blascovich J, Tomaka J et al (1999) Cardiovascular reactivity and adaptation to recurrent psychological stress: effects of prior task exposure. Psychophysiology 36:818–831
38. Wetzel CM, Kneebone RL, Woloshynowych M et al (2006) The effects of stress on surgical performance. Am J Surg 191:5–10
39. Spielberger CD, Gorsuch RL, Lushene RE (1970) STAI manual. Consulting Psychologists Press, Palo Alto
40. Tanida M, Katsuyama M, Sakatani K (2007) Relation between mental stress-induced prefrontal cortex activity and skin conditions: a near-infrared spectroscopy study. Brain Res 1184:210–216
41. Spielberger CD, Gorsuch RL, Lushene RE (1970) STAI manual. Consulting Psychologists Press, Palo Alto
42. Aggarwal R, Grantcharov T, Moorthy K et al (2006) A competency-based virtual reality training curriculum for the acquisition of laparoscopic psychomotor skill. Am J Surg 191:128–133
43. Grantcharov TP, Bardram L, Funch-Jensen P et al (2002) Assessment of technical surgical skills. Eur J Surg 168:139–144
44. Hassan I, Weyers P, Maschuw K et al (2006) Negative stress-coping strategies among novices in surgery correlate with poor virtual laparoscopic performance. Br J Surg 93:1554–1559
45. Moorthy K, Munz Y, Forrest D et al (2006) Surgical crisis management skills training and assessment: a simulation-based approach to enhancing operating room performance. Ann Surg 244:139–147
46. Moorthy K, Munz Y, Dosis A et al (2003) The effect of stress-inducing conditions on the performance of a laparoscopic task. Surg Endosc 17:1481–1484
47. Sevdalis N, Lyons M, Healey AN et al (2008) Observational teamwork assessment for surgery© (OTAS©): construct validation with expert vs. novice raters. Ann Surg 249:1047–1051
48. Undre S, Healey AN, Darzi A et al (2006) Observational assessment of surgical teamwork: a feasibility study. World J Surg 30:1774–1783
49. Undre S, Sevdalis N, Healey AN et al (2007) Observational teamwork assessment for surgery (OTAS): refinement and application in urological surgery. World J Surg 31:1373–1381
50. Undre S, Sevdalis N, Vincent CA (2009) Observing and assessing surgical teams: the observational teamwork assessment for surgery© (OTAS©). In: Flin R, Mitchell L (eds) Safer surgery: analysing behaviour in the operating theatre. Ashgate, Aldershot
51. Flin R, Maran N (2004) Identifying and training non-technical skills for teams in acute medicine. Qual Saf Health Care 13(Suppl 1):i80–i84
52. Moorthy K, Munz Y, Adams S et al (2005) A human factors analysis of technical and team skills among surgical trainees during procedural simulations in a simulated operating theatre. Ann Surg 242:631–639
53. Sevdalis N, Davis R, Koutantji M et al (2008) Reliability of a revised NOTECHS scale for use in surgical teams. Am J Surg 196(2):184–190
54. Yule S, Flin R, Maran N et al (2008) Surgeons' non-technical skills in the operating room: reliability testing of the NOTSS behavior rating system. World J Surg 32:548–556
55. Undre S, Koutantji M, Sevdalis N et al (2007) Multidisciplinary crisis simulations: the way forward for training surgical teams. World J Surg 31:1843–1853
56. Lee EC, Rafiq A, Merrell R et al (2005) Ergonomics and human factors in endoscopic surgery: a comparison of manual vs telerobotic simulation systems. Surg Endosc 19:1064–1070
57. Sevdalis N, Arora S, Undre S et al (2009) Surgical environment: an observational approach. In: Flin R, Mitchell L (eds) Safer surgery: analysing behaviour in the operating theatre. Ashgate, Aldershot
58. Helmreich RL, Merritt AC, Wilhelm JA (1999) The evolution of crew resource management training in commercial aviation. Int J Av Psych 9:19–32
13 How can we Assess Quality of Care in Surgery?

Erik Mayer, Andre Chow, Lord Ara Darzi, and Thanos Athanasiou
Abbreviations

NHS National Health Service
U.K. United Kingdom
U.S. United States

Contents

13.1 Introduction
13.2 Quality of Care
13.2.1 Defining Quality of Care
13.2.2 How Should we Assess Quality of Care?
13.3 Measuring Quality of Care
13.3.1 Structural Variables
13.3.2 Process Measures
13.3.3 Outcome Measures
13.3.4 Health care Economics
13.4 Benchmarking Quality of Care
13.4.1 Current Initiatives
13.4.2 Pay for Performance Strategies
13.4.3 Future Direction
13.5 Public Health Implications
13.6 How to Design Health Care Quality Reforms
13.7 How to Achieve Health Care Quality Improvement
13.8 Conclusions
References
Abstract This chapter explores and outlines existing research in the area of quality of care and identifies methods by which future research should be conducted. Before we can assess quality of care, we must first be able to define it, although this in itself is complicated by the complexity of the interacting factors that determine quality health care. Its characterisation by a conceptual model of structure-process-outcome, and the importance of health economics, is discussed. The proposed attributes of health care which can define its quality are also presented. Existing initiatives that benchmark quality of care tend to be generic and give us only an indication of minimum standards. The components of future assessment tools for quality of care are proposed, along with how "frontline" quality improvement can be facilitated by conceptual frameworks for designing health system reforms and by engaging contemporary managerial capabilities.
13.1 Introduction
E. Mayer ()
Department of Biosurgery and Surgical Technology, Imperial College London, 10th Floor, QEQM Building, St Mary's Hospital Campus, Praed Street, London, W2 1NY, UK
e-mail: [email protected]

The provision of high quality care is the universal aim of any health care system and those that work within it. When the health service is working at its best, it can provide excellent care to our patients, and it is well recognised that high quality care can lead to high
quality results. However, this is not always achieved. There exist wide variations in the quality of surgical care provision. This variation occurs between countries, regions, hospitals, departments and surgeons. The delicate interaction of multiple factors at numerous stages of a patient's care pathway means that any single suboptimal episode can result in a cascade effect on the overall quality of care, to the detriment of the person who matters most, our patient. It is therefore imperative that there is a focus on continually improving the quality of care that is delivered.

In order to do this, the quality of care must first be defined so that its integral components are understood. Only then can methods to assess and measure quality of care be developed. Simultaneously, the very act of measuring care can be used to benchmark standards and drive continuing improvement.

By showing that high quality care is provided, the health care system can demonstrate that it is doing its best to ensure the patient's well-being and continuing health. Demonstrating high quality care can boost confidence in the health system for both patients and clinicians. It helps to highlight areas that are performing well, and also areas that need more attention. It helps to direct funds and resources towards the areas of care that need them most. Despite all of the universally accepted benefits of determining the quality of care that our patients receive within surgery, there is currently no universally accepted and/or validated measurement system available.
13.2 Quality of Care

13.2.1 Defining Quality of Care

The Institute of Medicine in the United States (U.S.) defines quality of care as: " … the degree to which health services for individuals and populations increase the likelihood of desired health outcomes and are consistent with current professional knowledge" [1].

The American Medical Association defines high quality care as: "[that] which consistently contributes to the improvement or maintenance of quality and/or duration of life" [2].
Or the definition can incorporate a patient-orientated emphasis such as that of BUPA Hospitals United Kingdom (U.K.) [3]:
" … ability to provide the service you want and need resulting in medical treatment you can rely on and personal care you'll appreciate".
It is clear from these definitions that the term "quality care" may imply different things to clinicians and to patients. From the clinician's point of view, high quality care means up-to-date, evidence-based patient care that results in improved clinical outcomes. Although this is also important to the patient, they may be more concerned with aspects of care such as availability, flexibility and reliability, and with personal touches such as the politeness and empathy of medical staff. The term "quality of care" can, therefore, be defined broadly, to represent an overall impression of the delivery of health care, yet it equally requires some very specific and agreed measures of the treatment process or the outcomes achieved. This makes it a complex entity to encompass, requiring an ordered approach.
13.2.2 How Should we Assess Quality of Care? Although current definitions of quality of care are applicable, they are deliberately vague and therefore of limited use in defining the assessment of quality of care. Although it can be obvious when high quality care is being provided, providing objective proof of this can be more challenging. The inherent flexibility of health care provision must also be considered; with innovation in medical technology and treatments, optimum care standards and therefore the markers of the quality of care evolve. It is therefore easy to see why standardising the assessment of quality of care even at a procedural level can be problematic. Quality of care assessment should include every aspect of a patient’s journey through their health care system. This would encompass community care, screening where applicable, referral to a specialist, processes of investigation, diagnosis and treatment. Also needed are details of post-operative management and follow-up, both in the hospital and in the community. In other words, the assessment of quality should be multifactorial. It is clear that there are countless variables which could be measured. How then to either measure all of them or identify the most pertinent ones? In 1966, Donabedian divided quality of care into three tangible parts: structure, process and outcome [4] (Fig. 13.1). Structure is concerned with the actual infrastructure of the health care system. This includes aspects
Fig. 13.1 As defined by Donabedian, quality of care emerges from the interaction of three key elements: structure, process and outcome
such as the availability of equipment, availability and qualifications of staff and administration. Process looks at the actual details of care including aspects from diagnostic tests, through to interventions such as surgery and continuity of care. Outcome looks at the end result of medical care, traditionally in the form of survival, and restoration of function. This model has been widely applied to the assessment of quality of care.
A fourth element can be proposed for this conceptual model: health care economics, or its dependent measure, productivity. In modern medicine, the availability of financial resources and any resulting financial constraints can affect the accessibility and delivery of health care services, and therefore potentially the provision of high quality care. This is particularly true of publicly funded systems, such as the U.K.'s National Health Service (NHS). The ability of a health care provider to deliver quality care while operating within financial or resource restrictions is an important factor that must be considered. The notion of cost being associated with quality of care is, however, not novel: following on from his structure-process-outcome paradigm, Donabedian described several attributes of health care which define its quality, the "seven pillars of quality". One of these, "efficiency", relates to the "ability to obtain the greatest health improvement at the lowest cost" [5]. Although Donabedian introduces the concept of cost within his quality attributes, business planning so influences the organisation of a contemporary health care service that it directly shapes the care delivered. For this reason, a strong argument can be made for health care economics to be included in a conceptual model of quality of care. It could equally be argued that health care economics forms part of structure and therefore does not require separate attention. As described above, Donabedian proposes "seven pillars" or attributes of health care which can define its quality (Fig. 13.2). He went on to describe 11 principles of quality assurance, which are essential for
Fig. 13.2 The seven pillars of quality as defined by Donabedian. Adapted from reference [5]
1. Efficacy: the ability of care, at its best, to improve health
2. Effectiveness: the degree to which attainable health improvements are realised
3. Efficiency: the ability to obtain the greatest health improvement at the lowest cost
4. Optimality: the most advantageous balancing of costs and benefits
5. Acceptability: conformity to patient preferences regarding accessibility, patient-doctor relationship, amenities, effects of care and cost of care
6. Legitimacy: conformity to social preferences concerning all of the above
7. Equity: fairness in the distribution of care and its effects on health
the design, operation and effectiveness of care [6]. The complexity of quality of care as an entity makes its assessment particularly challenging. Donabedian's structure-process-outcome model acts as a suitable foundation and has been widely applied to the assessment of quality of care. Paying attention to each of these determinants of quality of care will allow us to build a framework consisting of a number of components that are representative of the quality of care that our patients receive.
13.3 Measuring Quality of Care

13.3.1 Structural Variables

The structure of surgical care can be thought of as the "bare bricks" or infrastructure of care. It is concerned with details such as equipment, number of beds, nurse-to-patient ratios, qualifications of medical staff and administrative structure. The intuition is that if surgery occurs in a high quality setting, high quality care should follow. An advantage of measuring structural variables is that the information required is usually fairly reliable and is used frequently at hospital managerial board level. It is, however, infrequently used in a more clinical and/or public domain to help inform the environment in which surgical care is delivered. Logically, though, we need to establish what correlation exists between these structural variables and quality of care, and this is not well established. Brook et al. assessed the relationship between patient, physician and hospital characteristics and the appropriateness of intervention for carotid endarterectomy, coronary angiography and upper gastrointestinal endoscopy [7]. They concluded that the appropriateness of care could not be reliably predicted from standard, easily obtainable data about the patient, the physician or hospital structural variables. However, coronary angiography and carotid endarterectomy were significantly more likely to be carried out for medically appropriate reasons if performed in a teaching hospital. Hospital teaching status and other associated hospital variables such as size or for-profit status did not, however, translate into lower post-operative complication or death rates following carotid endarterectomy [8].
A structural variable that has received more attention than most is institutional or surgeon volume: the volume-outcome relationship. In this scenario, the volume of patients treated is used as a proxy for the quality of care, and the correlation with important clinical outcome measures is then determined. On the basis of a large number of studies that show better outcomes for patients treated at high-volume institutions and/or by high-volume surgeons, we are seeing a trend of preferential patient referral to high-volume institutions. Promoters of this centralisation of services in the U.S., such as the Leapfrog group [9], argue that it is important in helping to advance the quality of health care. Similarly in the U.K., centralisation of oncological services is identified in the Department of Health's Improving Outcomes Guidance framework [10]. Institutional and surgeon volume, either independently or in combination, are nevertheless rather broad proxy measures for quality of care. Indeed, some low-volume providers have excellent outcomes and some high-volume providers poor outcomes. Research in this area has consequently begun to examine the core factors that determine whether or not an institution or surgeon produces better outcomes. Figure 13.3 illustrates potential structural variables for which volume acts as a proxy measure and which may therefore better inform us of quality of care. In order to determine which structural variables potentially have the most influence on the quality of care, we first need to determine whether correlation exists between them and some dependent endpoint; research to date has used clinical outcome measures. Elixhauser et al. [11] demonstrated the importance of the ratio of doctors and nurses per bed, irrespective of institutional volume, on mortality rates for paediatric heart surgery. A systematic review published by Pronovost et al.
[12] also showed that high-intensity intensive care unit (ICU) physician staffing was associated with reduced hospital and ICU length of stay and mortality. Treggiari et al. [13] demonstrated in a multi-centre study that ICUs that were run by, or associated with, a specialised intensivist had significantly lower mortality rates in patients with acute lung injury (odds ratio = 0.68; 95% CI 0.52–0.89). This association was independent of severity of illness and consultation by a respiratory physician. As we better understand the core structural variables that correlate with markers of outcome, it will enable us to then assess the degree to which they also
Fig. 13.3 Examples of potential structural variables which influence quality of care. The variables shown include: workforce (staff-to-patient ratio, seniority-to-junior ratios, locum agency spend); institution status (teaching/specialist, university affiliated, community, urban/rural, research and development); finance (annual net I&E surplus, trust income per spell, monthly run rate variability, productivity); activity (elective and acute bed occupancy rates, theatre utilisation in main theatres and the day surgery unit); and resources (ITU, availability of diagnostics, on-site speciality skill-mix, technology implementation)
influence the quality of care that our patients receive. It is not unrealistic to imagine that integration of the structural variables, demonstrated to improve quality of care, into institutions irrespective of their caseload volume could further our aim of achieving equality of outcomes for all.
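Associations between structural variables and outcomes, such as the odds ratio reported by Treggiari et al. above (0.68; 95% CI 0.52–0.89), are typically estimated from a 2×2 table of exposure against outcome. As a rough sketch of the underlying arithmetic (the counts below are invented for illustration and are not taken from any of the studies cited):

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Woolf (log-based) 95% CI for a 2x2 table:
                       died    survived
    intensivist-led     a         b
    not intensivist-led c         d
    """
    or_ = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log(OR)
    lower = math.exp(math.log(or_) - z * se_log)
    upper = math.exp(math.log(or_) + z * se_log)
    return or_, lower, upper

# Hypothetical counts for illustration only
or_, lo, hi = odds_ratio_ci(30, 170, 50, 190)
print(f"OR = {or_:.2f}, 95% CI {lo:.2f}-{hi:.2f}")
```

With these made-up counts the interval crosses 1, which is a reminder of why point estimates should always be read together with their confidence intervals.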
13.3.2 Process Measures

In surgery, process of care can be thought of as the pre-operative, intra-operative and post-operative management of a patient. It looks at what is actually done for, and to, the patient. For example, it can look at the availability of screening programmes, the appropriate use of diagnostic tests, waiting times to operation, discharge processes, post-operative follow-up and the availability of, and willingness to give, adjuvant treatments. However, the measured processes are only useful if there is evidence to prove that they translate into improved patient care. There is little point, for instance, in ensuring that all patients prior to laparoscopic cholecystectomy have an MRI, when no clinical benefit will
be gained. Malin et al. [14] assessed the quality of care for breast and colorectal cancer in the United States. They reviewed existing quality indicators, guidelines, review articles and peer-reviewed clinical trials to produce a list of explicit quality measures evaluating the management of newly diagnosed breast and colorectal cancer patients. These were then checked for validity by a panel of experts, and included areas such as diagnostic evaluation, surgery, adjuvant therapy, management of treatment toxicity and post-treatment surveillance. Data were extracted from the patients' notes and via patient questionnaires. Overall adherence to the 36 and 25 quality measures specific to the process of care was 86% (95% CI 86–86%) for breast cancer and 78% (95% CI 77–79%) for colorectal cancer patients, respectively. What was unique in this approach was that the group was not trying to correlate process with outcomes, but simply looking at process measures that were agreed, in this instance by evidence base and expert review, to reflect quality of care. There are a number of potential benefits to the measurement of process, as opposed to outcomes, in assessing the quality of care. Lilford et al. [15] describe these in detail, but in brief: process measures are less
susceptible, although not exempt, to case-mix bias; their assessment and subsequent improvement will positively reflect on the entire evaluated institutional patient population, as opposed to a few outliers with poor outcomes; deficiencies in processes of care are less likely to be seen as a label of poor performance, and instead indicate when and how improvement can be made; and process measures reflect the current state of care, as opposed to the time delay experienced with some outcome measures. Process measurement is not easy. It is difficult to standardise measurements for process of care in surgery, as the process varies depending upon the surgical pathology. Creating a standard quality measure for all surgery may be impossible. It is more feasible to create measures of process of care for specific pathologies. Examples of where best practice has been defined include diseases with published national guidelines, such as those produced by the National Institute for Clinical Excellence for cancer of the breast and lung. In the absence of agreed national guidelines, ongoing clinically based research will help us to define evidence-based processes that improve quality of care. The appropriate use of surgical services is an important process measure in that it not only acts as a very good measure of the quality of care that a patient receives, but also has repercussions for the use of health care resources and the economics of health care. For a surgical intervention to be appropriate, it must "have a health benefit that exceeds its health risk by a sufficiently large margin to make the intervention worth doing". The RAND Corporation has published extensively on the topics of the overuse and underuse of health care services. One of their largest studies examined the appropriateness of coronary angiographies, carotid endarterectomy and upper gastrointestinal endoscopy across multiple regions of the United States [16].
It found that 17, 32 and 17% of these procedures, respectively, were performed for inappropriate clinical reasons (i.e. overuse). Extrapolation from the literature may indicate that one quarter of hospital days, one quarter of surgical procedures and two fifths of medications are inappropriately overused [17]. Measuring the process of care can, however, be incredibly labour- and time-intensive and will require significant clinical knowledge. There will be a multitude of measurements, which can either be obtained prospectively, or gleaned retrospectively from patient notes. The introduction of electronic coding of patient
records may make this an easier task in the future. A recent Cochrane review did find that process measurements in the form of audit are effective in improving clinical practice [18]. However, the costs of measuring process will ultimately have to be weighed against the patient benefit that is gained from any actions taken as a result of those measurements. In summary, there is no doubt that for the majority of surgical conditions, measuring the process of care will provide us with substantive and up-to-date indicators of quality of care and will be directly influenced by the functionality of a health care provider.
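Adherence statistics like those reported by Malin et al. (86% for breast cancer, 78% for colorectal cancer) are simple proportions of adherent care events among eligible ones, reported with a binomial confidence interval. A minimal sketch using the Wilson score interval (the counts below are invented, not drawn from that study):

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score 95% CI for a proportion (e.g. adherent events / eligible events)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return p, centre - half, centre + half

# e.g. 780 adherent care events out of 1,000 eligible (hypothetical numbers)
p, lo, hi = wilson_ci(780, 1000)
print(f"adherence = {p:.0%}, 95% CI {lo:.1%}-{hi:.1%}")
```

The width of such an interval depends mainly on the number of eligible events audited, which is one reason process audits need reasonably large samples before differences between providers can be read as real.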
13.3.3 Outcome Measures Traditionally, quality of care has been judged on outcome measures using surrogate endpoints such as mortality and morbidity. They are used because they are easy to measure and recorded with regularity. For an outcome measure to be a valid test of quality, it must be linked to and correlate with known processes that when changed will accordingly alter that outcome measure. For example, knowing the number of patients who present with metastases within six months of diagnosis with inoperable liver cancer is an important prognostic outcome. However, it is not a compelling measure of quality, as our ability to influence it is limited. There are many advantages of using outcomes as a measure of quality of care. Outcomes are wellestablished as an important feature of quality. They can be viewed as the overall effect of care on a patient. Few would doubt the validity of outcomes such as mortality in judging surgical care. Statistics such as mortality rates are understandable at face value, including to the lay person. Consequently there is a natural tendency to rank hospitals according to outcome measures such as mortality rates, with an implied association with quality of care. Examples of organisations that produce rankings according to outcome measures such as mortality rates include the Leapfrog group [9] and the U.S. News “Americas Best Hospitals” in the U.S. [19], as well as Dr Foster Intelligence which produces the “Good Hospital Guide” [20] and the Health care Commission in the U.K. [21]. The use, however, of solely outcome measures as an indicator of quality of care can be gravely misleading and inappropriate. Outcomes, as implied earlier,
13
How can we Assess Quality of Care in Surgery?
are the “end result” of an entire patient pathway and are thus reliant on numerous other variables. The outcome measure itself may not, therefore, directly correlate with the quality of care. For example, if a patient who has undergone an operation to remove a colorectal cancer dies 1 year after surgery, can we say that he or she has had poor quality of care? He or she may have received all the best treatments available as guided by the latest evidence-based medicine and, despite all of this, died. Should this patient’s death be assumed to indicate poor quality of care from the surgical team and allied health care professionals? Sometimes the best quality of care in surgery may still result in mortality through circumstances beyond our control. Equally, the patient may have received the best quality of care throughout their hospital admission, but poor follow-up surveillance and delays in adjuvant treatments may have affected the final outcome. Other factors, such as the natural history of the disease and the patient’s age and co-morbidities, often have much larger influences on outcome than surgical care. The method of risk adjustment attempts to compensate for the difference in case-mix between surgical centres. However, the effect of case-mix can never be completely eradicated. Firstly, risk adjustment cannot allow for variables that are unmeasured or not known. Neither can it adjust for the effects of varying definitions (such as the definition of a surgical site infection) between centres. Risk adjustment can also increase bias if the risk associated with a measured factor is not uniform across the compared populations. This is why, even after adjusting mortality rates for risk, the relationship between quality of care and outcomes such as mortality is inconsistent. Pitches et al. examined the relationship between risk-adjusted hospital mortality rates and quality or processes of care [22].
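The mechanics of risk adjustment can be made concrete. One common implementation (not necessarily the one used in the studies cited here) is the observed-to-expected (O/E) mortality ratio: each patient's predicted risk of death is estimated from case-mix variables, the predictions are summed to give the deaths "expected" at that centre, and observed deaths are divided by that sum. A minimal sketch, with invented patient risks:

```python
# Risk adjustment via an observed-to-expected (O/E) mortality ratio.
# A centre's expected deaths are the sum of per-patient predicted
# risks (e.g. from a model of age, comorbidity, urgency).
# All numbers below are hypothetical, for illustration only.

def oe_ratio(observed_deaths, predicted_risks):
    """O/E ratio: observed deaths divided by the sum of per-patient risks."""
    expected = sum(predicted_risks)
    return observed_deaths / expected

# Centre A operates on sicker patients (higher predicted risks) ...
centre_a_risks = [0.10, 0.20, 0.30, 0.25, 0.15]   # expected ~1.0 death
# ... Centre B on fitter patients.
centre_b_risks = [0.02, 0.04, 0.05, 0.05, 0.04]   # expected ~0.2 deaths

# Both centres observe one death in five patients (crude rate 20%),
# but risk adjustment separates them sharply:
print(round(oe_ratio(1, centre_a_risks), 2))  # 1.0 -> deaths as expected
print(round(oe_ratio(1, centre_b_risks), 2))  # 5.0 -> far more than expected
```

The caveats in the text still apply: an O/E ratio is only as good as the risk model behind it, and unmeasured confounders or inconsistent definitions survive the adjustment.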
Pitches and colleagues found a positive correlation between better quality of care and risk-adjusted mortality in under half of the papers examined, while the others showed either no correlation or a paradoxical correlation. Similarly, Hofer and Hayward [23] found that the sensitivity for detecting poor-quality hospitals from their risk-adjusted mortality rates was low at only 35%, while the positive predictive value was only 52%. This work has been corroborated with similar models by Zalkind and Eastaugh [24] and Thomas and Hofer [25]. Traditional outcome measures also fail to appreciate important patient-specific measures such as quality of life. Recently, the National Institute for Clinical Excellence has weighed the quality-adjusted life year gained against the financial outlay when deciding whether to recommend the use of novel oncological medications such as Herceptin [26]. There has also been increasing interest in measuring national patient-reported outcome measures (PROMs) as a way of measuring health care performance [27]. In the U.K., pilot studies have been completed and the final report on PROMs is due shortly. Outcomes remain an important method of quality assessment, despite their significant limitations around risk adjustment. The use of outcome measures in isolation is clearly inappropriate. To improve the use of outcomes as a measure of quality, a multidimensional approach should be encouraged, including not only traditional measures such as mortality and morbidity, but also patient-reported outcomes such as quality of life and pain scores. This will help us to include more patient-centred measures in a currently clinically dominated quality of care assessment.
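Figures such as Hofer and Hayward's 35% sensitivity and 52% positive predictive value follow from ordinary screening-test arithmetic, applied to hospitals rather than patients. A minimal sketch; the confusion-matrix counts below are invented to reproduce similar percentages and are not taken from their study:

```python
# Treating "high risk-adjusted mortality" as a screening test for
# "truly poor-quality hospital". The counts are hypothetical.

def sensitivity(tp, fn):
    """Of the truly poor-quality hospitals, what fraction is flagged?"""
    return tp / (tp + fn)

def ppv(tp, fp):
    """Of the flagged hospitals, what fraction is truly poor quality?"""
    return tp / (tp + fp)

tp = 13   # poor-quality hospitals correctly flagged by high mortality
fn = 24   # poor-quality hospitals missed (mortality looked acceptable)
fp = 12   # good hospitals wrongly flagged (e.g. unfavourable case-mix)

print(f"sensitivity = {sensitivity(tp, fn):.0%}")  # 35%
print(f"PPV         = {ppv(tp, fp):.0%}")          # 52%
```

With numbers like these, most poor-quality hospitals escape detection, and nearly half of the flagged hospitals are false alarms, which is exactly why mortality-based rankings can mislead.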
13.3.4 Health care Economics

High quality of care invariably requires significant resources. A limitation on available resources or financial constraint can be an inhibitor to producing the highest quality of care. With new technologies usually carrying initial premium costs, and increasing levels of demand from a more educated and ageing population, health care costs will continue to rise. In the U.S., health care expenditure is determined by private insurance companies; in the U.K., policy has given control to regional strategic health authorities. This “local” budget control can and has led to geographical health care inequality. In the U.K., this has been termed the “postcode lottery”: the treatments you are eligible to receive can depend upon the area in which you live, in accordance with the financial position of that area. Indeed, the U.K.’s Department of Health has taken this one step further, examining expenditure in a number of different areas of health care and correlating it with outcome data [28] (Fig. 13.4). In economic terms, productivity is defined as the amount of output created per unit of input. Until recently, NHS productivity was determined using cost
E. Mayer et al.
Fig. 13.4 Programme budgeting – circulatory system programme budget per capita, expenditure (million pounds) per 100,000 unified weighted population, 2004–2005 vs. Mortality from all circulatory diseases, DSR, all ages, 2002–2004, persons.
Area A has a low spend per capita and a corresponding high mortality rate. The reverse is true for Area B (reproduced with permission from Department of Health/Crown copyright)
(input) and volume measures (output). Volumes of treatment, such as GP appointments, ambulance journeys and operations, taken from the National Accounts [29], were used as indicators of how much work the NHS does. This is an oversimplified approach, which ignores important aspects such as quality of care. Owing to increasing costs, productivity measured this way has appeared to fall in recent years, by between 0.6 and 1.3% per annum [29]. However, if NHS output is adjusted to account for increased quality of care as well as the increasing value of health, NHS productivity actually shows an increase of 0.9 to 1.6% per annum [29, 30]. Thus, understanding quality of care can have economic benefits as well as increasing public satisfaction. Although new surgical technologies are usually associated with a higher cost, this cost can at times be counterbalanced by subsequent benefits. A good example is the advent of laparoscopic surgery. Given the higher price of laparoscopic equipment compared to standard equipment, along with the surgical learning curve and at times longer procedures, you would be forgiven for thinking that laparoscopic procedures were invariably associated with a higher cost of treatment. However, as shown by Hayes et al. [31], although the initial cost of procedures such as laparoscopic colectomy can be higher, there are overall
improvements in cost effectiveness due to savings from reduced recovery days and gains in quality-adjusted life years. Similarly, introducing a dedicated clinical pathway for procedures such as laparoscopic cholecystectomy can provide further cost advantages [32]. The decision-making of the individual surgeon is central to health care costs. Using Kissick’s decision-making model, Fisher et al. have demonstrated that using faecal occult blood testing as a primary screening tool for colorectal cancer can give similar sensitivity to colonoscopy, while significantly improving access with huge cost savings [33]. These examples show that improving the quality of care that a patient receives, simply by improving the efficiency of health care delivery or using evidence-based practice, can bring additional economic benefits. Providing high quality care does not necessarily have to cost more.
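The productivity arithmetic described in this section — output per unit of input, with and without a quality adjustment to output — can be sketched as follows. All index values are invented for illustration; the actual ONS methodology is more involved:

```python
# NHS-style productivity: output index divided by input index.
# Unadjusted output counts activity (operations, appointments);
# quality-adjusted output additionally credits measured improvements
# in quality of care. All figures below are hypothetical.

def productivity_growth(output_index, input_index):
    """Annual productivity growth: (output / input) - 1, as a fraction."""
    return output_index / input_index - 1.0

# Year-on-year indices (previous year = 1.00).
inputs = 1.050              # costs grew 5.0%
raw_output = 1.040          # activity volumes grew 4.0%
quality_adjustment = 1.025  # measured quality of care improved 2.5%

unadjusted = productivity_growth(raw_output, inputs)
adjusted = productivity_growth(raw_output * quality_adjustment, inputs)

print(f"unadjusted: {unadjusted:+.1%}")  # -1.0% (apparent fall)
print(f"adjusted:   {adjusted:+.1%}")    # +1.5% (apparent rise)
```

The same underlying activity and spending thus yield either falling or rising productivity, depending solely on whether quality improvement is credited as output — the crux of the adjustment debate described above.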
13.4 Benchmarking Quality of Care

13.4.1 Current Initiatives

The need for maintenance of high standards and the improvement of quality of care is well recognised. There are a number of existing programmes dedicated
Fig. 13.5 The 30 day post-operative mortality (a) and morbidity (b) for all major operations performed in the Department of Veterans Affairs hospitals throughout the duration of the National Surgical Quality Improvement Program data collection process.
A 27% decrease in the mortality and a 45% decrease in the morbidity were observed in the face of no change in the patients’ risk profiles. FY indicates fiscal year (figure reproduced with permission from reference [36])
to the improvement of quality of care. The majority of these base their work on performance benchmarking. Performance benchmarking is a tool that allows organisations to evaluate their practice against accepted best practice. If deficiencies exist, adjustments can be made with the aim of improving overall performance. This process must be continuous, as health care is a continually evolving entity. Currently, health care institutions are benchmarked either against national targets or against each other as a means of comparison. This approach identifies “good” and “bad” outliers and a cohort of “average” performers. It also serves to identify inequalities, which can then be addressed. This method of benchmarking does help to maintain a nationwide drive to continuously improve services, although there are critics of any system that arbitrarily “ranks” performance without due consideration of underlying causative factors. The Healthcare Commission is an independent body that promotes improvements in quality of care in both the NHS and independent health sectors in England and Wales. Its role is to assess and report upon the performance of health care organisations to ensure high standards of care. It evaluates performance against targets set by the Department of Health. The Healthcare Commission also looks at clinical and financial efficiency, giving annual performance ratings for each NHS Trust. The areas assessed are general, and include categories such as patient safety, clinical and cost effectiveness, governance and waiting times. The U.K. Quality Indicator Project (U.K. QIP) [34] is part of an international programme (the International Quality Indicator Project) that was started in the USA
in 1985. U.K. QIP is a voluntary exercise based upon the anonymous feedback of comparative data to encourage internal improvement within health care organisations. There is no system for publication of results or external judgement of data. By using performance indicators, the aim of the project is not to measure quality directly, but to identify areas that require further attention and investigation. Examples of surgical performance indicators include rates of hospital-acquired infections, surgical site infections, in-patient mortality and readmission rates. A similar project exists in the USA alone, called the National Surgical Quality Improvement Program [35]. This nationwide programme was started by the Department of Veterans Affairs (VA) to monitor and improve the standards of surgical care across all VA hospitals, and has been slowly introduced into the private sector since 1999 (Fig. 13.5). Performance benchmarking is a useful exercise for ensuring that a minimum standard of care is attained, but it is far too vague and imprecise to tell us whether we are delivering high quality care. Any generalisable quality assessment tool will face the same problem, as it too will be unable to appreciate the intricacies of disease-specific high quality health care.
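The outlier identification that such indicator feedback supports can be illustrated with a simple sketch. Here a provider's indicator rate (for example readmissions) is compared with a benchmark rate using an approximate binomial standard error, and rates beyond roughly two standard errors are flagged. Both the rule and all figures are illustrative assumptions, not the method of any programme named above:

```python
# Flagging "good"/"bad" outliers against a benchmark indicator rate.
# Illustrative only: a two-standard-error rule based on the normal
# approximation to the binomial. Real programmes use more careful
# methods (e.g. funnel plots with exact limits, risk adjustment).
import math

def flag_outlier(events, cases, benchmark_rate, z=2.0):
    """Return 'high', 'low' or 'in range' relative to the benchmark."""
    rate = events / cases
    se = math.sqrt(benchmark_rate * (1 - benchmark_rate) / cases)
    if rate > benchmark_rate + z * se:
        return "high"
    if rate < benchmark_rate - z * se:
        return "low"
    return "in range"

benchmark = 0.10  # e.g. a hypothetical 10% national readmission rate

print(flag_outlier(30, 200, benchmark))  # 15% on 200 cases -> high
print(flag_outlier(18, 200, benchmark))  # 9% on 200 cases  -> in range
print(flag_outlier(8, 200, benchmark))   # 4% on 200 cases  -> low
```

Note how the limits widen for small caseloads: a small provider needs a much more extreme rate before it is flagged, which is one reason naive league tables that ignore volume can mislead.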
13.4.2 Pay for Performance Strategies

Pay-for-performance (P4P) programmes use financial reimbursements for clinical providers as a “reward”
for a positive change in performance measures. It is thought that this will help drive further improvements in quality of care. These programmes have gained popularity in recent years, with new initiatives in the USA [37, 38], U.K. [39], Australia [40] and Canada [41], based in both hospital and primary care. P4P programmes typically focus upon process measures, as these can detect suboptimal care in a timely manner while being directly under the control of the clinician. There are many variations of P4P programmes, with incentives paid to individual clinicians, clinician groups, clinics, hospitals or multi-hospital collaborations. Similarly, the size of the incentive per measure can vary from $2 to $10,000, with incentives received either for reaching absolute thresholds of care, for reaching relative thresholds (such as a 30% increase in performance), or under a pay-per-case arrangement. Although studies have shown that P4P programmes can have positive effects on quality measures, these gains may be only modest [37, 42]. Cost-effectiveness is also unclear, with some studies showing massive savings [43] and others gross overspending [44]. Perhaps the most worrying aspect of P4P programmes is the unintended adverse consequences that can result. Examples include “gaming” strategies, where clinicians avoid sick or challenging patients, reclassify patient conditions, or even claim their incentive when care has not been provided. Similarly, patients may receive substantial “over-treatment” of their medical conditions. In fact, the NHS P4P programme in the U.K. found that the strongest predictor of improvement in achievement was the exclusion of patients from the programme [39]. On the other hand, clinicians and hospitals serving more disadvantaged populations may see their income fall, as targets and thresholds are difficult to reach.
Although P4P programmes have been shown to improve performance in key clinical areas, they can potentially have multiple problems if not subject to careful design and regular evaluation. In order to be successful, these programmes must be implemented with the involvement of clinicians from the very start to prevent unintended harm coming to the patient.
13.4.3 Future Direction

The greatest advances in the area of quality assessment will be in the realm of measurement of process and
overall performance. Inclusion of structural variables will also need to be considered. This should not be confused with set performance targets or blindly following clinical guidelines. Newly developed assessment scores should allow us to implement changes that will not only improve the score itself but, more importantly, improve the quality of care delivered to our patients. For example, a performance measure telling us that a certain proportion of the population underwent a particular desired process is not enough; it gives no indication of how to improve quality. It is important to know why certain members of the population failed to achieve this goal. Was it through contra-indications to that process, lack of communication, lack of compliance or something else? This information gives further understanding of what changes are needed to improve the service provided. In short, knowing that improvement is needed is important, but knowing how to improve is more helpful. Engagement in this process by institutions and clinicians is crucial. They have, to date, been reluctant, as the public reporting of performance data has had a “name-and-shame” style, inappropriately ranking providers against each other without duly considering institutional variations that cannot be adjusted for but which explain some of the variation in performance. Better methods of visually presenting performance data that avoid arbitrarily ranking health care providers, that are interpretable at face value by the lay person, and that still identify trusts needing special attention will help to engage all stakeholders in future performance benchmarking.
13.5 Public Health Implications

The assessment of quality of care is a public health issue that is becoming a dominant theme in structuring modern health care. The rigorous and accurate measurement of quality is an essential component of improving public health services and meeting public accountability. The methods by which quality is assessed have the potential to dictate health care policy well into the future, and as practicing surgeons, we must all be well educated on this topic. Some examples of the current assessment of quality of care can be gathered from the internet sources listed in Table 13.1. The field of cardiothoracic surgery has long been aware of the push towards quality improvement and
Table 13.1 Internet resources for quality of health care

Organisation                                                  URL
The Leapfrog Group                                            http://www.leapfroggroup.org/
The National Surgical Quality Improvement Program             https://acsnsqip.org/login/default.aspx
The International Quality Indicator Project                   http://www.internationalqip.com/
The Healthcare Commission                                     http://2007ratings.healthcarecommission.org.uk/homepage.cfm
The Institute for Healthcare Improvement                      http://www.ihi.org/ihi
Agency for Healthcare Research and Quality                    http://www.ahrq.gov/qual/measurix.htm
The National Committee for Quality Assurance                  http://web.ncqa.org/
The Institute of Medicine’s Health Care Quality Initiative    http://www.iom.edu/CMS/8089.aspx
The National Association for Healthcare Quality               http://www.nahq.org/
Quest for Quality and Improved Performance                    www.health.org.uk/qquip
public accountability. In the 1980s, the Society of Thoracic Surgeons (STS) initiated one of the largest data collection operations in medicine, resulting in the STS National Adult Cardiac Surgery Database. It is now the largest and most comprehensive single-specialty database in the world. It not only allows surgeons and trusts to compare their results, but is also freely available to the public, so that patients can identify their own surgeon’s outcomes. With increasing emphasis placed upon quality and performance measurement, the STS set up the Quality Measurement Task Force to create a comprehensive quality measurement programme for cardiothoracic surgery. The results of this programme have recently been published [45, 46] and (at the time of press) may represent the most up-to-date and rigorous methods by which quality assessment can be performed. Undoubtedly, further investigation and reporting of the factors driving quality of care will uncover inequality in health care provision. No one doubts that all patients should have equity of quality of care, and that achieving it can result in more lives saved. The Leapfrog Group in the U.S. now recommends that there is a
certified critical care specialist available for every ICU, and has estimated that this restructuring could save more than 54,000 lives in the U.S. per year [47]. But can the current health care infrastructure manage the geographical fluxes in demand that may result from patients exercising their freedom of choice and seeking out “better care”? Often, institutions currently able to provide higher quality health care can do so only within the constraints of their current patient population demand. Any appreciable increase in this demand can have a negative impact and subsequently lead to a worsening of the quality of their health care provision.
13.6 How to Design Health Care Quality Reforms

The objective of improving quality of care is well recognised, but implementing systems so that this objective is met is far from straightforward. Translating research evidence into the everyday structure and processes of a health care system is feasible, but is made difficult by the variation that exists across health care systems and between health care providers. Leatherman and Sutherland [48] describe a conceptual framework to facilitate the design of health system reforms, consisting of three aspects:
• A taxonomy to organise the available evidence on potential quality enhancing interventions (known as the QEI project)
• A multi-tiered approach to select and implement interventions in a health care system at four levels: national, regional, institutional and the patient–clinician encounter
• A model to guide the adoption of a balanced portfolio approach to quality improvement, recognising the prudence of simultaneously employing professional, governmental and market levers for change
The QEI project encompasses several aspects of quality improvement, such as effectiveness, equity, patient responsiveness and safety. It forms part of a wider initiative called the Quest for Quality and Improved Performance, a 5-year international collaborative research project between the University of North Carolina School of Public Health, the London School of Economics, the University of York and the University of
Cambridge. The limitations of using an evidence base to bring about health reform are recognised, such as publication bias and difficulties in translating evidence from one health care system to another, but early results of the QEI project are generating some good examples of focused quality interventions. The integration of evidence-based interventions needs to occur across all levels of a health care system in order that predictable systemic improvement in quality arises. This “multi-tiered approach to building predictable systemic capacity for improvement” describes three key factors: “horizontal coherence”, the interaction of several different types of quality interventions; “vertical coherence”, the interaction of a quality improvement intervention across the multiple levels of the health care system; and “coherence in accountability”, the balance between professionalism and professional accountability, centralised governmental control and market forces. Coherence in accountability forms the basis of a “balanced portfolio approach to quality improvement”, which recognises that professionalism, government or market factors cannot individually generate sustainable quality change.
13.7 How to Achieve Health Care Quality Improvement

As highlighted by Glickman et al. [49], proportionally more attention has been directed towards the “process” and “outcome” components of Donabedian’s structure, process, outcome framework for quality. In today’s modern health care system, “structure” consists of important organisational and managerial components that enable a multi-dimensional quality improvement agenda to be driven forward. Glickman et al. describe these organisational characteristics from a management perspective: executive management (including senior leadership and board responsibilities), culture, organisational design, incentive structures, and information management and technology. The distinctive aspect of this work is its combination of business and medical viewpoints to provide a contemporary operational definition of structure that updates Donabedian’s “physical characteristics” description. This framework engages the managerial capabilities crucial to achieving health care quality improvement.
13.8 Conclusions

Assessing quality of care in surgery is an important and essential part of maintaining and improving patient care. The very act of measurement serves to determine current standards and provides a baseline against which improvement can be made locally and/or nationwide. Benchmarking between providers will assist in identifying inequalities at a provider or regional level. Further investigation of the causative factors will discover pockets of best practice and local innovation that can then be disseminated more widely. Although traditional assessments of quality have been heavily influenced by clinical outcome measures such as mortality and morbidity, we have shown that these are clearly inadequate in isolation and do not provide a reliable assessment of the quality of a surgical service. As described by Donabedian some 40 years ago, quality of care can be explained by three key elements: structure, process and outcome. Treatment outcome measures will still form an important part of quality assessment, as they are easily understood by clinician and patient alike, and outcomes such as post-operative mortality remain an important endpoint. We will see expanded use of patient-reported outcomes, such as quality of life and current health status, in order to achieve a well-rounded view of quality of care. Measurements of structural and process-of-care variables must be used in combination with outcome measurements; they have the significant advantage of being less influenced by factors such as case-mix and patient co-morbidities. This will help to overcome the methodological difficulties of producing suitably adjusted data. These structural variables and process measures are not, however, currently widely or routinely collected, and changing this will be a labour-intensive undertaking that will undoubtedly require additional resources.
Health care economics is a further key element in assessing quality of care and has significant impact upon modern surgical care which is heavily influenced by continually evolving technological and biosurgical innovation. A lack of available finances will always act as an inhibitor to delivering the highest quality of care. The combination of surgical innovation, making more people eligible for treatment, with an ageing and increasingly demanding public means that financial constraints will remain a considerable factor for the foreseeable future.
A quality of care assessment tool should be multifactorial, taking into account the entire patient treatment episode. It should include up-to-date process measurements gleaned from evidence-based medicine and national guidelines. It should consider patient-centred as well as disease-specific clinical outcome measures, and incorporate structural variables indicating effective and efficient health care delivery. In this way, we can be confident of obtaining the most accurate and valid assessment of quality of care in surgery.
References
1. Chassin MR, Galvin RW (1998) The urgent need to improve health care quality. Institute of Medicine National Roundtable on Health Care Quality. JAMA 280:1000–1005
2. Anonymous (1986) Quality of care. Council on Medical Service. JAMA 256:1032–1034
3. BUPA Hospitals. Available at http://www.bupa-hospitals.co.uk/asp/patientcare/quality.asp#2. Accessed July 2007
4. Donabedian A (1966) Evaluating the quality of medical care. The Milbank Memorial Fund Quarterly 44:166–203
5. Donabedian A (1990) The seven pillars of quality. Arch Pathol Lab Med 114:1115–1118
6. Schiff GD, Rucker TD (2001) Beyond structure-process-outcome: Donabedian’s seven pillars and eleven buttresses of quality. Jt Comm J Qual Improv 27:169–174
7. Brook RH, Park RE, Chassin MR et al (1990) Predicting the appropriate use of carotid endarterectomy, upper gastrointestinal endoscopy, and coronary angiography. N Engl J Med 323:1173–1177
8. Brook RH, Park RE, Chassin MR et al (1990) Carotid endarterectomy for elderly patients: predicting complications. Ann Intern Med 113:747–753
9. The Leapfrog Group. Available at http://www.leapfroggroup.org. Accessed June 2007
10. Khuri SF, Daley J, Henderson W et al (1999) Relation of surgical volume to outcome in eight common operations: results from the VA National Surgical Quality Improvement Program. Ann Surg 230:414–429; discussion 429–432
11. Elixhauser A, Steiner C, Fraser I (2003) Volume thresholds and hospital characteristics in the United States. Health Aff (Millwood) 22:167–177
12. Pronovost PJ, Angus DC, Dorman T et al (2002) Physician staffing patterns and clinical outcomes in critically ill patients: a systematic review. JAMA 288:2151–2162
13. Treggiari MM, Martin DP, Yanez ND et al (2007) Effect of intensive care unit organizational model and structure on outcomes in patients with acute lung injury. Am J Respir Crit Care Med 176:685–690
14. Malin JL, Schneider EC, Epstein AM et al (2006) Results of the National Initiative for Cancer Care Quality: how can we improve the quality of cancer care in the United States? J Clin Oncol 24:626–634
15. Lilford RJ, Brown CA, Nicholl J (2007) Use of process measures to monitor the quality of clinical practice. BMJ 335:648–650
16. Chassin MR, Kosecoff J, Park RE et al (1987) Does inappropriate use explain geographic variations in the use of health care services? A study of three procedures. JAMA 258:2533–2537
17. Brook RH (1989) Practice guidelines and practicing medicine. Are they compatible? JAMA 262:3027–3030
18. Jamtvedt G, Young JM, Kristoffersen DT et al (2006) Audit and feedback: effects on professional practice and health care outcomes. Cochrane Database Syst Rev CD000259
19. Al-Ruzzeh S, Athanasiou T, Mangoush O et al (2005) Predictors of poor mid-term health related quality of life after primary isolated coronary artery bypass grafting surgery. Heart 91:1557–1562
20. Dr Foster Good Hospital Guide. Available at http://www.drfoster.co.uk/ghg. Accessed May 2007
21. The Healthcare Commission. Available at http://www.healthcarecommission.org.uk/homepage.cfm. Accessed May 2007
22. Pitches DW, Mohammed MA, Lilford RJ (2007) What is the empirical evidence that hospitals with higher-risk adjusted mortality rates provide poorer quality care? A systematic review of the literature. BMC Health Serv Res 7:91
23. Hofer TP, Hayward RA (1996) Identifying poor-quality hospitals. Can hospital mortality rates detect quality problems for medical diagnoses? Med Care 34:737–753
24. Zalkind DL, Eastaugh SR (1997) Mortality rates as an indicator of hospital quality. Hosp Health Serv Adm 42:3–15
25. Thomas JW, Hofer TP (1999) Accuracy of risk-adjusted mortality rate as a measure of hospital quality of care. Med Care 37:83–92
26. NICE (2006) Update on Herceptin appraisal. National Institute for Health and Clinical Excellence, London
27. DH (2005) Healthcare output and productivity: accounting for quality change. Department of Health, London
28. National Programme Budget project. Available at http://www.dh.gov.uk/en/Managingyourorganisation/Financeandplanning/Programmebudgeting/index.htm. Accessed September 2007
29. Office for National Statistics (2006) Public service productivity: health. Econ Trends 628:26–57
30. Atkinson T (2005) Atkinson review of government output and productivity for the national accounts: final report. HMSO, London
31. Hayes JL, Hansen P (2007) Is laparoscopic colectomy for cancer cost-effective relative to open colectomy? ANZ J Surg 77:782–786
32. Topal B, Peeters G, Verbert A et al (2007) Outpatient laparoscopic cholecystectomy: clinical pathway implementation is efficient and cost effective and increases hospital bed capacity. Surg Endosc 21:1142–1146
33. Fisher JA, Fikry C, Troxel AB (2006) Cutting cost and increasing access to colorectal cancer screening: another approach to following the guidelines. Cancer Epidemiol Biomarkers Prev 15:108–113
34. Thomson R, Taber S, Lally J et al (2004) UK Quality Indicator Project (UK QIP) and the UK independent health care sector: a new development. Int J Qual Health Care 16(Suppl 1):i51–i56
35. National Surgical Quality Improvement Program. Available at https://acsnsqip.org/main/about_history.asp. Accessed July 2007
36. Khuri SF, Daley J, Henderson WG (2002) The comparative assessment and improvement of quality of surgical care in the Department of Veterans Affairs. Arch Surg 137:20–27
37. Lindenauer PK, Remus D, Roman S et al (2007) Public reporting and pay for performance in hospital quality improvement. N Engl J Med 356:486–496
38. Park SM, Park MH, Won JH et al (2006) EuroQol and survival prediction in terminal cancer patients: a multicenter prospective study in hospice-palliative care units. Support Care Cancer 14:329–333
39. Doran T, Fullwood C, Gravelle H et al (2006) Pay-for-performance programs in family practices in the United Kingdom. N Engl J Med 355:375–384
40. Fourth National Vascular Dataset Report. The Vascular Society of Great Britain and Ireland. Available at http://www.vascularsociety.org.uk/committees/audit.asp. Accessed May 2007
41. Pink GH, Brown AD, Studer ML et al (2006) Pay-for-performance in publicly financed healthcare: some international experience and considerations for Canada. Healthc Pap 6:8–26
42. Petersen LA, Woodard LD, Urech T et al (2006) Does pay-for-performance improve the quality of health care? Ann Intern Med 145:265–272
43. Curtin K, Beckman H, Pankow G et al (2006) Return on investment in pay for performance: a diabetes case study. J Healthc Manag 51:365–374; discussion 375–376
44. Roland M (2006) Pay-for-performance: too much of a good thing? A conversation with Martin Roland. Interview by Robert Galvin. Health Aff (Millwood) 25:w412–w419
45. O’Brien SM, Shahian DM, DeLong ER et al (2007) Quality measurement in adult cardiac surgery: part 2 – Statistical considerations in composite measure scoring and provider rating. Ann Thorac Surg 83:S13–S26
46. Shahian DM, Edwards FH, Ferraris VA et al (2007) Quality measurement in adult cardiac surgery: part 1 – Conceptual framework and measure selection. Ann Thorac Surg 83:S3–S12
47. Birkmeyer JD, Birkmeyer CM, Wennberg DE et al (2000) Leapfrog patient safety standards: the potential benefits of universal adoption. Leapfrog Group, Washington
48. Leatherman S, Sutherland K (2007) Designing national quality reforms: a framework for action. Int J Qual Health Care 19(6):334–340
49. Glickman SW, Baggett KA, Krubert CG et al (2007) Promoting quality: the health-care organization from a management perspective. Int J Qual Health Care
14 Patient Satisfaction in Surgery

Andre Chow, Erik Mayer, Lord Ara Darzi, and Thanos Athanasiou
Abbreviations

PRO	Patient reported outcomes
U.K.	United Kingdom
U.S.	United States

Contents

Abbreviations .......................................................... 165
14.1	Introduction ............................................................ 166
14.2	The Patient’s Perspective of Health Care ............. 167
14.3	Patient Satisfaction ................................................ 167
14.3.1	The Meaning of Satisfaction ................................... 167
14.3.2	Determinants of Satisfaction: Patient Expectations ... 167
14.3.3	Determinants of Satisfaction: Patient Characteristics ... 167
14.3.4	Determinants of Satisfaction: Psychosocial Factors ... 168
14.3.5	Components of Satisfaction .................................... 168
14.3.6	Patient Dissatisfaction ............................................ 169
14.3.7	The Importance of Measuring Patient Satisfaction ... 170
14.4	Measurement of Satisfaction ................................. 170
14.4.1	How can we Measure Satisfaction? ........................ 170
14.4.2	The “Overall Satisfaction” Score ............................ 170
14.4.3	Satisfaction Survey Design ..................................... 171
14.4.4	Guidance for Satisfaction Measurement ................. 172
14.5	Conclusions ............................................................. 172
References ........................................................................... 172
Abstract Patient satisfaction is one of the most important patient reported outcomes and can be thought of as an ultimate endpoint to the assessment of health care quality. Although patient satisfaction has been studied for many years, a lack of understanding and absence of a precise definition of satisfaction have been flaws in the majority of research to date. Persistently high patient satisfaction ratings over many years may in fact reflect poorly constructed measurement tools, as opposed to high quality care. This chapter explores the meaning of patient satisfaction, including the analysis of satisfaction determinants and satisfaction components. The importance of satisfaction measurement is also discussed, and guidance on creating satisfaction measurement tools proposed.
14.1 Introduction
A. Chow ()
Department of Biosurgery and Surgical Technology, Imperial College London, 10th Floor, QEQM Building, St Mary’s Hospital Campus, Praed Street, London, W2 1NY, UK
e-mail: [email protected]

In the past, the goals of medicine were to reduce the morbidity and mortality of diseases that affected patients. While these are still valid and noble goals, the aims of modern health care have now evolved beyond this. As well as improving morbidity and mortality statistics, we now aim to improve aspects such as functional and cognitive status, quality of life (QOL) and productivity [7]. This enables us to ensure the highest quality of care.
T. Athanasiou (eds.), Key Topics in Surgical Research and Methodology, DOI:10.1007/978-3-540-71915-1_14, © Springer-Verlag Berlin Heidelberg 2010
Measuring the quality of health care is an essential part of modern medicine. As mentioned in other chapters, being able to define and measure quality of care has clinical benefits in terms of ensuring high quality care and driving continuing improvement, as well as economic benefits. Outcome measures such as morbidity and mortality have traditionally been used as surrogate markers of quality care. However, these traditional outcome measures give us only a one-sided view and are inadequate to evaluate the multi-dimensional goals of modern health care. The future assessment of the quality of care should also focus on more patient centred measures, including the measurement of patient satisfaction.
14.2 The Patient’s Perspective of Health Care

The measurement of health care quality can be approached from two perspectives: that of the health care provider and that of the patient. As doctors, we have a tendency to view outcomes from the health care provider’s point of view. However, it must be remembered that the patient is the most important individual in the health care system, and the patient should therefore be central to all that we do as health care professionals. The patient’s view of treatment outcomes may be completely different from that of a health care professional. For example, treatment side effects such as impotence following radical prostatectomy, or the need for a stoma following colonic resection, may be given less weight by a health care provider, who may be more interested in outcomes such as blood loss, postoperative infection and 30-day mortality rates. From the patients’ perspective, however, long-term sequelae such as impotence and the need for a stoma would be at the forefront of their minds.

The use of patient reported outcomes (PROs) can greatly enhance the assessment of quality of care. PROs provide us with the patients’ view of their health condition and their treatment, showing us the patients’ own perspective and assessment of their symptoms, functional status, productivity and satisfaction. In an ideal system, this viewpoint would be integral to our decision-making process and to our assessment of the care that we provide. This is especially important in fields such as oncology, where several different treatment options may exist and where survival gains can be small with significant treatment side effects [12].
The patient’s perspective is multi-dimensional, but can be broadly split into three components: QOL, current health state and patient satisfaction with care. These three components are the most important PROs that can be used to measure patient-orientated quality of care.

The current health state of a patient can be thought of as the symptoms and overall well-being of a patient as a result of a disease or its treatments. Knowing and understanding the patient’s health state allows us to see the position that the patient is coming from. The health state of patients can directly affect their QOL as well as their satisfaction with health care.

QOL has been extensively researched and the concept has evolved significantly over the past 30 years [25]. As health care professionals, we tend to concentrate most on health-related QOL, which is a multi-dimensional assessment of the physical, psychological and social aspects of life that can be affected by a disease process and its treatment [6]. Physical function is the ability to perform a range of activities of daily living, and also covers physical symptoms resulting from a disease or its treatments. Psychological function encompasses the emotional impact of a disease or its treatments, and may vary from severe distress to a sense of well-being; it may also include cognitive function. Social function refers to aspects of social relationships and integration. In addition, there may be supplementary issues which are specific to a particular disease. The QOL of patients can directly influence their overall satisfaction with care.

Patient satisfaction can provide an ultimate endpoint to the assessment of health care quality. It is jointly affected by current health state and QOL, and helps to give us a balance against a provider-biased perspective. It is thus an essential part of quality assessment [25] (Fig. 14.1).
Fig. 14.1 Patient reported outcomes of health care provision (diagram relating health care provision, health state, quality of life and patient satisfaction)
14.3 Patient Satisfaction

14.3.1 The Meaning of Satisfaction

Patient satisfaction is a multi-faceted construct. Although the concept of patient satisfaction and its measurement have been studied by health care professionals for years, there has been a distinct lack of attention to the meaning of patient satisfaction. This has been described as the greatest flaw in patient satisfaction research [34]. As described by Ware et al. [32], patient satisfaction can be split into two distinct areas. First, there are satisfaction determinants, which reflect patient variables that can affect satisfaction, such as patient expectations, patient characteristics and psychosocial factors. Second, there are satisfaction components, which refer to measures of the care actually received.
14.3.2 Determinants of Satisfaction: Patient Expectations

Patient satisfaction is now a recognised part of quality assessment. It is easy to imagine that high levels of reported satisfaction will correlate with high quality care. The underlying truth is, however, much more complex. The use of satisfaction as a measure of quality should not be taken at face value; it should always be interpreted with an understanding of the rationale that underlies those expressions of satisfaction [21]. A patient’s expectations of health care can greatly colour their perception of, and thus satisfaction with, care. Different patients can hold differing expectations for different aspects of care, and these have been shown to predict overall patient satisfaction [1]. For example, take the young gentleman who visits his family doctor complaining of a sore throat. The doctor correctly diagnoses a viral illness, informs the patient that there is no requirement for antibiotics and sends the patient home. The satisfaction that the patient derives from this encounter may depend significantly on his preformed expectations. He may have expected to receive antibiotics to treat his symptoms because this is how he has been treated in the past. In this case the reality of the situation would not have met his expectations. Unless the doctor does a very good job of explaining to the patient that antibiotics were not
necessary, the overall satisfaction from the encounter may be poor. Alternatively, the patient may not have had any expectations of treatment, in which case he may be entirely satisfied by the consultation. The same clinical encounter can therefore lead to the patient having completely different levels of satisfaction depending upon his expectations. This idea that patient satisfaction is related to a patient’s perception of care, and how that perception meets their expectations, was first explored by Stimson and Webb in 1975 [27]. They divided expectation into three categories: background, interaction and action. Background expectations are explicit expectations formed from accumulated knowledge of the doctor–patient interaction and consultation–treatment process. Background expectations will vary according to the illness and individual circumstances, but there are certain routines and practices that are expected, and variance from these often leads to dissatisfaction. Interaction expectations refer to patients’ expectations of how they will interact with their doctor, e.g. the doctor’s ‘bedside manner’, the form of questioning and examination. Action expectations refer to a patient’s expectation of what action the doctor will take, e.g. the prescription of medications, referral to a specialist or advice. The concept of expectations indicates that satisfaction is associated with the fulfilment of positive expectations. However, expectations will change with time and accumulating knowledge. It has been observed that increasing quality of care raises levels of expectation in step. As a result, it is possible that increasing quality of care may lead to a paradoxical lowering of satisfaction.
14.3.3 Determinants of Satisfaction: Patient Characteristics

If patient satisfaction is a subjective measure, it is only logical that satisfaction may depend upon patient characteristics such as age, gender, socio-economic class, education, religion and so forth. Age has repeatedly been shown to be the most consistent socio-demographic determinant of patient satisfaction. Numerous studies have demonstrated that older people tend to be more satisfied with health care than the younger generation [3]. Elderly patients tend to demand less information from their doctors, are more satisfied with primary and hospital care, and
are more likely to comply with medical advice [18]. Educational status is also often thought to influence satisfaction with care. There is much data from the United States (U.S.) showing that a higher level of education correlates with a lesser degree of satisfaction with care [13]. However, this has not been supported by data from the United Kingdom (U.K.), and it is possible that other influences, such as income, have confounded the U.S. evidence. The relationship between satisfaction and social class is unclear. Although a meta-analysis by Hall and Dornan [13] did demonstrate that greater satisfaction was associated with higher social class, they found that social class was not assessed by many studies, and the contradictory results for social class and education leave some uncertainty.

Data on the role of gender in patient satisfaction are also conflicting. In general, it has been found that a patient’s gender has no bearing on satisfaction ratings [13, 16], although some studies have shown reduced satisfaction ratings from female patients [18]. Ethnicity, by its diverse nature, may also have a complex influence on satisfaction scores. Studies from the U.S. show that, in general, the Caucasian population tends to be more satisfied with care than the non-Caucasian population [23]. However, these data may be confounded by socio-economic status [8]. In the U.K., the majority of work has focused upon the Asian population. Jones and Maclean found that major problems were encountered with language difficulties, as well as with perceived attitudes towards Asian patients and with hospital catering [17]. There was also particular distress caused by male physicians examining Asian female patients [22].

The relationship between socio-demographic variables and patient satisfaction is obviously not straightforward.
Although many separate studies have shown how individual socio-demographic variables can affect satisfaction, these effects may only be a minor predictor of satisfaction overall [13].
14.3.4 Determinants of Satisfaction: Psychosocial Factors

A number of psychosocial factors may affect the way a patient expresses satisfaction [20]. In general, these tend to produce an overestimation of satisfaction ratings. The cognitive consistency theory implies that a patient
will respond positively to a satisfaction questionnaire in order to justify his or her own time and effort spent obtaining treatment. The Hawthorne effect describes how the very act of surveying for patient satisfaction increases the apparent concern of the health care programme, thereby improving satisfaction responses [26]. Indifference also plays a part: patients may feel that problems will not be resolved, either because they are too large or too trivial, making accurate reporting of their satisfaction levels seem pointless. Social desirability response bias causes patients to give positive responses to satisfaction surveys because they feel these are more acceptable to satisfaction researchers [26]. Ingratiating response bias occurs when patients use positive responses to try to ingratiate themselves with health care staff, especially when anonymity is suspect. Self-interest bias implies that patients respond positively because they feel this will allow the health care programme, which is in their best interest, to continue running. There has also been concern that patients may be reluctant to complain for fear of prejudice from their health care workers in the future [24]. Gratitude is also a common confounding factor; in the U.K., the effect of gratitude has been most associated with the elderly population [26].
14.3.5 Components of Satisfaction

The components of satisfaction refer to the patients’ perceptions of the actual care that they received. These can be either specific areas of care, such as waiting times, communication and access to care, or more generalised assessments of overall quality of care. The patients’ responses to these components are (as explained earlier) influenced by their satisfaction determinants. There have been numerous attempts to classify the components of patient satisfaction. A commonly quoted classification was developed by Ware et al. in 1983 [32] and later adapted by Fitzpatrick [10] to suit the U.K. setting. This classification involves seven items, reflecting the most common factors included in satisfaction surveys:

• Interpersonal manner
• Technical quality of care
• Accessibility and convenience
• Efficacy and outcomes of care
• Continuity of care
• Physical environment of care
• Availability of care

Interpersonal manner is often thought of as the principal component of patient satisfaction and consists of two predominant elements: communication and empathy [26]. Successful interactions thus depend upon the social skills of the health care workers. Positive satisfaction ratings are known to be associated with non-verbal communication such as body positioning, head nodding and eye contact [19], aspects of care that are often dismissed as unimportant by the medical profession [29]. Tishelman in 1994 [30] discovered that almost every encounter described by patients as “exceptionally good” focused upon aspects of interpersonal interaction such as kindness and empathy, as opposed to technical competence.

Technical quality of care is naturally important to the patient. But how can a lay person judge the technical skill of a doctor, nurse or other health care worker? In fact, patients seem to be more comfortable commenting upon the personal qualities of doctors and nurses than upon their technical skill [9]. It has also been demonstrated that patients’ perceptions of their doctors’ skills and abilities are mostly determined by personal qualities such as friendliness [2]. There is also a danger that patients view the process of technical intervention as evidence of quality, with higher levels of technical intervention corresponding to higher satisfaction ratings [15]. One possible reason why patients do not seem to emphasise technical competence in comparison with interpersonal manner is that they assume a basic level of technical competence [26]. This may explain why other aspects, such as interpersonal manner, come to the forefront.

Accessibility issues include aspects such as physical access to hospitals and clinics, appointment systems, waiting lists and home visits. Long waiting times (which are especially prevalent in the U.K.
[31]), parking problems and even public transport have all led to reduced satisfaction ratings [1]. The components of satisfaction can be succinctly summarised as the ‘three A’s’ of medical consultation: accessibility, affability and ability (Fig. 14.2). As doctors, we tend to concentrate mostly on the last ‘A’: ability. We feel that improving our knowledge and skill, and thus our ability to treat patients, makes us the best doctors we can be. However, from the patients’ perspective,
Fig. 14.2 The three A’s of medical consultation (accessibility, affability and ability). Patients tend to concentrate on accessibility and affability, while medical professionals concentrate on ability
accessibility to health care services and the affability of health care workers predominate in their thoughts. A doctor’s ability may not even be considered by the patient if the doctor is neither accessible nor affable.
14.3.6 Patient Dissatisfaction

There is a stark lack of variability in the majority of satisfaction surveys; only a small minority of patients express dissatisfaction or criticism of their care [1]. In the U.K., stable overall satisfaction rates greater than 90% have been demonstrated for many years in primary care [18], and of 80% or more in hospital settings [35]. These good results may seem ideal for the health care system, but the lack of variability causes problems for health care researchers, who find it difficult to compare positive with more positive results. If we are to use patient satisfaction as an indicator of high quality care, the assessment of satisfaction must be sensitive enough to detect changes in health care quality. With current satisfaction surveys giving such little variability in results, the way in which we assess satisfaction must surely be questioned. Not only are our questionnaires evidently not sensitive enough to detect change, but they also do not help us implement change in order to improve our services.

If the focus is taken away from overall satisfaction and directed towards specific aspects of care, more variability can be found. Questions of a more detailed nature elicit greater levels of dissatisfaction than generalised questions [35]. Similarly, different questioning procedures and types of scale used can also affect the degree of dissatisfaction
expressed by patients [33]. It has also been shown that the volume of comment can be a more sensitive indicator of satisfaction than overall ratings [5]. An alternative and commonly used satisfaction model is the “discrepancy model” [4, 26]. This model states that the lack of variability seen in satisfaction research should steer us away from aiming for consistency of satisfaction. Instead, we should be concentrating on dissatisfaction and where there is discrepancy in results. In other words, we need to know what is wrong, not what is right. Understanding the situations that lead to discrepant findings should be more important than attaining high satisfaction results.
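The discrepancy model’s point, that specific items expose variation which a global rating hides, can be sketched numerically. The Python below is an illustrative sketch only: the ratings, patient numbers and the `waiting_times` item are invented for demonstration.

```python
from statistics import mean, stdev

# Hypothetical ratings (1-5) from the same ten patients: a global
# "overall satisfaction" item versus a specific item on waiting times.
overall = [5, 5, 4, 5, 5, 4, 5, 5, 5, 4]
waiting_times = [5, 2, 4, 1, 5, 2, 4, 1, 3, 2]

# The global item clusters at the top of the scale and varies little,
# while the specific item spreads out and shows where the
# dissatisfaction actually lies.
print(mean(overall), round(stdev(overall), 2))
print(mean(waiting_times), round(stdev(waiting_times), 2))
```

A survey that reports only the first pair of numbers would conclude that all is well; the second pair is where the discrepancies, and hence the actionable information, reside.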
14.3.7 The Importance of Measuring Patient Satisfaction

Understanding the concept of patient satisfaction is important on many levels. The measurement of satisfaction allows us to change our practice to improve the quality of care that we provide to our patients. This will not only directly improve health outcomes, but will have many additional beneficial effects.

At a national level, attention to patient satisfaction and a drive to increase satisfaction with services will help to improve the public’s perception of the health care system. Along with greater pride, there may be a greater willingness to invest in the health care system. Increasing satisfaction with health care will also increase trust in those who create health care policy.

On a more individual level, improved satisfaction with care could result in greater compliance with care. Patients may be more likely to follow their doctors’ instructions regarding lifestyle modifications, as well as medications, if they are more satisfied with the care that they receive. This would lead to beneficial changes in the health of the population. Similarly, there may be less misuse of health care services. Together with a boost in morale and the provision of positive feedback to health care workers, the efficiency and productivity of the health care system as a whole could improve.

Thus, the incorporation of patient satisfaction as a part of health care quality may have benefits not just at the level of a single patient’s perceptions, but also for the wider interests of the population’s health and health care policy. It is not untenable to think that
Table 14.1 Web links for further information on national patient surveys

NHS Surveys: http://www.nhssurveys.org/
Department of Health: www.dh.gov.uk/en/Publicationsandstatistics/PublishedSurvey/NationalsurveyofNHSpatients/Nationalsurveyinpatients/index.htm
Healthcare Commission: www.healthcarecommission.org.uk/nationalfindings/surveys/healthcareprofessionals/surveysofnhspatients.cfm
Picker Institute: www.pickereurope.org/index.php
improved patient satisfaction could bring benefits in terms of cost savings as a result of fewer complaints, “second opinions” and repeated investigations. Patient satisfaction could ultimately form a component of a world class commissioning process and the associated payment by results. The patient experience is already being taken into account through the use of nationwide patient surveys in the U.K.’s NHS, carried out jointly by the Department of Health and the Healthcare Commission. Further information on these surveys can be found by following the links in Table 14.1.
14.4 Measurement of Satisfaction

14.4.1 How can we Measure Satisfaction?

As we have seen, the concept of satisfaction is a complex and multi-faceted one. The scope for manipulation of the design, and thus the results, of satisfaction surveys is boundless. If we are to use satisfaction reliably to aid our assessment of health care quality, satisfaction surveys must be carefully planned and tested. We must pay careful attention to the aspects of satisfaction that are measured, to the type of questionnaire used, and to the timing of the measurements.
14.4.2 The “Overall Satisfaction” Score

We must first consider whether an “overall satisfaction” rating is a valid concept. We have seen that for
many years, overall satisfaction ratings within the NHS have remained high, yet when more specific questions are asked about exact areas of care, more dissatisfaction is generally elicited. We must therefore conclude that the overall satisfaction score is in fact masking varying levels of dissatisfaction with care. Can we still, therefore, use an overall satisfaction score to assess our quality of care?

It has been claimed that there are six dimensions that determine patient satisfaction [14]: medical care and information, food and physical facilities, non-tangible environment, quantity of food, nursing care and visiting arrangements. Even if this is true, the question is whether these dimensions can be combined to provide an overall satisfaction score. In order to do this, we must first calculate a ‘weighting’ for each dimension. This, in turn, is complicated by the fact that for each patient, with his or her individual expectations and experiences, the ‘weight’ given to each dimension will differ. Two methods of eliciting weights have been recognised: direct and indirect [4]. The direct method asks patients to assign a numerical score to each of these dimensions in terms of importance; it has been argued that the main limitation of this method is the tendency of “raters” to assign equal measures to each dimension [28]. The indirect method studies patient responses to a range of questions, such as scenarios. This method has also been found to be suspect, as it is difficult to extract information about how each patient individually weighs and combines the different dimensions to give their response [11]. Thus, we can see that the combination of specific satisfaction scores to produce an overall satisfaction score is fraught with problems.
There is very little evidence to suggest that we can reliably produce a unified satisfaction score, and thus its use to assess quality of care is questionable. Such a construct would be unable to identify changes in quality, or the specific areas that require improvement. More useful would be satisfaction measures specific to a particular clinical situation or health care area.
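The weighting difficulty can be made concrete with a short sketch. The Python below is illustrative only: the dimension names follow the six dimensions quoted above, while the ratings, weights and the `overall_score` function are hypothetical. The same set of dimension ratings yields different “overall” scores as soon as patients weight the dimensions differently.

```python
# Illustrative sketch: combining dimension-level satisfaction ratings
# (1-5) into a weighted "overall" score. Ratings and weights invented.

def overall_score(ratings, weights):
    """Weighted mean of dimension ratings; weights are normalised first."""
    total = sum(weights.values())
    return sum(ratings[d] * weights[d] / total for d in ratings)

ratings = {
    "medical care and information": 4,
    "food and physical facilities": 2,
    "non-tangible environment": 4,
    "quantity of food": 3,
    "nursing care": 5,
    "visiting arrangements": 3,
}

# Two patients report identical experiences but weigh the dimensions
# differently: one weights all dimensions equally, the other values
# nursing care five times as highly.
weights_equal = {d: 1 for d in ratings}
weights_nursing = {**weights_equal, "nursing care": 5}

print(round(overall_score(ratings, weights_equal), 2))    # 3.5
print(round(overall_score(ratings, weights_nursing), 2))  # 4.1
```

A single published “overall” figure necessarily fixes one of these weightings for everybody, which is precisely what the direct and indirect elicitation methods fail to justify.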
14.4.3 Satisfaction Survey Design

The results of satisfaction surveys are particularly sensitive to their design. There are four crucial parameters
that can influence results: choice of population, timing, type of questionnaire and the rating of satisfaction [4].

The choice of population surveyed has a huge effect upon results. One can either survey the public as a whole (as all are entitled to use the health service) or restrict the survey to current or recent users of the health service. Even if surveys are restricted to current users, there can still be large variations in the type of patients interviewed, according to their characteristics and demographics.

The timing of surveys is also important. The greater the length of time between health care service use and interview or questionnaire, the greater the chance of recall bias, of changes in perception or appreciation of care, and of patients overlooking aspects that bothered them at the time of care.

The type of questionnaire is probably the most important methodological consideration. It is crucial that the form of question should not distort the patient’s view, but this can be difficult to achieve. There are two main types of question: “open-ended” questions, where a patient is asked to comment on an area of care, and “closed” questions, where direct questions are asked about satisfaction with services. With “open” questioning, the patient is free to comment on areas of care from which we can infer satisfaction. “Closed” questioning gives us quantitative evaluations, but does not tell us which situation the patient is referring to. Direct questions act well as probes to uncover dissatisfaction with areas of care that may not be mentioned in response to an “open-ended” question. The ideal survey should include both types of question, to avoid under-reporting of problems and to identify areas for change.

There are numerous ways to rate satisfaction. The most common is a categorical score following a question.
The patient can choose from a range of responses: “very satisfied”, “satisfied”, “dissatisfied” or “very dissatisfied”. This simple method has its benefits, but can also be problematic. For example, a change from “satisfied” to “dissatisfied” may represent either an accumulation of small shifts in separate component areas, or a large shift in a single component. Responses to this form of rating system also tend to fall into two narrow bands, and are thus only superficially indicative of high satisfaction levels [4]. Where there is a substantial change in satisfaction ratings, the cause is usually an obvious one.
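As a sketch of why such categorical ratings can mislead, the hypothetical Python below tallies two survey rounds on the four-point scale just described. A collapsed “% satisfied” figure stays flat even when many patients slip from “very satisfied” to merely “satisfied”; all response counts and the `summarise` helper are invented for illustration.

```python
from collections import Counter

SCALE = ["very satisfied", "satisfied", "dissatisfied", "very dissatisfied"]

def summarise(responses):
    """Return the response distribution and a collapsed '% satisfied' figure."""
    counts = Counter(responses)
    n = len(responses)
    dist = {category: counts.get(category, 0) / n for category in SCALE}
    pct_satisfied = dist["very satisfied"] + dist["satisfied"]
    return dist, pct_satisfied

# Hypothetical survey rounds: in the second, many "very satisfied"
# patients have slipped to merely "satisfied".
round_1 = ["very satisfied"] * 60 + ["satisfied"] * 30 + ["dissatisfied"] * 10
round_2 = ["very satisfied"] * 20 + ["satisfied"] * 70 + ["dissatisfied"] * 10

dist_1, pct_1 = summarise(round_1)
dist_2, pct_2 = summarise(round_2)

# The collapsed figure is 90% in both rounds, masking a real
# deterioration that the full distribution reveals.
print(pct_1, pct_2)
print(dist_1["very satisfied"], dist_2["very satisfied"])
```

Reporting only the collapsed figure reproduces exactly the two-narrow-bands problem described above; retaining the full distribution does not.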
14.4.4 Guidance for Satisfaction Measurement

As demonstrated earlier, the measurement of patient satisfaction is riddled with obstacles. The authors feel that it is beyond the scope of this chapter to give a detailed guide to the assessment of satisfaction for each clinical setting; indeed, it may be near impossible to produce such a guide. We do feel, however, that there are some basic principles which should be considered before embarking on measuring satisfaction.

We must remember that there are satisfaction determinants as well as components. A satisfaction rating on its own is of little value without information on the patient’s perspective, so gathering detailed socio-economic data is important. It would also be ideal to gather information regarding the expectations of the patient: as satisfaction can be significantly affected by how reality meets expectation, assessment of expectation would be invaluable.

The authors feel that the use of an overall satisfaction score may be severely limited. An overall score can mask underlying variability and provide a false sense of quality. Instead, we feel that satisfaction should be assessed for individual dimensions of health care, without the need to combine them into a unified satisfaction score. As well as satisfaction, more attention should be paid to expressions of dissatisfaction, which may give us more helpful information on where the problems are and what needs to be improved.

Once we are able to measure satisfaction reliably, another issue needs to be addressed: what do we do with the information? An individual satisfaction rating on its own is of limited value; the real interest lies in comparison. We can either benchmark current scores against historical scores to determine progress, or benchmark scores against other departments, units or institutions as a method of performance measurement.
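The two benchmarking strategies just described can be sketched as follows. This is an illustrative Python fragment in which every score, year label and unit name is hypothetical.

```python
# Hypothetical mean satisfaction scores (1-5 scale) for one surgical unit.
history = {"year 1": 3.6, "year 2": 3.8, "year 3": 4.1}  # the unit's own trend
peers = {"unit A": 4.1, "unit B": 3.4, "unit C": 3.9}    # comparable units

current = history["year 3"]

# Historical benchmark: is the unit improving against its own past?
improving = current > history["year 1"]

# Peer benchmark: rank the unit among comparable units (1 = best).
rank = 1 + sum(1 for score in peers.values() if score > current)

print(improving, rank)
```

In practice either comparison would be made dimension by dimension, for the reasons given above, rather than on a single unified score.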
14.5 Conclusions

The rise of a more consumerist society has had a significant impact on modern medicine, with more emphasis on a patient-led health service. Patients are now better informed, more demanding and wish to exert more
influence upon their health care. This increasingly patient-centred approach has seen a drive to encompass the patient’s perspective in the assessment of health care quality. One of the major ways in which we can do this is by measuring patient satisfaction with health care. However, many have rushed to implement satisfaction surveys and measurements without proper consideration of the meaning of satisfaction or the interpretation of the data. As a result, current satisfaction ratings have remained stable at a high level for many years. Although this may make our managers happy, it does not provide us with any useful information on how to improve our service. Further attention now needs to be paid to the meaning of satisfaction, its determinants and its components. By collecting data on patients’ characteristics and expectations, as well as hearing their viewpoints on varying aspects of health care, we can continue to identify problem areas and thus implement changes for improvement. Ultimately, if we are able to reliably assess patient satisfaction, we can truly mould our health care system around the person who matters most: our patient.
References

1. Abramowitz S, Cote AA, Berry E (1987) Analyzing patient satisfaction: a multianalytic approach. QRB Qual Rev Bull 13:122–130
2. Ben-Sira Z (1976) The function of the professional’s affective behavior in client satisfaction: a revised approach to social interaction theory. J Health Soc Behav 17:3–11
3. Blanchard CG, Labrecque MS, Ruckdeschel JC et al (1990) Physician behaviors, patient perceptions, and patient characteristics as predictors of satisfaction of hospitalized adult cancer patients. Cancer 65:186–192
4. Carr-Hill RA (1992) The measurement of patient satisfaction. J Public Health Med 14:236–249
5. Carstairs V (1970) Channels of communication. In: Scottish Health Service Studies 11. Scottish Home and Health Department, Edinburgh
6. Cella DF, Tulsky DS (1990) Measuring quality of life today: methodological aspects. Oncology (Williston Park) 4:29–38; discussion 69
7. Deyo RA (1991) The quality of life, research, and care. Ann Intern Med 114:695–697
8. Doering ER (1983) Factors influencing inpatient satisfaction with care. QRB Qual Rev Bull 9:291–299
9. Fitzpatrick R (1984) Satisfaction with health care. In: Fitzpatrick R (ed) The experience of illness. Tavistock, London
10. Fitzpatrick R (1990) Measurement of patient satisfaction. In: Hopkins D, Costain D (eds) Measuring the outcomes of
medical care. Royal College of Physicians and King’s Fund Centre, London
11. Froberg DG, Kane RL (1989) Methodology for measuring health-state preferences – I: measurement strategies. J Clin Epidemiol 42:345–354
12. Ganz PA (1995) Impact of quality of life outcomes on clinical practice. Oncology (Williston Park) 9:61–65
13. Hall JA, Dornan MC (1990) Patient sociodemographic characteristics as predictors of satisfaction with medical care: a meta-analysis. Soc Sci Med 30:811–818
14. Health Policy Advisory Unit (1989) The patient satisfaction questionnaire. HPAU, Sheffield
15. Hopkins A (1990) Measuring the quality of medical care. Royal College of Physicians, London
16. Hopton JL, Howie JG, Porter AM (1993) The need for another look at the patient in general practice satisfaction surveys. Fam Pract 10:82–87
17. Jones L, Maclean U (1987) Consumer feedback for the NHS. King Edward’s Hospital Fund for London, London
18. Khayat K, Salter B (1994) Patient satisfaction surveys as a market research tool for general practices. Br J Gen Pract 44:215–219
19. Larsen KM, Smith CK (1981) Assessment of nonverbal communication in the patient-physician interview. J Fam Pract 12:481–488
20. LeVois M, Nguyen TD, Attkisson CC (1981) Artifact in client satisfaction assessment: experience in community mental health settings. Eval Program Plann 4:139–150
21. Locker D, Dunt D (1978) Theoretical and methodological issues in sociological studies of consumer satisfaction with medical care. Soc Sci Med 12:283–292
22. Madhok R, Bhopal RS, Ramaiah RS (1992) Quality of hospital service: a study comparing ‘Asian’ and ‘non-Asian’ patients in Middlesbrough. J Public Health Med 14:271–279
23. Pascoe GC, Attkisson CC (1983) The evaluation ranking scale: a new methodology for assessing satisfaction. Eval Program Plann 6:335–347
24. Raphael W (1967) Do we know what the patients think? A survey comparing the views of patients, staff and committee members. Int J Nurs Stud 4:209–223
25. Schwartz CE, Sprangers MA (2002) An introduction to quality of life assessment in oncology: the value of measuring patient-reported outcomes. Am J Manag Care 8:S550–S559
26. Sitzia J, Wood N (1997) Patient satisfaction: a review of issues and concepts. Soc Sci Med 45:1829–1843
27. Stimson G, Webb B (1975) Going to see the doctor: the consultation process in general practice. Routledge and Kegan Paul, London
28. Sutherland HJ, Lockwood GA, Minkin S et al (1989) Measuring satisfaction with health care: a comparison of single with paired rating strategies. Soc Sci Med 28:53–58
29. Thompson J (1984) Communicating with patients. In: Fitzpatrick R (ed) The experience of illness. Tavistock, London
30. Tishelman C (1994) Cancer patients’ hopes and expectations of nursing practice in Stockholm – patients’ descriptions and nursing discourse. Scand J Caring Sci 8:213–222
31. Wardle S (1994) The Mid-Staffordshire survey. Getting consumers’ views of maternity services. Prof Care Mother Child 4:170–174
32. Ware JE Jr, Snyder MK, Wright WR et al (1983) Defining and measuring patient satisfaction with medical care. Eval Program Plann 6:247–263
33. Wensing M, Grol R, Smits A (1994) Quality judgements by patients on general practice care: a literature analysis. Soc Sci Med 38:45–53
34. Williams B (1994) Patient satisfaction: a valid concept? Soc Sci Med 38:509–516
35. Williams SJ, Calnan M (1991) Convergence and divergence: assessing criteria of consumer satisfaction across general practice, dental and hospital care settings. Soc Sci Med 33:707–716
15 How to Measure Inequality in Health Care Delivery

Erik Mayer and Julian Flowers
Abstract There has been an increased focus on health inequalities and equity even in developed countries over the last decade. Reducing health inequalities is an important policy objective. The origin of health inequalities and their development is a complex interplay between structural, social and individual factors influencing both population health and individual health. The study of health inequality is a vast, complex and rapidly developing field. The focus of this chapter will relate to our current understanding of the measures and dimensions of inequality in health care delivery. For areas not covered in depth in this chapter, suitable references for further reading will be provided where applicable.
Contents

15.1 Introduction
15.1.1 Access to Health Care
15.2 Dimensions of Inequality in Health Care Delivery
15.2.1 Patient-Level Characteristics
15.2.2 Primary (Community) Care Characteristics
15.2.3 Secondary (Hospital) Care Characteristics
15.2.4 Limitations of Inequality Research
15.3 Measuring Inequality in Health Care Delivery
15.3.1 What to Measure – Measuring Health Care
15.3.2 What to Measure – Measuring Health Inequality and Inequity
15.3.3 Data Sources
15.3.4 Limitations of Data Sources
15.4 Methods for Measuring Health Care Inequality
15.4.1 Health Gap Analysis
15.4.2 Share-Based Measures
15.4.3 Methodological Limitations
15.5 Conclusions
References

E. Mayer () Department of Biosurgery and Surgical Technology, Imperial College London, 10th Floor, QEQM Building, St. Mary’s Hospital Campus, Praed Street, London, W2 1NY, UK; e-mail: [email protected]

15.1 Introduction

There has been an increased focus on health inequalities and equity, even in developed countries, over the last decade. For example, reducing health inequalities is an important policy objective of the British government. Despite this renewed focus, several definitions of health inequality have been proposed, including:
• Differences in health that are avoidable, unjust and unfair [1]
• Differences in health status or in the distribution of health determinants between different population groups [2]
• Systematic and potentially remediable differences in one or more aspects of health across populations or population groups defined socially, economically, demographically or geographically [3]
The strengths and weaknesses of these and other definitions are discussed in detail by Braveman [4]. There is general consensus that health inequality exists when there
are potentially avoidable differences in health status or outcomes between disadvantaged people or populations and those less disadvantaged. Disadvantage often encapsulates notions of social exclusion, deprivation or other forms of social or economic discrimination. Contemporary definitions of health inequality incorporate the concept that there is a systematic trend to health inequality; inequalities in health do not occur randomly [5]. Having clear definitions aids measurement and understanding and can, therefore, direct appropriate action.

Not all health inequality results from factors external to the patient, and not all inequality, therefore, is inequitable. The terms inequality and inequity tend to be used interchangeably, although their meanings are distinct: inequality measures factual differences, whereas inequity incorporates a moral judgement. Health inequality generally implies a degree of unfairness and avoidability, and inequity tends to refer to the distribution of health or health care according to need. Often we investigate inequality because it can point towards potential inequity. So, for example, on the one hand there are large variations in coronary heart disease death rates between geographical areas in the UK (and indeed between countries across Europe), i.e. there are inequalities in health outcomes; on the other, there is wide variation in the prescription of beta blockers as secondary prevention for people with known coronary heart disease – there are inequities in health care provision for these patients.
The origin of health inequalities and their development is a complex interplay between structural, social and individual factors influencing both population health and individual health (Figs. 15.1, 15.2). Health care and health care systems can contribute to both individual and population health improvement, and also to reducing health inequality and health inequity, although the relationship between improving equity of access, for example, and reducing inequality of health outcomes may be complex and is often not well understood. It is evident from Figs. 15.1 and 15.2 that health system characteristics offer potential for a targeted approach to tackling inequality. Although it may not be possible to eradicate health inequality entirely, the structure and processes of health care delivery can influence the degree to which it exists. In particular, the greatest impact may be on the systematic relationship with characteristics such as geography, ethnicity and socio-economic group [6]. Inequality in health care delivery, and the corresponding equity in health care, has been defined as follows:

Health care is equitable when resource allocation and access are determined by health needs [7]
Researchers now distinguish horizontal equity, which exists when individuals or groups with equal need consume equal amounts of health care, from vertical equity, which exists when individuals with varying levels of need consume appropriately different amounts of health care.
[Fig. 15.1 is a diagram spanning the ecological, aggregated individual and individual levels, linking the political and policy context, occupational/environmental exposures, material and social resources (and their deprivation or isolation), socioeconomic, developmental, behavioural/cultural, psychosocial and genetic/biological characteristics, and health system characteristics, via stress and health services received, to health.]

Fig. 15.1 Factors that influence health at the individual level. Reproduced with the permission of the Pan American Health Organisation (PAHO) from [5]
[Fig. 15.2 is a diagram linking the political context and economic, social, occupational/environmental and health policy, environmental and social characteristics, economic development, historical health disadvantage, behavioural/cultural characteristics, demographic structure and health system characteristics to equity in health and to population health.]

Fig. 15.2 Factors that influence health at the population level. Reproduced with the permission of the Pan American Health Organisation (PAHO) from [5]
Indeed, horizontal equity is a guiding principle of the UK National Health Service (NHS) to provide health care on the basis of clinical need. Nevertheless, there is mounting evidence of the “inverse care law” – provision unrelated to need. Examples include emergency procedures (appendicectomy and cholecystectomy), which are more common in deprived populations, as are tonsillectomies. Varicose vein surgery tends to be more common in less deprived areas (Fig. 15.3). Equally, we see inequality in surgical outpatient attendances between Primary Care Trusts across England (Fig. 15.4). The importance of the distinction between horizontal and vertical equity lies in the appreciation that although clinicians may treat the patients they see according to their individual needs, other factors may deter or encourage referral or attendance. Vertical inequity is really only detectable at a population level through comparison between populations or areas. For example, imagine two populations with an equal “need” for hip replacement based on a similar prevalence of disabling hip osteoarthritis. Imagine that one has greater social deprivation than the other such that people are less able to access health care (e.g. lack of affordable transport, inability to take time off work, less demanding population). The patients referred and
who attend orthopaedic clinics may have equivalent clinical need in both areas, but at the population level the rates of hip replacement may vary, or the proportion of patients with osteoarthritis receiving a hip replacement may be lower in the more deprived population. The level of inequity can, therefore, only be determined once “need variables” have been distinguished from “non-need variables”, as need should affect consumption of health care and non-need variables should not. More recently, Culyer and Wagstaff have provided four definitions of equity in health care: equality of utilisation, distribution according to need, equality of access and equality of health. They conclude that, in general, these four definitions are mutually exclusive and practically incompatible, but correctly identify that each of these components needs to be aligned within the distribution of a health care service so as to get as close as is feasibly possible to an equal distribution of health [9].
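The hip replacement thought experiment above can be sketched numerically. All figures below are invented purely for illustration:

```python
# Two hypothetical areas with identical clinical need (the same prevalence
# of disabling hip osteoarthritis) but different levels of deprivation.
areas = {
    "less deprived": {"prevalent_cases": 2000, "hip_replacements": 600},
    "more deprived": {"prevalent_cases": 2000, "hip_replacements": 350},
}

# At the individual level every operated patient may have equivalent need;
# the inequity only appears when population-level rates of met need are compared.
met_need = {name: a["hip_replacements"] / a["prevalent_cases"]
            for name, a in areas.items()}

for name, share in met_need.items():
    print(f"{name}: {share:.0%} of need met")
```

With equal prevalence, 30% of need is met in one area against 17.5% in the other: vertical inequity that no clinic-level comparison of individual patients would reveal.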
15.1.1 Access to Health Care

A component of health care delivery is, therefore, concerned with creating a health care system or
[Fig. 15.3 reproduces an SPSS correlation matrix (Pearson correlation, 2-tailed significance, N = 48) relating the Index of Multiple Deprivation 2007 to admission rates for cholecystectomy, grommets, hernia repair, knee replacement, hip replacement, tonsillectomy, varicose veins and appendicectomy; * p < 0.05, ** p < 0.01. Along the deprivation row, admission rates correlated significantly with deprivation for cholecystectomy (r = 0.31), hip replacement (r = 0.31) and appendicectomy (r = 0.37).]
Fig. 15.3 Relationship between deprivation and area admission rates (European standardised, 2005/6) for a range of common surgical procedures in a single region of England. Source: Hospital Episode Statistics [8]
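Correlations of the kind tabulated in Fig. 15.3 are straightforward to reproduce. A minimal sketch of the Pearson coefficient with illustrative data only (not the values behind the figure):

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient between two sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical area-level data: deprivation score vs standardised
# appendicectomy admission rate per 1,000 population.
deprivation = [12.0, 18.5, 24.0, 31.5, 40.0, 47.5]
admission_rate = [0.9, 1.1, 1.0, 1.3, 1.2, 1.5]

r = pearson_r(deprivation, admission_rate)  # positive: rates rise with deprivation
```

In practice a statistics package such as SPSS (used for Fig. 15.3) or `scipy.stats.pearsonr` would also supply the two-tailed p-value used to flag significance.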
environment in which there exist fair or equal opportunities for all people to attain health, i.e. equity of access to health care. The term “access” itself needs to be carefully defined, as health care delivery occurs as a three-tier system: hospital care, primary (community) care and public health [10]. For a service or health care system to be accessed, it needs at least to be:
• Available – the service needs to be present
• Accessible – e.g. the service can be reached by public transport, is open when people can use it, is culturally sensitive to the population of patients likely to use it, and so on
• Appropriate – the service needs to be relevant to the needs of the population
Research into inconsistency of health care access can be divided into three domains: patient-level variables, the characteristics and practices of health care professionals, and the system of health care delivery [11].
Measures of health status and health inequality have traditionally focused on indicators of health outcome such as mortality, rather than direct measures of health status, although there are strong correlations between the two [12]. Frameworks for understanding health care structure, processes and patient outcomes have increased awareness of the need to study measures of both the delivery of health care and the distribution of that delivery [13]. The understanding that health care delivery is linked to patient outcome, or health, was stratified by Donabedian in his 3 × 4 matrix over 25 years ago [14] (Fig. 15.5). Furthermore, equity and lack of variation in health care delivery (consistency in process) forms one of the “seven pillars of quality” as defined by Donabedian [15] (Table 15.1). Health care inequality research is, therefore, critical to further our progress in improving the quality of care that our patients receive. “The notion of health care quality implies that resources are allocated according
Fig. 15.4 Lorenz curve showing inequality in surgical outpatient attendances across PCTs in England, 2007/8. Source: http://nww.nhscomparators.nhs.uk/
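A Lorenz curve such as Fig. 15.4 ranks units (here PCTs) in ascending order of the quantity of interest and plots cumulative population share against cumulative share of attendances; the further the curve bows away from the 45° diagonal, the greater the inequality, and the Gini coefficient summarises this as twice the area between curve and diagonal. A minimal sketch with made-up attendance figures:

```python
def lorenz_points(values):
    """(cumulative unit share, cumulative value share) points, units sorted ascending."""
    vals = sorted(values)
    total, n = sum(vals), len(vals)
    points, cum = [(0.0, 0.0)], 0.0
    for i, v in enumerate(vals, start=1):
        cum += v
        points.append((i / n, cum / total))
    return points

def gini(values):
    """Gini coefficient: one minus twice the area under the Lorenz curve
    (trapezoidal rule). 0 = perfect equality, approaching 1 = maximal inequality."""
    pts = lorenz_points(values)
    area = sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
    return 1.0 - 2.0 * area

# Hypothetical surgical outpatient attendance rates for a handful of PCTs.
attendances = [40, 55, 70, 120, 215]
g = gini(attendances)
```

The `lorenz_points` output can be plotted directly to recreate a chart in the style of Fig. 15.4.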
Table 15.1 Equity in health care distribution forms one of Donabedian’s seven pillars of quality health care. Adapted from a figure in reference [16]
Efficacy – the ability of care, at its best, to improve health
Effectiveness – the degree to which attainable health improvements are realised
Efficiency – the ability to obtain the greatest health improvement at the lowest cost
Optimality – the most advantageous balancing of costs and benefits
Acceptability – conformity to patient preferences regarding accessibility, patient-practitioner relationships, amenities, effects of care and cost of care
Legitimacy – conformity to social preferences concerning all of the above
Equity – fairness in the distribution of care and its effects on health
Fig. 15.5 Donabedian’s 3 × 4 quality matrix linking health care delivery (resources, process, outcome, access) with patient outcome (technical quality, affect/relationship quality, continuity of care)
to medical need, risk and benefit”, yet quality assessment tools have failed to address health care inequality across socio-economic groups adequately [17]. Understanding and uncovering inequality in health care delivery, therefore, aims to reduce inequality and inequity in health and simultaneously to increase the quality of care for a greater proportion of the population. Few, however, have looked at the impact of health care access on health inequality [18]. Improving the equality of delivery can also bring financial gains by decreasing inequality in health outcomes and, therefore, health care demand, through fewer complications, fewer exacerbations of chronic illness, lower numbers of emergency re-admissions, etc.
The study of health inequality is a vast, complex and rapidly developing field. The focus of this chapter will relate to our current understanding of the measures and dimensions of inequality in health care delivery. For areas not covered in depth in this chapter, suitable references for further reading will be provided where applicable.
15.2 Dimensions of Inequality in Health Care Delivery

As already discussed, there is a non-random distribution of health in the population; health problems tend to cluster systematically for some individuals and some groups of individuals [5]. The generation of average health statistics at the national or even local level will tend to ignore this clustering phenomenon and, therefore, hide important variations within the population of interest. This is one of the main arguments for categorising groups within a population and then exploring their relationship with measures of inequality in health care delivery. Random inequity can arise, however, because of factors such as variations in medical practice and historical accident [19]. The majority of studies assess horizontal equity rather than vertical equity, which is less commonly addressed and gives rise to more complex ethical issues, such as the conflict between trying to generate equality of delivery and the potential for removing patient-led choice and generating inefficiency of utilisation by equalising access. Many studies have been performed over the years, with some conflict in the reported results. We report here on the most contemporary studies, which tend to use the richest data sources and make better adjustment for concepts such as “need”, thereby reflecting the most current position with regard to inequality in health care delivery.
15.2.1 Patient-Level Characteristics

15.2.1.1 Socio-Economic Characteristics

Socio-economic inequalities in health result from both social causation mechanisms (behavioural, structural, environmental and psycho-social factors associated with a low socio-economic status resulting in poor health) and social selection (poor health impacts negatively on education, employment, income, etc., further lowering socio-economic status) [20]. Across countries there is a strong association between income inequality and health inequality, with inequality favouring the rich; this association is particularly marked for the US and the UK [21]. Although
inequalities in health tend to originate outside the health care system, it follows that mechanisms of health care delivery can act to either worsen or improve any inequality that might exist. Inequality in health care delivery can be measured across socio-economic factors such as income, education, occupational characteristics and social class. In the UK, as in other European countries, equity of availability of a broad range of health care services, regardless of income, has been achieved through a taxation-funded health care system. This has not, however, eliminated socio-economic inequality in health care access or “realised access” [19]. Income-related inequality in doctor utilisation, as measured by self-reported “number of doctor contacts” and adjusting for a self-reported measure of need, has been demonstrated for a number of European countries using European Community Household Panel (ECHP) data. Wealthier and more highly educated individuals are more likely to have contact with a medical specialist than the poor after adjusting for need, despite the poor having greater needs. There appears to be little or no need-adjusted inequality for general practitioner (GP) services [22]. Interestingly, the most important socio-economic variables driving a higher need-unadjusted use of GP services by the poor were low education, retirement and unemployment. A more recent study using improved methodology for adjustment of need confirms the pro-rich estimates for specialist care and lessens any pro-poor inequity in GP care that had previously been demonstrated [23]. Similar findings to those from the ECHP dataset have been found for the Organisation for Economic Co-operation and Development datasets, which now incorporate 30 member countries worldwide [24]. Variation across countries was seen in each of the studies. If one looks specifically at the UK results, some contradictions between the studies are seen: van Doorslaer et al. [22] showed a slight pro-rich inequity in the need-adjusted probability of a GP visit, with a more significant pro-rich inequity for specialist care, whereas van Doorslaer et al.’s alternative study using the 2001 British Household Panel Survey data showed no significant income-related inequity for GP, medical specialist (outpatient) or hospital (inpatient) care utilisation. Morris et al., using the 1998–2000 Health Survey for England, investigated inequality in the use of GPs, outpatient visits, day cases and inpatient stays, adjusting for local supply conditions and subjective and objective measures
of individuals’ need. Individuals with low incomes were more likely to consult their GP, but less likely to have outpatient visits, day cases and inpatient stays. The same was true for individuals with lower levels of formal qualifications, except for outpatient visits, where no association was found [25, 26]. Socio-economic inequity for specific surgical treatments or services, such as total hip replacement, has been demonstrated [27], although there is evidence that this inequality decreased slightly over the ten-year period between 1991 and 2001 [28].
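The income-related inequities discussed above are commonly summarised with a concentration index, a share-based measure related to the Gini coefficient: individuals are ranked by income rather than by the health variable, and the index is negative when use is concentrated among the poor and positive when it is concentrated among the rich. The covariance formulation below is a standard one, sketched on invented data; it is not presented as the exact procedure of the studies cited:

```python
def concentration_index(use, income):
    """Income-related concentration index C = 2 * cov(use, income_rank) / mean(use)."""
    n = len(use)
    # Fractional income rank: poorest individual gets 0.5/n, richest (n - 0.5)/n.
    order = sorted(range(n), key=lambda i: income[i])
    rank = [0.0] * n
    for position, i in enumerate(order):
        rank[i] = (position + 0.5) / n
    mean_use = sum(use) / n
    mean_rank = sum(rank) / n  # always 0.5
    cov = sum((use[i] - mean_use) * (rank[i] - mean_rank) for i in range(n)) / n
    return 2 * cov / mean_use

# Hypothetical specialist visits per year for four people, poorest to richest.
income = [8_000, 15_000, 32_000, 60_000]
visits = [1, 2, 3, 4]  # use rises with income rank
c = concentration_index(visits, income)  # positive: pro-rich distribution
```

Applied to need-standardised rather than raw use, an index near zero would indicate horizontal equity, which is the logic behind the need-adjusted analyses described above.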
15.2.1.2 Socio-Demographic Characteristics

Ethnicity

Using the General Household Survey dataset, Smaje and Grand [29] demonstrated a trend towards equivalent or higher GP use and lower outpatient use relative to Whites among most ethnic groups. Only for the Indian and Pakistani groups was there a significantly greater utilisation of GP services. The relationship was unaltered after adjusting for income or socio-economic group. The major limitation, however, was the inadequate adjustment for varying degrees of “need” between ethnic groups. Perhaps the most interesting result was the consistent finding of lower levels of outpatient use by “non-sick” individuals across all the minority ethnic groups as compared to White patients. Morris et al., using more robust regression analysis, demonstrated similar results, with a trend towards non-White ethnic groups utilising GP services less than Whites, but outpatient services more than Whites. The only significant results were for the Indian group and the Pakistani, Bangladeshi and Chinese groups, respectively. There were no significant differences in the use of day case services or the probability of an inpatient stay, except for the Chinese, who were 45% less likely to have had an inpatient stay [26]. NHS Direct, an emergency telephone advice service, was launched to help improve access to health care. It may, however, have inadvertently increased ethnic inequity in health care delivery, with lower utilisation levels by non-White patients and those born outside the UK [30]. “Need”, however, was defined only as limiting or non-limiting long-term illness; this limits extrapolation of the findings, as NHS Direct likely serves a significant population with acute illnesses.
Gender and Age

Summary studies have demonstrated that females, irrespective of age, are significantly less likely to utilise GP visits, day case treatment and inpatient stays [26]. Age was shown to have non-linear effects on the probability of use of the same health care services, including also outpatient visits. This incorporated both conditional (the effect of age on use, keeping all other factors constant) and unconditional (incorporating the effects of other variables which affect use and are correlated with age, such as morbidity) estimated relationships. Interestingly, Allin et al. [31] studied horizontal inequality of utilisation of health care services only in patients aged 65 and over in the UK, using British Household Panel Survey data (1997–2003). They found that those on a lower income were significantly less likely to visit a GP or a specialist in outpatients despite having the greater need, i.e. pro-rich horizontal inequity. Households with older residents are also less likely to use the UK’s nationally available emergency telephone advice service, NHS Direct [30]. In more specifically orientated studies, women were found to be twice as likely to need hip replacements, but equally likely to be receiving care as males, whereas older people, despite a greater need for total hip replacements, were less likely to be receiving care than younger patients [27]. Shaw et al., in 2004, using Hospital Episode Statistics (HES) data, showed that women, in addition to older people in England, are probably receiving less revascularisation in terms of coronary artery bypass grafting and percutaneous transluminal coronary angioplasty than their need, as defined by admission rates for acute myocardial infarction, would indicate. The authors do, however, accept the limitations of the data in terms of being able to adjust adequately for clinical severity and indications for treatment [32].
Geography

The variation across countries in overall utilisation of health care services has already been described [22]. Within countries, geographically distinct groups have commonly been used as comparators to identify potential inequality. Equitable geographical distribution of health care carries an obvious political incentive. Patients living in urban areas of the UK consult GPs more commonly than those in rural areas, whereas in the US
this imbalance is not seen [33]. Although the causes of the differences seen in the UK are complex, evidence suggests the influence of a range of personal factors that affect an individual's intention to consult a GP, irrespective of location; this may be particularly true for "out-of-hours" access [33]. In a study conducted in the north and northeast of Scotland, there was a significant trend between the presence of disseminated disease at presentation in patients with lung cancer or colorectal cancer and increasing distance of residence from a cancer centre. This may not, however, be attributable to limitations in health care delivery: women with breast cancer who lived further from cancer centres were treated more quickly, but only because they received earlier treatment at non-cancer-centre hospitals [34]. No such difference was seen for patients with colorectal cancer. In a study of the effects of rurality on need-adjusted use of health services for total hip replacement, patients living in rural areas and in need of total hip replacement were as likely to access GP or hospital care, or to be on a waiting list, as patients in urban areas [27]. More local-level analysis can demonstrate inequalities in health care delivery. Dixon et al. showed regional variation (across eight administrative NHS regions of England) in age- and sex-standardised hip and knee joint replacement rates [35]. There are, however, many patient and service level factors that can explain the variation seen; the most important for total hip replacement was the proportion of older patients aged 65–84 years (this despite the data being age-standardised). For total knee replacement, over 50% of the regional variation could be explained by the number of centres in each region offering surgery.
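The age-standardised rates referred to above can be illustrated with a short sketch of direct standardisation, in which age-specific rates are weighted by the age structure of a common reference population. All figures below are invented for demonstration; they are not taken from the studies cited.

```python
# Direct age-standardisation of a procedure rate (illustrative sketch).
# Region-specific procedure counts and populations by age band (invented):
region = {
    "45-64": {"procedures": 120, "population": 200_000},
    "65-84": {"procedures": 480, "population": 150_000},
}

# Reference ("standard") population for the same age bands (invented):
standard_population = {"45-64": 250_000, "65-84": 100_000}

def directly_standardised_rate(region, standard_population):
    """Weight each age-specific rate by the standard population's share."""
    total_standard = sum(standard_population.values())
    rate = 0.0
    for band, data in region.items():
        age_specific_rate = data["procedures"] / data["population"]
        weight = standard_population[band] / total_standard
        rate += age_specific_rate * weight
    return rate  # per person; multiply by 100,000 to report per 100,000

rate_per_100k = directly_standardised_rate(region, standard_population) * 100_000
print(round(rate_per_100k, 1))  # 134.3 per 100,000 with these invented figures
```

Two regions with identical age-specific rates but different age structures will report identical standardised rates, which is why residual variation in standardised data (as for hip replacement above) points to factors beyond demography alone.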
Return now to the principle of horizontal equity of health care provision, under which intervention rates (supply) for a defined disease process should increase appropriately as need (demand) increases. Analysis of the relationship between surgical intervention for lung cancer and lung cancer incidence across the Primary Care Trusts within a Strategic Health Authority has shown little correlation [36]. Furthermore, it has shown the presence of inequity, with 50% of the interventions being performed in the 40% of areas with the highest lung cancer incidence (Fig. 15.6). The exact reasons for the inequality seen between socio-economic and socio-demographic groups are not known, but may reflect factors such as disadvantaged individuals having more and/or multiple co-morbidities,
E. Mayer and J. Flowers
Fig. 15.6 Concentration curve for lung cancer incidence: cumulative % of interventions plotted against cumulative % of population, ranked in order of decreasing lung cancer incidence. Red line shows the expected curve if intervention rates were most concentrated in areas with highest need. Reproduced with permission from reference [36]
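A concentration curve of the kind shown in Fig. 15.6 can be built in a few lines. The area-level data below are invented for illustration; they are not the data from reference [36].

```python
# Build a concentration curve: rank areas by decreasing lung cancer
# incidence, then accumulate population and intervention shares.
areas = [
    # (incidence per 100,000, population, interventions) - all invented
    (110, 50_000, 12),
    (95, 80_000, 20),
    (70, 60_000, 18),
    (55, 90_000, 25),
    (40, 70_000, 15),
]

# Rank areas from highest to lowest incidence (i.e. decreasing need)
areas.sort(key=lambda a: a[0], reverse=True)

total_pop = sum(a[1] for a in areas)
total_interv = sum(a[2] for a in areas)

cum_pop, cum_interv = 0, 0
curve = [(0.0, 0.0)]
for incidence, pop, interventions in areas:
    cum_pop += pop
    cum_interv += interventions
    curve.append((cum_pop / total_pop, cum_interv / total_interv))

# If interventions track need, the curve rises faster than the diagonal:
# high-incidence areas receive more than their population share of operations.
for pop_share, interv_share in curve:
    print(f"{pop_share:.2f} -> {interv_share:.2f}")
```

Plotting these points (population share on the x-axis, intervention share on the y-axis) reproduces the form of Fig. 15.6; a curve hugging the diagonal, as found in [36], indicates that supply is not concentrated where need is greatest.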
thereby restricting their suitability for, or benefit from, surgery and increasing the risk of post-operative complications and poorer long-term outcomes [28]. It may also be related to variation in "demand factors", whereby disadvantaged individuals could have lower overall expectations of the treatment available to them, or for cultural or educational reasons be less able to access the correct level of care appropriately and to understand the medical system and/or the information provided [28]. It is also possible that there are "supply factors" at play, with longer waiting lists and less capacity to undertake operations in hospitals located in more deprived areas. Variation in medical practitioners' clinical practice regarding diagnosis, investigations, timing of referrals and indications for operation can occur geographically or be affected by deprivation [28, 37]. These supply factors will now be discussed in more detail.
15.2.2 Primary (Community) Care Characteristics

Hippisley-Cox and Pringle looked at several characteristics of primary care facilities that may affect patients' access to coronary angiography and revascularisation. Factors that lowered angiography and revascularisation rates were primary care practices that were more than 20 km from the revascularisation (referral) centre and
15
How to Measure Inequality in Health Care Delivery
those treating a more deprived patient population. The same variables were also related to longer waiting times for angiography [38]. Interestingly, patients from fundholding practices had a higher admission rate for coronary angiography. Using the cardiovascular disease Quality and Outcomes Framework (QOF) indicators, Saxena et al. assessed the relationship between practice size and caseload and the quality of care provided, as defined by the 26 QOF indicators. They found that the quality of care was generally consistently high regardless of caseload or practice size, except for selected indicators related to early diagnostic investigations and management, for which quality of care was lower among smaller practices with lower caseloads and in more deprived areas [39]. The authors suggested that this may reflect better access to resources among larger practices, possibly as a result of local planning and commissioning where higher caseloads exist. A related study looked at the quality of care provided by primary care practices based on their performance across 147 QOF indicators spanning eleven chronic disease processes. Practices located in areas with less social deprivation, training practice status and group practice status were all independently associated with higher QOF scores [40]. Both the Saxena et al. [39] and Ashworth et al. [40] studies demonstrate that primary care practices in less deprived areas can achieve high levels of QOF achievement by delivering resources to potentially low-risk patients, and may therefore overlook those at higher risk and in most need of the resources.
15.2.3 Secondary (Hospital) Care Characteristics

There has for many years been a distinction between the "specialist" hospital and the "general" hospital, and an ongoing debate as to whether patient outcomes differ between them. Pertinent to this question are the characteristics of the health care that they deliver and whether inequality exists in their differences, as this will clearly affect the quality of care that they can subsequently provide. Cram et al. reported an assessment of the rate of adverse outcomes between speciality orthopaedic hospitals and general hospitals for Medicare beneficiaries undergoing either hip or total knee replacement. Although speciality hospitals displayed better risk-adjusted patient outcomes, they were treating patients
who had less co-morbidity and resided in more affluent geographical areas [41]. Even within the speciality hospital group, those that were physician-owned were found to be treating patients with fewer co-morbidities than non-physician-owned centres, and patients who underwent major joint replacement in physician-owned speciality hospitals were less likely to be Black, despite the hospitals being located in neighbourhoods with a higher proportion of Black residents [42]. The lower mortality following cardiac revascularisation procedures seen in speciality cardiac hospitals as compared with general hospitals has been shown to be partly explained by the better health of the patients that they treat [43]. Inequality of health care delivery as a result of hospital characteristics such as these can have a significant impact on patient outcomes. Patient outcomes are not determined solely by "patient-based risk", but also indirectly by the improved hospital processes that can result from a more efficient and productive patient "throughput": fewer intensive care admissions, shorter lengths of stay, fewer complications, etc. As these factors are partly outside the patient's control, they could be said to be inequitable. Indeed, there is some evidence to suggest that supply incentives could further worsen inequity as a result of "physician-induced demand", which can increase utilisation by more than would be explained simply by providing facilities for increased capacity [44]. In the UK health care system, a significant secondary care characteristic that needs to be considered is independent sector health care access and utilisation. For surgery, "private" operations can account for a significant proportion of the total undertaken; privately performed hip replacements, for example, account for about a quarter of England's total caseload.
Surgery undertaken in the private sector occurs as a result of long NHS waiting times and, therefore, tends to be concentrated in more affluent areas of the UK, such as the Southeast of England. Aside from causing inequality in itself, this leads to underestimation of socio-economic inequality in the NHS sector because of unobserved activity in less socio-economically deprived areas [28].
15.2.4 Limitations of Inequality Research

Inequality research reveals associations between dimensions of inequality and health care delivery. It does not, however, reveal causal relationships. It is unlikely that a
single association, if reversed, will eliminate inequity, because of the complex interaction of numerous factors in the generation of inequality and inequity. That said, the demonstration of associations allows more focused research to identify the most likely causal relationships to be acted upon. Although the research to date has produced many interesting results, it has also caused some confusion, as there are often contradictions between studies in terms of the direction of association, or whether a true association exists at all. Such contradiction is often the result of methodological differences between studies; these differences are described in detail by Dixon et al. [45], but include factors such as the use of self-reported morbidity rather than objective measures, using only inpatient stay data and not including day case data, a focus on emergency admissions only, and not addressing issues of appropriateness of care. Differences in reported results between studies over time could additionally be the result of sampling errors, changes to the geographical distribution of health care resources and the growth and use of the private sector. Dixon et al. suggest that these methodological limitations imply a hierarchy of evidence quality: micro-studies rank highest, followed by macro-studies with more disaggregated indices of need and utilisation, and then the remaining macro-studies. Goddard and Smith also identify methodological limitations of studies, which make it difficult to draw firm conclusions about inequities in access to health care, identify potential causes and recommend appropriate policies to reduce them [19]. These methodological limitations were highlighted after the authors had set out a general theoretical framework within which the equity of health care access can be researched [19]. This framework draws
Table 15.2 Internet resources for inequality in health and health care

• International Society for Equity in Health: www.iseqh.org
• World Health Organisation – Health Systems Performance: http://www.who.int/health-systemsperformance/
• European Public Health Alliance – Health inequalities: http://www.epha.org/r/50
• Determine – European Portal for Action on Health Equity: http://www.healthinequalities.org/
together several ambiguous concepts, such as "need", "access", "utilisation", "demand" and "supply", and provides a useful platform for aligning future research in this area. Table 15.2 provides links to organisations, and their websites, with a focus on health and health care inequality.
15.3 Measuring Inequality in Health Care Delivery

There are two parts to any health care inequality measurement:
• The aspect of health care of interest, e.g. structure, process, outcome, utilisation or access
• The precise metric used to assess the level of inequality or inequity
To facilitate understanding, it helps to be precise about both aspects.
15.3.1 What to Measure – Measuring Health Care

Following the approach of Donabedian and others, we can identify a number of ways of measuring aspects of health care:
• Structure – provision of services, staffing levels, numbers of beds and so on
• Utilisation – referral and attendance rates, procedure rates, admission rates
• Outcomes – hospital mortality rates, adverse event rates, mortality rates, survival rates
Two main approaches have been described for measuring inequality in health care delivery: summary "macro" studies that measure utilisation of NHS services at a "high level", such as numbers of GP visits, specialist visits and A&E attendances, and more specific "micro" studies that focus on particular diagnoses or treatments [45]. Before such studies can be performed, more fundamental decisions need to be taken in choosing the methodology employed. The following section discusses features important to the assessment of inequality in health, the area in which they were first developed, though they are equally applicable to measuring inequality in health care delivery.
15.3.2 What to Measure – Measuring Health Inequality and Inequity
Health or a measure of health is distributed throughout a study population. The distribution of health can be described in two ways: its location (a measure of central tendency) or its dispersion (variability in the distribution) [46]. Measures of dispersion can be considered in either a univariate or a bivariate fashion. Univariate measures of dispersion solely assess the variability in the distribution of a health measure across the study population. Bivariate measures of dispersion assess the variability of a health measure in conjunction with a secondary variable, such as a socio-demographic factor. For example, assessing the screening rate for breast cancer across Primary Care Trusts in England and Wales is a univariate measure of dispersion of health care delivery; assessing the relationship between screening rates for breast cancer and social class across Primary Care Trusts is a bivariate measure of dispersion. When measuring dispersion, the unit of interest within the study population can be either the individual or, as in the above example, groups. Researchers are split into two camps by their differing opinions of what should be measured in terms of health inequality. Those who believe in measuring health inequality in a univariate fashion, also known as "pure inequalities in health", criticise the bivariate approach of "evaluating health differences between categories or values of a socio-economic characteristic because they involve a moral judgement". There is also the concern that the bivariate approach may not reflect inequalities across individuals in the population [47]. On this view, socio-economic characteristics should be "part of the process of explaining health inequalities, not part of the process of measuring them" [6]. Proponents of the bivariate approach believe, to the contrary, that health inequality should be assessed at the socio-economic level, as this forms a fundamental component of determining inequity where it exists; it is argued that the "higher level" univariate approach can fail to identify inequality where it exists at a more local level between socio-economic groups. Both arguments prevail, and whether a univariate or bivariate approach is used will be determined by the question posed. Figure 15.7 illustrates the additional information adduced by analysing data by socio-economic status. When trying to identify the potential for inequality in a study population of interest, it is useful first to use a summary statistic approach. Summary statistics can be generated using comparator groups as categorical variables or
Fig. 15.7 Map showing small area variation in life expectancy at birth within a county area in England (most deprived areas are outlined) (left), and chart showing the trend in life expectancy of the county as a whole and of the most and least deprived fifths of areas (right). Life expectancy in the most deprived areas is significantly worse than in the area as a whole, a fact not obvious from the univariate analysis represented by the map. Source: Eastern Region Public Health Observatory. http://www.erpho.org.uk/Download/Public/17088/1/2007%20HIP%20Cambridgeshire%20PCT%20county.pdf. Accessed July 2008
continuous variables, and can additionally use the univariate and bivariate approaches described above. The way in which the data are handled determines the statistical approach used to identify inequality, where present, and how the results are displayed. As identified by Carr-Hill and Chalmers-Dixon [48], three questions need to be answered first when devising a method to measure inequalities:
1. Which units of interest are to be compared?
2. What type of inequality is it that you are interested in?
3. What is the intended purpose of the results generated?
When the data are handled in a categorical fashion, differences in a measure of health between defined groups within a population are sought; this is known as "health gap" analysis. The differences can be absolute, relative, or used in combination, depending on the question being answered. Health inequality can also be measured across individuals, thereby treating the unit of interest as a continuous variable; these are known as share-based measures and are generally considered more complicated than health gap analysis. The applicability of these different approaches to measuring health inequality will be explored in further detail later in the chapter. Thinking specifically about health care delivery, this includes both provision (access), which can be measured across geographically defined groups as a univariate measure, and patient uptake. In line with definitions of horizontal equity, there should be equality of access to health care services for individuals with equal need. Access should, therefore, be independent of dimensions of inequality such as socio-economic status, ethnicity, etc., except where these affect need [45]. Patient uptake (utilisation) of health care services may be more directly influenced by socio-economic group, gender, ethnicity, etc.; this is more suited to bivariate analysis.
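The univariate/bivariate distinction can be made concrete with a short sketch. The screening-uptake figures for five deprivation quintiles below are invented for illustration only.

```python
# Univariate vs bivariate views of dispersion in a health care measure.
from statistics import mean, pstdev

# Invented screening uptake by deprivation quintile
# (1 = least deprived, 5 = most deprived).
rates = {1: 0.82, 2: 0.79, 3: 0.75, 4: 0.71, 5: 0.66}

# Univariate: how spread out are the rates, ignoring who holds them?
spread = pstdev(rates.values())

# Bivariate: do rates fall systematically as deprivation rises?
# A simple least-squares slope of uptake on quintile number.
quintiles = list(rates)
values = list(rates.values())
mq, mv = mean(quintiles), mean(values)
slope = (sum((q - mq) * (v - mv) for q, v in zip(quintiles, values))
         / sum((q - mq) ** 2 for q in quintiles))

print(f"univariate spread (SD): {spread:.3f}")
print(f"uptake change per quintile of deprivation: {slope:.3f}")
```

The univariate statistic says only that rates vary; the bivariate slope additionally shows the variation is patterned by deprivation, which is what a judgement of inequity (rather than mere inequality) requires.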
There is no point equalising access if uptake is inequitable, as inequality in health will still prevail. However, inequality of utilisation cannot always be said to be inequitable, as individuals or groups of individuals may have equal need and equal access, but choose not to utilise health care services for whatever reason [45]. Distributions that are the outcome of factors beyond individual control are generally considered inequitable; distributions that are the outcome of individual choices are not [49].
Health care delivery is thus composed of two elements, access and utilisation: does inequality in health
care delivery indicate a resource issue, or does it represent variation in demand? To measure true inequality in delivery, one needs to identify a shortfall in the supply/demand ratio. As identified by Goddard and Smith, variation in the supply (access) of health care can arise from availability, quality, costs, and information and communication [19]. Data sources and the health outcomes that they record do not generally give an indication of the extent to which utilisation, or the lack of it, results from individual choice. Researchers generally accept that any demonstrated inequality in health care delivery results from inequality in access, which is, therefore, inequitable [45]. The second element in the definition of horizontal equity that is far from obvious is "need". Two definitions of need are commonly presented: current ill-health (or severity of illness) and "capacity to benefit". Although the latter incorporates the concept that individuals who are not currently ill can still benefit in terms of their future health from preventative health care, it measures the entity that care will affect (health), rather than the entity that is needed (health care) [6]. Indeed, if need is defined as capacity to benefit, individuals presenting early in the course of illness have the greater need, whereas if need is defined as current ill-health or severity of illness, individuals presenting later have the greater need [45]. Need is most often defined as current ill-health because it is more readily available as a data source.
15.3.3 Data Sources

Datasets intended for mapping inequality, or that can be used for this purpose, are collected and collated at the international, national or local level. A limitation of national datasets is that they are intended to allow comparison between aggregated groups and may, therefore, be ineffective for lower-level analysis. Equally, local-level datasets are often unsuitable for more extensive comparison because of heterogeneity in data definitions and in the methodology of collection and formatting. Researchers may, therefore, be limited in their analysis by the data available to them. There are data sources encompassing many dimensions of health inequality, such as social care, deprivation, education, unemployment, environment, crime, and income and benefits. These are covered extensively in
The Public Health Observatory Handbook of Health Inequalities Measurement [48]; the reader is directed to this reference for further reading, as these sources fall outside the scope of this section. The choice of data sources will be governed partly by the type of study being carried out. Broader "macro" studies will typically use large-sample household surveys, such as the UK's General Household Survey carried out by the Office for National Statistics or the British Household Panel Survey carried out by the UK Longitudinal Studies Centre at the University of Essex. Both surveys collect and collate information at an individual and household level on a range of socio-economic and socio-demographic variables, including information on health status and use of health services. An international example of such a survey is the ECHP, which, using a standardised questionnaire, annually interviews a representative sample of households and individuals in each of 14 member countries. More specific "micro" studies use one of three possible types of data, as defined by Cookson et al. [28]:
• Specialist small-sample patient survey data, with detailed information on condition-specific need
• Administrative data on the use of specific procedures, linked at individual level to detailed information on socio-economic status and need from patient records or specialist surveys
• Administrative data linked at small area level to socio-economic, demographic and other small area statistics – often with no information about need other than population demographics
An example of specialist small-sample patient survey data is the study by Malin et al. [50], which analysed adherence to quality measures for cancer care, including delivery of care specifically for patients with a new diagnosis of either stage I–III breast cancer or stage II or III colorectal cancer.
The incorporation of clinical domains representative of the entire patient episode was unique: diagnostic evaluation, surgery, adjuvant therapy, management of treatment toxicity and post-treatment surveillance. Eight components of care integral to these clinical domains were examined further: testing, pathology, documentation, referral, timing, receipt of treatment, technical quality and respect for patient preferences. In all, 36 and 25 explicit quality measures with clinically detailed eligibility criteria, specific to the process of cancer care, were identified for breast and colorectal cancer, respectively. Overall
adherence to these quality measures was 86% (95% CI 86–87%) for breast cancer patients and 78% (95% CI 77–79%) for colorectal cancer patients. Subgroup analysis across the clinical domains and components of care did, however, identify significant variability in adherence: 13–97% for breast cancer and 50–93% for colorectal cancer. Detailed information on condition-specific need was determined from hospital cancer registries, patient surveys and relevant medical records. An example of the second type of data is the Health Survey for England, a series of specialist annual surveys. It contains a "core" component of information such as socio-economic and socio-demographic details, general health and psycho-social indicators, use of health services, and measurements of height, weight and blood pressure. In addition, each year there is a specialist module on a single topic, several topics or certain population groups, assessed using directed questionnaires, physical measurements and other relevant objective measures such as analysis of blood samples, echocardiogram readings and lung function tests. Specialist modules to date have looked at cardiovascular disease, asthma and accidents, and at children, the elderly and ethnic groups [51]. HES is a data warehouse that contains information about hospital admissions and outpatient attendances in England [8]. HES is representative of administrative data that can be linked to small area demographic statistics, but it does not contain specific data on need. Admitted patient care data are available from 1989–90 and outpatient care data from 2003–04 onwards. HES is derived from the patient administration systems of health care providers.
Most recently, responsibility for the collation of HES was given to the National Programme for IT's Secondary Uses Service, a recently developed secure data environment which will facilitate activities including health care planning, public health, benchmarking and performance improvement. Although HES does not directly collect data on the delivery of health care, the patient-level record of admissions, operations and diagnoses that it collates allows us to begin to look for areas of potential inequality that warrant further investigation. HES also uses a unique patient identifier and includes postcode data; this allows data within HES to be linked to other geographically defined variables such as deprivation scores. In addition to HES, there are also broader performance measures of hospital activity collated by the Department of Health as part of their required returns.
These Hospital Activity Statistics [52] look more directly at processes concerned with the delivery of health care and measure factors such as:
• Waiting times of patients with suspected cancer, and of those subsequently diagnosed with cancer, at NHS Trusts in England
• Waiting times for operation, elective admission, diagnostics and first outpatient appointment, at both provider and commissioner level
• Bed availability and occupancy rates
• Day care provision
• Cancelled operations
• Critical care facilities
Initiatives in the process of being fully implemented, such as Choose and Book [53] and the 18-week patient pathway programme [54], will provide further data relevant to identifying inequality in health care delivery. Many of the datasets accessible for research into inequality in health care delivery have arisen from governmental policy aimed either at tackling inequality or at improving the quality of health care by setting targets against which health care providers' performance is compared. An increasingly competitive NHS, under growing scrutiny from both government and the public to improve transparency and accountability, has generated and made available large amounts of information, some of which will now be highlighted. In response to the government's national inequalities targets for life expectancy and infant mortality outlined in 2001 [55], the London Health Observatory, on behalf of the Association of Public Health Observatories, developed the local basket of inequalities indicators [56]. This incorporated an extensive review of hundreds of health inequality measures and indicators available in the public domain; they were then compared against explicit criteria targeted at improving their effectiveness in benchmarking, thereby assisting local areas in monitoring progress towards reducing health inequalities.
Seventy indicators were grouped into 13 categories; two of these, "access to local health and other services" and "tackling the major killers", relate in whole or in part to inequality in health care delivery. The basket of indicators provides a framework against which inequality of health care delivery can be assessed, but the authors acknowledged that the indicators are weighted towards health outcomes and determinants, and that there is a paucity of indicators closely related to delivery that could be expected to change in the short term.
The QOF [57] was introduced in 2003 and contains a set of targets across four domains: clinical, organisational, patient experience and additional services, against which primary care practices are assessed and subsequently financially remunerated. QOF data can be linked at practice level to secondary care data and, at Primary Care Trust level, to local population socio-economic and demographic data, to establish differences in primary care practice performance and patient outcomes [58]. This data source may allow future analysis of the interaction between the primary and secondary care sectors and subsequent health care delivery. Perhaps the most important part of the QOF is the development of disease registers for a range of chronic diseases, which give a good estimate of disease prevalence and a useful proxy measure of need, and which can support health equity audits.
15.3.4 Limitations of Data Sources

Much of the limitation surrounding available data sources stems from the ambiguity surrounding definitions of "need", "access" and "utilisation" and the indicators that best represent these definitions; this is most pertinent for "need", which has both objective and subjective components. This inevitably leads to difficulties in comparing the results of studies, given the varying definitions and indicators used. The comparison over time of studies that use identical data sources is also complicated by changes in the geographical boundaries of health care sectors, such as those seen in the UK in 2006, when Strategic Health Authorities were reduced from 28 to 10 and Primary Care Trusts from 303 to 152. As well as making it difficult to compare the results of data sources, and therefore of studies either side of these changes, this can also complicate the linkage of data sources. Even at hospital level, changes in the commissioning process can "redefine" the catchment population of a Trust and, therefore, change the socio-demographic groups that it provides for. There are two recognised concerns with the use of administrative databases for inequality research. The first is the potential for coding error, which can obviously affect the validity of results. Although this was a legitimate concern in previous years, coding completeness has improved considerably in recent years, as a result of a greater number of studies using the datasets and
15
How to Measure Inequality in Health Care Delivery
so identifying inadequacies, and also of initiatives such as payment by results, which have incentivised health care providers to pay more attention to the coding process and its accuracy. The second limitation of administrative datasets is the paucity of data related to appropriateness of care or case-mix adjustment. Techniques to overcome the case-mix adjustment issue have been developed using the Charlson index, although it was not initially developed for this purpose. Appropriateness of care has been addressed in some US studies by linking data from the Surveillance, Epidemiology and End Results Program of the National Cancer Institute with the administrative Medicare files database of health care service contacts. This allows data to be adjusted for cancer grade and stage, and therefore the exploration of whether disparities in health care delivery, such as mammographic screening, affect overall and stage-specific survival [59]. In the UK, 12 population-based cancer registries collect information about new cases of cancer and produce statistics on incidence, prevalence, survival and mortality. Better future linkage of these cancer registries with administrative datasets such as HES will help to further research into the effectiveness of local cancer service provision. There are acknowledged limitations to the data sources currently used for inequality research, but not all of them are a direct result of the process of data collection; some merely reflect the difficulty of agreeing consensus definitions of the endpoints to be measured. Resolution of this issue alone would significantly improve future research efforts.
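The case-mix adjustment mentioned above can be sketched in miniature. The Charlson index is, at its core, a weighted count of co-morbidities; only a small subset of conditions is shown below, with the commonly cited weights, and the condition names and example patient are illustrative. A real implementation would map full ICD code lists from the administrative record to each condition.

```python
# Charlson-style co-morbidity score: a weighted sum over recorded
# conditions. Subset of conditions only; weights are the commonly
# cited values from the original index.
CHARLSON_WEIGHTS = {
    "myocardial_infarction": 1,
    "congestive_heart_failure": 1,
    "diabetes": 1,
    "moderate_severe_renal_disease": 2,
    "metastatic_solid_tumour": 6,
}

def charlson_score(comorbidities):
    """Sum the weights of the patient's recorded co-morbidities;
    conditions outside the mapping contribute nothing."""
    return sum(CHARLSON_WEIGHTS.get(c, 0) for c in comorbidities)

patient = ["diabetes", "congestive_heart_failure"]
print(charlson_score(patient))  # 2
```

In inequality research, such a score is typically entered as a covariate so that differences in outcome between groups are not simply attributable to sicker patients clustering in one group.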
15.4 Methods for Measuring Health Care Inequality

The methods developed and used for measuring inequality have been applied to the analysis of health outcomes or indexes of health, with little application to measuring health care delivery. They are, however, directly transferable, and the indications and limitations of the different statistical approaches hold true. Inequality can either be measured at a single point in time, or changes can be measured over time. It is important to remember that if a health indicator is measured over time, it must be measured with reference to a comparator; otherwise health gain, and not inequality, is being assessed. The methodological approach used will be guided by whether the health measure of interest is handled in a continuous or categorical fashion and whether the dispersion of health care delivery is measured in a univariate or bivariate manner. Summary measures are a useful tool to display health inequality, its extent and variation. The more commonly used measures are described in more detail below. For a more detailed discussion and mathematical description of measures of health inequalities, the reader is referred to references [47, 60].

15.4.1 Health Gap Analysis

Health gaps are measures of the relative or absolute difference in health status between different groups in the population. They can be reported in a number of ways, including a range (the difference between the “best” and “worst”, or highest and lowest), relative rates (rate ratios), e.g. relative mortality rates between two health care providers for a specified operation, and absolute differences, with the national or population average used as the comparator. Figure 15.8 illustrates the various methods of reporting for health gap analysis. Health gap analysis in this form uses a categorical approach for the
Fig. 15.8 Distribution of a simulated health measure throughout the population. The range is the difference between highest and lowest (B–A), the rate ratio can be described by B/A and the absolute difference with an average “national” comparator can be described as D–C. Source: Eastern Region Public Health Observatory. http://www.erpho.org.uk/Download/Public/6949/1/Measuring%20and%20monitoring%20health%20inequalities%20locally%20version%202.doc. Accessed February 2008
health measure of interest, and the dispersion of health care is handled in a univariate manner. Health gap analysis can also be reported for a health measure categorised into groups in conjunction with a secondary variable, i.e. handled in a bivariate manner. Under these conditions, it is typical to report the ratio of, or difference between, the rates of each group. Bivariate handling of data can also be used to measure differences among groups by means of multilevel analysis, where the unit of observation is the person but group-level variables are included [61]. A degree of caution needs to be exercised when interpreting health gap analysis. The absolute difference may vary even when the relative difference is constant. This can be illustrated by four trusts (W, X, Y and Z) that report their readmission rates following total hip replacement. For illustration purposes, the readmission rates are as follows: Trust W 10%, Trust X 20%, Trust Y 2% and Trust Z 4%. The relative differences between Trusts W and X, and between Trusts Y and Z, are both 2 (20/10 and 4/2). However, in absolute terms, Trust Z has only a 2% higher readmission rate than Trust Y, whereas Trust X has a 10% higher rate than Trust W. In addition, situations can arise whereby, as the absolute difference decreases (such as when the frequency of the health outcome is low), the relative difference increases. Consider the same four trusts reporting their mortality rates following coronary artery bypass surgery: Trust W 0.5%, Trust X 0.25%, Trust Y 2% and Trust Z 3%. Comparing Trust Y with Trust Z, there is an absolute difference of 1% but a relative difference of 1.5. Trusts W and X, however, have an absolute difference of only 0.25% but a relative difference of 2. Authors have therefore suggested that it might be preferable to report both absolute and relative differences [60].
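The distinction between absolute and relative gaps is easy to operationalise. The sketch below (plain Python, rates expressed as percentages) reproduces the worked trust examples from the text; the function name is ours, chosen for illustration.

```python
def health_gap(rate_a, rate_b):
    """Return the absolute difference and relative rate ratio between two rates (%)."""
    absolute = rate_b - rate_a
    relative = rate_b / rate_a
    return absolute, relative

# Readmission rates after total hip replacement (worked example above)
print(health_gap(10, 20))  # Trust W vs. X -> (10, 2.0)
print(health_gap(2, 4))    # Trust Y vs. Z -> (2, 2.0): same relative gap, smaller absolute gap

# Mortality after coronary artery bypass surgery
print(health_gap(0.25, 0.5))  # Trust X vs. W -> (0.25, 2.0): tiny absolute, large relative gap
print(health_gap(2, 3))       # Trust Y vs. Z -> (1, 1.5): larger absolute, smaller relative gap
```

Reporting both numbers, as the text recommends, avoids the misleading impression either gives alone.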
E. Mayer and J. Flowers
15.4.2 Share-Based Measures

Share-based measures are summary measures of the whole distribution of a resource, i.e. they treat the unit of interest as a continuous variable. They compare the cumulative proportion of the resource with the cumulative population among which that resource is shared. As for health gap analysis, share-based measures can either be univariate or linked to a secondary variable, such as socio-economic deprivation, to make them bivariate.

15.4.2.1 Lorenz Curve and Gini Coefficient

The Lorenz curve is derived by plotting cumulative health share against cumulative population share (Fig. 15.9), where the values are ranked in order of rate; it is, therefore, a univariate measure of inequality in a distribution of health. If resources were equally distributed throughout the population, the bottom 20% of the population would have 20% of the resource, 40% of the population 40% of the resource, and so on. This is represented by the diagonal line on the Lorenz curve. Unequal distributions produce a curve: the nearer the curve is to the diagonal, the greater the degree of equality; the further away, the greater the inequality. If we are plotting, say, deaths against population, the slope of the Lorenz curve is the death rate. The Gini coefficient, denoted “G”, is a numerical summary of the Lorenz curve. Looking at Fig. 15.9, the coefficient is calculated as A/(A + B). The resulting value lies between 0 and 1, where 0 represents perfect equality and 1 perfect inequality.

Fig. 15.9 Lorenz curve generated by plotting cumulative health measure (y-axis) against cumulative population (x-axis). Gini coefficient derived with formula A/(A + B)

15.4.2.2 Concentration Curve and Concentration Index

The concentration curve is a variation on the Lorenz curve and likewise treats the unit of interest as a continuous variable. It plots cumulative health share against cumulative population, and the values are ranked by an external variable instead of in order of rate as for the Lorenz curve. The external variable is usually, but not necessarily, decreasing deprivation or socio-economic status. This method of assessment therefore considers the measure of dispersion in a bivariate fashion. The concentration index (C), calculated in the same way as the Gini coefficient, can take values between −1 and +1. This index summarises the socio-economic (SE) (or whichever external variable is chosen) gradient in the health measure of interest. A value of −1 indicates that all the health/ill-health is concentrated in the worst off, +1 shows an inverse SE gradient and 0 shows no SE gradient. A negative C value corresponds to the curve sitting above the diagonal, and vice versa. The further the curve is from the diagonal, the greater the degree of health inequality.
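For readers who want to compute these summary measures, a minimal sketch follows. The functions are illustrative implementations of the standard relative-mean-difference formula (equivalent to A/(A + B) in Fig. 15.9), not code from the chapter; the only difference between the two indices is whether the values are ranked by themselves or by an external variable.

```python
def gini(values):
    """Gini coefficient of a distribution: 0 = perfect equality, 1 = perfect inequality."""
    x = sorted(values)                     # Lorenz ordering: rank by the measure itself
    n, total = len(x), sum(x)
    return sum((2 * i - n - 1) * xi for i, xi in enumerate(x, start=1)) / (n * total)

def concentration_index(values, ranking_var):
    """Concentration index: the same formula, but units are ranked by an external
    variable (e.g. increasing socio-economic status); range is -1 to +1, with
    negative values meaning ill-health concentrated among the worst off."""
    x = [v for _, v in sorted(zip(ranking_var, values))]
    n, total = len(x), sum(x)
    return sum((2 * i - n - 1) * xi for i, xi in enumerate(x, start=1)) / (n * total)

print(gini([1, 1, 1, 1]))  # 0.0: resource equally shared
print(gini([0, 0, 0, 1]))  # 0.75: all resource held by one of four units
```

With illness counts concentrated in the most deprived units, `concentration_index` returns a negative value, matching the sign convention described above.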
15.4.2.3 Slope and Relative Index of Inequality (SII and RII)

The slope index of inequality (SII) has been used to summarise SE gradients and is calculated from a regression line drawn through a health measure stratified by a measure of socio-economic status, e.g. social class. It is mathematically related to the concentration index described above. The SII differs from the concentration curve in its construction because it plots a measure of health (e.g. mortality rate) of defined population groups (e.g. primary care trusts) against the relative ranking of that group according to a deprivation indicator (the relative rank is calculated as a value between 0 and 1). A regression line is then fitted and used to define the absolute difference (health gap) in the chosen health measure across all groups (the estimated SII). A relative gap (RII) can be calculated by dividing the absolute health gap (the SII) by the average level of health across all groups, and is usually expressed as a percentage [62].
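A numerical sketch of the SII/RII construction, using hypothetical group data; `np.polyfit` stands in for the population-weighted regression a full analysis would use (the groups here are equal-sized, so the unweighted fit coincides).

```python
import numpy as np

# Hypothetical: five equal-sized groups ordered from most to least deprived
pop_share = np.array([0.2, 0.2, 0.2, 0.2, 0.2])
rate = np.array([18.0, 16.0, 14.0, 12.0, 10.0])  # e.g. mortality per 1,000

# Relative rank: cumulative population share at each group's midpoint (0-1)
cum = np.cumsum(pop_share)
rel_rank = cum - pop_share / 2               # [0.1, 0.3, 0.5, 0.7, 0.9]

slope, intercept = np.polyfit(rel_rank, rate, 1)
sii = slope                                  # absolute gap across the deprivation scale
rii = sii / rate.mean()                      # relative gap, often quoted as a percentage

print(sii)  # ≈ -10.0: the rate falls by 10 per 1,000 from most to least deprived
print(rii)  # ≈ -0.71, i.e. about -71% of the mean rate
```

The sign convention follows the ordering of the groups: ranking from most to least deprived, a negative SII means higher rates among the more deprived.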
population subgroup) is already incorporated into the denominator (the overall population). This has been described as overlap and means that two independent quantities are not being compared [63]. As described by Hayes and Berry, ignoring this overlap between the subgroup and overall population can have profound effects on results “… the significance tests are conservative and correspondingly the confidence intervals are too wide. If the result is statistically significant ignoring overlap, it is even more significant after allowing for the overlap, whilst a non-significant result could become significant at a specified significance level”. Adjusting for overlap is relatively simple if a complete dataset is available, by comparing the subgroup with the remainder of the population. If a complete dataset is not available, a correction factor can be used to adjust for the overlap effect. Under certain circumstances (the result is already significant or the subgroup is 30% of patients. Finally, in a cost-analysis model, PDT was shown to be more expensive than esophagectomy. Consequently, there is little to support the routine use of PDT for patients with dysplastic Barrett’s or adenocarcinoma.
63.3 Techniques Within Specialty

63.3.1 Obesity Surgery Techniques

Obesity has been dubbed a “global epidemic” by the World Health Organisation. Estimates suggest that more than 12 million adults and 1 million children in England will be obese by 2010 if no action is taken [24]. A report by the UK Government Chief Scientist estimates that by the year 2050, 60% of men and 40% of women will be clinically obese, at a cost of £45.5 billion per year [25]. Obese patients have an increased incidence of cardiovascular disorders, including hypertension, and of Type II diabetes, together with a variety of respiratory complications including obstructive sleep apnoea (OSA), symptoms of dyspnoea, obesity hypoventilation syndrome and bronchial asthma. Obesity is also associated with an increased risk of gallstones, osteoarthritis and cancers of the breast and colon [26].
63.3.1.1 Obesity and All-Cause Mortality

There is conflicting evidence regarding indices of obesity and the prediction of all-cause mortality. Price et al. published the findings from a UK cohort study of 14,833 patients aged 75 years or older, with a median follow-up of 5.9 years. No association was seen between baseline waist circumference and either all-cause or cardiovascular mortality. In nonsmokers (90% of the cohort), there was a negative association between body-mass index (BMI) and all-cause mortality risk in both men (P = 0.0001) and women (P < 0.0001), even after adjustment for potential confounders. In the same cohort, waist–hip ratio (WHR) was positively but weakly associated with mortality risk in men (P = 0.579), whereas the association was significantly positive in women (P < 0.0001). BMI was not associated with circulatory mortality in men (P = 0.667) and was negatively associated in women (P = 0.004). WHR was positively related to circulatory mortality in both men (P = 0.001) and women (P = 0.005) [27].
63.3.1.2 Surgical Treatment of Obesity

Current guidelines from the National Institute for Clinical Excellence recommend bariatric surgery as a
treatment only for adults with a BMI greater than 40 kg/m2, or a BMI between 35 and 40 kg/m2 with coexisting significant disease such as type II diabetes, hypertension and OSA. Patients must also have attempted all appropriate medical treatment of their obesity for at least 6 months and failed to achieve or maintain adequate, clinically beneficial weight loss. The consensus view is that surgery should only be offered in the context of a multidisciplinary team, including dieticians, psychologists, surgeons and allied health professionals, that can provide special expertise in treating obese patients [28]. In the United States, the estimated number of patients discharged from inpatient facilities with a diagnosis code of morbid obesity who underwent a bariatric procedure increased from approximately 5,000 per year in the early 1990s to more than 100,000 per year in 2005 [29]. Bariatric operations can be classified as restrictive, malabsorptive, or a combination of the two approaches.
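The BMI eligibility thresholds described above amount to a simple decision rule. The sketch below is a deliberately simplified illustration (the function names are ours): it encodes only the BMI criterion, whereas the full guideline also requires the failed-medical-management and multidisciplinary-team conditions just described.

```python
def bmi(weight_kg, height_m):
    """Body-mass index: weight (kg) divided by height (m) squared."""
    return weight_kg / height_m ** 2

def meets_bmi_criterion(bmi_value, has_significant_comorbidity):
    """Simplified sketch of the NICE BMI threshold only: surgery is considered
    for BMI > 40 kg/m2, or BMI 35-40 kg/m2 with significant comorbidity
    (e.g. type II diabetes, hypertension, OSA). Not a complete eligibility check."""
    return bmi_value > 40 or (35 <= bmi_value <= 40 and has_significant_comorbidity)

print(round(bmi(130, 1.75), 1))          # 42.4
print(meets_bmi_criterion(42.4, False))  # True
print(meets_bmi_criterion(36.0, False))  # False: the 35-40 band needs comorbidity
print(meets_bmi_criterion(36.0, True))   # True
```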
Restrictive Procedures

Restrictive operations reduce the storage capacity of the stomach, creating a small reservoir with a narrow outlet to delay gastric emptying. This limits caloric intake by causing early satiety. Purely restrictive surgery does not alter the absorption of nutrients.

Vertical Banded Gastroplasty

During a vertical banded gastroplasty (VBG), originally described by Mason in 1982, the fundus of the stomach is stapled parallel to the lesser curve and the outlet of the created pouch is narrowed with a 5 cm band. A gastric reservoir of about 50 mL remains, and the banding provides an exit diameter of 10–12 mm [30]. Excess weight loss (EWL) exceeded 50% in 74% of patients in one trial [31], although in most studies with 3 to 5-year follow-up, EWL of at least 50% has been achieved in only 40% of patients [32]. Mortality is 50% weight loss [74, 75]. Long-term results are good, with an average EWL of 60% at 5 years, decreasing to around 50% at 8–10 years [76]. If RYGB fails, revisional surgery is complex and carries significant risk. A summary of the malabsorptive obesity surgery procedures is shown in Fig. 63.4.
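The %EWL figures quoted throughout this section are conventionally calculated as weight lost divided by excess weight (actual minus ideal body weight). A small illustrative sketch, with hypothetical numbers and an ideal-weight figure chosen purely for illustration:

```python
def percent_ewl(initial_kg, current_kg, ideal_kg):
    """Percentage excess weight loss: weight lost as a share of initial excess weight."""
    return 100 * (initial_kg - current_kg) / (initial_kg - ideal_kg)

# Illustrative only: a 140 kg patient with an ideal weight of 70 kg who
# reaches 105 kg has lost half of the 70 kg of excess weight
print(percent_ewl(140, 105, 70))  # 50.0
```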
63.3.2 Debates

63.3.2.1 Extent of Lymphadenectomy Needed for Cancer Clearance

In cancer of the esophagus and GEJ, controversy exists over which type of operation to perform. An overview of the clinical approach to gastric cancer is presented
in Fig. 63.5. Some authors argue that the presence of lymph node involvement equals systemic disease and that, in the face of such a poor prognostic factor, survival will remain unchanged whatever the extent of resection, so that systematic removal of lymph nodes is of no benefit. Others believe that the natural course of the disease can be influenced positively by more aggressive surgery, even in patients with positive lymph nodes. Mainly under Japanese influence, much attention has been paid to meticulous extensive lymphadenectomy in the management of esophageal cancer [77]. Unfortunately, in many studies the extent of esophagectomy and the type of lymphadenectomy are poorly defined. Agreement on an anatomical classification of the different extents of lymphatic dissection for cancer of the esophagus has been reached only recently. More extensive surgery certainly improves the quality of the TNM staging system as a prognostic index, as it provides a better assessment of true survival in lymph node-negative patients compared with more standard resections “contaminated” by false node-negative patients. The lymph node groups draining the esophagus and stomach are displayed in Fig. 63.6. Published data indicate that R0 resection combined with extensive lymphadenectomy results in improved (loco-regional) disease-free survival as well as
improved 5-year survival, although most data come from nonrandomized studies. Multimodality treatment regimens have to be adapted according to the highest quality standards and compared with the results of the best primary surgical therapy, including assessment of quality of life. Recent data analysis indicates that the number of involved lymph nodes removed at surgical resection is an independent predictor of 5-year survival after esophagectomy for cancer [78].
63.3.2.2 Minimal Access Esophagectomy

In the late 1990s, several centers began exploring the potential for a minimally invasive esophagectomy. Techniques have now been developed for both a laparoscopic and a combined thoracoscopic/laparoscopic esophagectomy. Disadvantages of a completely laparoscopic approach include the inherent dangers of dissection near the pulmonary vessels high in the mediastinum and the inability to accomplish a systematic thoracic lymphadenectomy with this approach. However, the vagal-sparing procedure is ideally suited to a laparoscopic approach because the esophagus is stripped out of the mediastinum without any dissection, and no lymphadenectomy is necessary in these patients with only high-grade dysplasia (HGD) or intramucosal cancer. For patients with more advanced cancer, the combined thoracoscopic/laparoscopic approach offers the advantage of a thoracic lymphadenectomy and has been proven safe and effective in a large series of patients [79, 80]. Whether minimally invasive esophagectomy offers clear advantages in hospital stay and recovery, with oncological outcomes equivalent to open procedures, is yet to be determined.
63.4 Recent Advances in Choice of Management Within Specialty

63.4.1 Gastro-Esophageal Reflux Disease (GERD) and Achalasia

The immense success of laparoscopic surgery as an effective treatment for GERD and achalasia has established minimally invasive surgery as the gold standard in the surgical treatment of these two conditions [81, 82]. Compared with open surgery, laparoscopic procedures result in lower
D. Yakoub et al.
morbidity and mortality, shorter hospital stay, faster convalescence and less postoperative pain.
63.4.1.1 GERD

Medication vs. Operation

Antireflux surgery has been shown to be very effective at alleviating symptoms in 88–95% of patients, with excellent patient satisfaction, in both short- and long-term studies. In addition to symptomatic improvement, the long-term effectiveness of laparoscopic antireflux surgery (LARS) has been objectively confirmed with 24-h pH monitoring. Overall, LARS is safe and has an efficacy similar to that of open antireflux surgery and of best medical therapy with PPIs. The failure rate in some series, however, is greater than 50% at 5 years. Because a proportion of patients continue to take antireflux medications, with the associated cost, LARS cannot be recommended over best medical therapy on cost-effectiveness grounds. There is a tendency for antireflux surgery to be superior to medical therapy for cancer prevention in Barrett’s esophagus, but this has not reached statistical significance [83–85]. Surgical therapy should nevertheless be considered for patients with Barrett’s esophagus, especially those who are young or symptomatic.
Total vs. Partial Fundoplication

Further insight into esophageal motility disorders and concerns about postoperative dysphagia have prompted discussion of whether total or partial fundoplication is the more appropriate treatment for GERD. Supporters of the total wrap acknowledge that the wrap needs to be “floppy” to minimize postoperative dysphagia. Two partial fundoplications are in common practice: the Dor (anterior) and Toupet (posterior) fundoplications. Of these two, the Toupet is the more commonly performed. Fibbe et al. compared laparoscopic partial and total fundoplications in 200 patients with esophageal motility disorders and found no difference in postoperative recurrence of reflux [86]. Similarly, Laws et al. found no clear advantage of one wrap over the other in their prospective, randomized study comparing these two groups [87]. Moreover, in a meta-analysis of 11 prospective randomized trials including open and laparoscopic total vs. partial fundoplications in 991 patients, no statistically significant differences
63
Upper Gastrointestinal Surgery: Current Trends and Recent Innovations
were found in recurrence of reflux or long-term outcomes, although a higher incidence of postoperative dysphagia occurred after total fundoplication [88].
63.4.1.2 Achalasia

Current controversies in the treatment of achalasia concern the allocation of patients to medical treatment, pneumatic dilation or surgical treatment. All of these therapies are designed to alleviate the symptoms of achalasia by addressing the failure of the LES to relax upon swallowing. However, none of these modalities addresses the underlying aperistalsis of the esophagus.
Pneumatic Dilation

Under direct endoscopic visualization, a balloon is placed across the LES and inflated up to 300 mmHg (10–12 psi) for 1–3 min. The goal is to produce a controlled tear of the LES muscle to render it incompetent, thereby relieving the obstruction. After pneumatic dilation, good to excellent response rates can be achieved in 60% of patients at 1 year. Long-term results are less satisfactory: at 5 years only 40% note symptom relief, and at 10 years 36% of patients have relief of dysphagia. Repeat dilatations are technically possible but show a decreased success rate. The rate of esophageal perforation during pneumatic dilation is approximately 2%. Current recommendations regarding pneumatic dilation prior to (or instead of) surgical therapy depend upon the referring physician’s opinion, the surgeon’s expertise and the patient’s preference.
Pharmacotherapy

Calcium channel blockers and nitrates have been widely used, with sublingual application being preferred. Both medications effectively reduce baseline LES tone, but they neither improve LES relaxation nor augment esophageal peristalsis. Current recommendations limit pharmacotherapy to patients who are early in their disease process without significant esophageal dilatation, patients at high risk from more invasive modalities, and patients who decline other treatment options.
Botulinum Toxin

Direct injection of the LES with botulinum toxin A (Botox, Allergan, Irvine, California) irreversibly inhibits acetylcholine release from cholinergic nerves. This results in decreased sphincter activity and a temporary amelioration of achalasia-related symptoms. Following Botox injection, about 60–85% of patients suffering from achalasia will note an improvement in their symptoms. While repeated injections are technically possible, reduced efficacy is seen with each subsequent injection. Repeated esophageal instrumentation with Botox treatment inevitably causes an inflammatory reaction at the GEJ. The resulting fibrosis may make subsequent surgical treatment more difficult and increase the complication rate. There is recent evidence that laparoscopic Heller myotomy after prior Botox therapy is as successful as myotomy in Botox-naïve patients.
Surgical Therapy

Heller first described surgical destruction of the gastroesophageal sphincter as therapy for achalasia in 1913. His original technique used two parallel myotomies extending for at least 8 cm on the distal esophagus and proximal stomach. Since the inception of minimally invasive techniques, both the open transabdominal repair and the thoracic approach have fallen out of favor. At this time, the laparoscopic approach is considered superior to the above-mentioned surgical therapies. Laparoscopic Heller myotomy is noted to have a very low morbidity and mortality rate, less postoperative pain, improved cosmesis and a faster recovery compared with more invasive surgical therapies. Most importantly, excellent symptom relief is noted in 90% of patients. The current indications for laparoscopic surgical treatment of achalasia include patients less than 40 years of age, patients with persistent or recurrent dysphagia after unsuccessful Botox injection or pneumatic dilation, and patients who are at high risk of perforation from dilation, including those with esophageal diverticula or distorted lower esophageal anatomy. Currently, two controversies concerning achalasia are discussed within the surgical community: the extent to which the myotomy is extended onto the stomach, and whether to add an antireflux procedure. The proximal extent of the myotomy is typically carried 5–6 cm proximal to the LES. The distal extent of the
Table 63.2 Selected trials comparing neoadjuvant chemotherapy with or without radiation and surgery vs. surgery alone in patients with localized esophageal cancer. Median survival and survival figures are given as surgery alone vs. combined-modality arm.

Kelsen (1998). Surgery alone vs. pre-op chemotherapy with surgery. 440 patients. Cisplatin, fluorouracil. Radiotherapy dose: –. Median survival 16.1 vs. 14.9 months. Survival 37 vs. 35% (2-year).

MRC Trial (2002). Surgery alone vs. pre-op chemotherapy with surgery. 802 patients. Cisplatin, fluorouracil. Radiotherapy dose: –. Median survival 13.3 vs. 16.8* months. Survival 34 vs. 43%* (2-year).

Cunningham (2006). Surgery alone vs. pre-op chemotherapy, surgery, and postoperative chemotherapy. 503 patients. Epirubicin, cisplatin, fluorouracil. Radiotherapy dose: –. Median survival NA. Survival 23 vs. 36%* (5-year).

Urba (2001). Surgery alone vs. pre-op chemotherapy and radiotherapy with surgery. 100 patients. Cisplatin, vinblastine, fluorouracil. Radiotherapy dose: 4,500 cGy. Median survival 17.6 vs. 16.9 months. Survival 16 vs. 30% (3-year).

Walsh (1996). Surgery alone vs. pre-op chemotherapy and radiotherapy with surgery. 113 patients. Cisplatin, fluorouracil. Radiotherapy dose: 4,000 cGy. Median survival 11 vs. 16 months. Survival 6* vs. 32% (3-year).

Burmeister (2005). Surgery alone vs. pre-op chemotherapy and radiotherapy with surgery. 256 patients. Cisplatin, fluorouracil. Radiotherapy dose: 3,500 cGy. Median survival 22.2 vs. 19.3 months. Survival NA.

MRC Medical Research Council; NA not available; * P < 0.05
myotomy to either 1–2 cm or to at least 3 cm beyond the LES is debated. Oelschlager et al. found less dysphagia, and subsequently fewer interventions to treat recurrent dysphagia, in the extended distal myotomy group (3 vs. 17%) [89]. The consensus opinion is that a surgical myotomy predisposes patients to postoperative GERD; while many surgeons add an antireflux procedure to address this potential problem, others report adequate cardiomyotomy without postoperative reflux. Performing a fundoplication prolongs the operative time and may cause problems with dysphagia in the setting of an aperistaltic esophagus. Proponents of adding an antireflux procedure mostly favor a partial (anterior Dor or posterior Toupet) over a total (Nissen) fundoplication, to avoid a functional obstruction from the
fundoplication and the persistence of dysphagia. There is a lack of prospective randomized studies comparing anterior vs. posterior procedures to demonstrate any clear advantages of one partial fundoplication over the other.
63.4.2 Barrett’s Esophagus and Esophageal Cancer

63.4.2.1 Combined Modality Adjuvant Treatment

Randomized studies have demonstrated that preoperative chemotherapy improves survival in esophagogastric
adenocarcinoma [90]. Various comparative studies have sought evidence on the most effective combination of therapeutic modalities in the treatment of esophageal cancer. Selected trials comparing these modalities are displayed in Table 63.2.
63.4.2.2 Palliative Management of Esophageal Cancer

Esophageal stents have been used successfully to relieve dysphagia. A recent meta-analysis showed self-expanding metal stents to be superior to plastic stents in terms of related mortality, morbidity and quality of palliation [91]. Uncovered stents are disadvantaged by a high rate of tumor in-growth; further adequately designed randomized controlled trials are required to examine the outcomes and cost-effectiveness of covered vs. uncovered metal stents.
63.4.2.3 Targeted Therapy

Novel molecularly targeted anticancer agents include antiproliferative, apoptosis-inducing, antiangiogenic and anti-invasive agents. Although relatively few targets are being investigated or pursued clinically in esophageal cancer, there is a good rationale for doing so based on current knowledge of the molecular abnormalities in this disease. Among the candidates for targeted therapy in esophageal cancer are oncogenes such as EGFR, c-erbB2 and cyclin D1, as well as tumor suppressor genes such as Rb, p53 and p16. An ongoing phase II study is investigating the EGFR tyrosine kinase inhibitor gefitinib as second-line therapy in esophageal cancer, with a disease control rate of 37% in the 27 patients recruited and mild adverse effects reported [92].
63.4.2.4 Gene Therapy

A gene-based approach investigated in esophageal cancer involves the agent TNFerade, a second-generation replication-defective adenoviral vector carrying the transgene encoding human tumor necrosis factor (TNF) together with a radiation-inducible promoter. Preclinical experiments have demonstrated that TNFerade gene therapy plus radiation can induce tumor regression in models of esophageal cancer and other solid tumors.
A completed phase I study has shown the absence of dose-limiting toxicities of TNFerade plus radiation, no drug-related serious adverse events and a remarkably high tumor response rate in patients with advanced solid tumors. The agent is currently being investigated in a phase II study in patients with advanced esophageal cancer in combination with conventional chemotherapy and radiation [93].
63.5 Molecular and Biological Developments Within Specialty

63.5.1 Progression of Barrett’s Disease and Adenocarcinoma of the Esophagus

The majority of Barrett’s patients will not develop cancer, so specific methods to identify high-risk groups are required. The risk of cancer development in Barrett’s esophagus has been confirmed to be related to risk factors for GERD. These risk factors, however, are often asymptomatic and therefore not helpful in the individualization of surveillance intervals. Recent molecular studies have identified a selection of candidate biomarkers that need validation in prospective studies. They reflect various changes in cell behavior during neoplastic progression. Barrett’s epithelium is characterized by the presence of goblet cells and by expression of intestinal markers such as MUC2, alkaline phosphatase, villin and sucrase-isomaltase. Transcription factors such as CDX1 and CDX2, which play an important role in the development of intestinal epithelium in utero, may also be important in the development of metaplastic epithelium in the esophagus. The presence of CDX2 protein and mRNA has been shown in esophageal adenocarcinoma, in Barrett’s metaplasia cells of the intestinal type with goblet cell-specific MUC2 mRNA, and also in the squamous epithelium of a proportion of patients with Barrett’s metaplasia. Molecular changes in the metaplasia–dysplasia–adenocarcinoma sequence are driven by genomic instability and the evolution of clones of cells with accumulated genetic errors that carry a selection advantage and allow successive clonal expansion. Some chromosomal aberrations, including aneuploidy and loss of heterozygosity (LOH), genetic
mutations and epigenetic abnormalities of tumor suppressor genes have been identified [94–98]. Cell cycle regulatory genes known to be implicated in esophageal adenocarcinoma development include p16, p53 and cyclin D1. p16 lesions are very common in Barrett’s metaplasia and include LOH, mutation and methylation of the promoter. Histochemically assessed cyclin D1 overexpression has been documented in Barrett’s esophagus and adenocarcinoma. One prospective study has shown that Barrett’s metaplasia patients with cyclin D1 overexpression were at increased risk of cancer development compared to patients with normal expression. Hyperproliferation has been consistently observed in Barrett’s metaplasia by many assays including immunohistochemistry staining for division markers such as proliferating cell nuclear antigen (PCNA) and Ki67, and flow cytometry for DNA content, but there are no advanced phase studies showing that proliferation indices have any predictive value for cancer progression [99–102]. p53 lesions occur frequently in esophageal adenocarcinomas (85–95%) and almost never in normal tissue from the same patients; their prevalence increases with advancing histologic grade of dysplasia, which makes them appropriate candidates for further studies [103, 104]. Reid et al. evaluated 17p (p53) LOH in a large phase 4 study with prospective observation of 256 patients and esophageal adenocarcinoma as the primary end point. In this study, 17p (p53) LOH was a strong and significant predictor of progression to esophageal adenocarcinoma [105].
D. Yakoub et al.

63.5.1.1 Invasion

E-cadherin

Cadherins are a family of calcium-dependent cell–cell adhesion molecules essential to the maintenance of intercellular connections, cell polarity and cellular differentiation. Germline mutation of the E-cadherin gene (CDH1) causes familial gastric cancer, and loss of E-cadherin expression is associated with many nonfamilial human cancers, including esophageal adenocarcinoma. The expression of E-cadherin is significantly lower in Barrett's esophagus than in normal esophageal epithelium, and its expression falls further as the metaplasia–dysplasia–adenocarcinoma sequence progresses. These findings suggest that E-cadherin may serve as a tumor suppressor early in the process of carcinogenesis [106–108].
COX-2

Cyclooxygenase-2 (COX-2) is constitutively expressed in the kidney and brain; in other tissues its expression is inducible and rises during inflammation, wound healing and neoplastic growth in response to interleukins, cytokines, hormones, growth factors and tumor promoters. COX-2 and its derived prostaglandin E2 (PGE2) appear to be implicated in carcinogenesis because they prolong the survival of abnormal cells: they reduce apoptosis and cell adhesion, increase cell proliferation, promote angiogenesis and invasion, and make cancer cells resistant to the host immune response. COX-2 is expressed in the normal esophagus, but its expression is significantly increased in Barrett's esophagus and increased further in high-grade dysplasia (HGD) and esophageal adenocarcinoma. Some authors suggest that COX-2 expression might be of prognostic value in esophageal adenocarcinoma: a study of COX-2 immunoreactivity in cancer tissues showed that patients with high COX-2 expression were more likely to develop distant metastases and local recurrence and had significantly reduced survival rates compared with those with low expression.
63.5.1.2 Dysplasia

The diagnosis of Barrett's metaplasia warrants regular endoscopic surveillance for dysplasia, with four-quadrant biopsies at 2-cm intervals in the affected esophageal portion. HGD in Barrett's esophagus is an indication for esophagectomy or endoscopic therapy. The prognostic significance of dysplasia in Barrett's esophagus has been studied extensively for several decades, with inconsistent results. Dysplasia alone is not an entirely reliable biomarker, and more specific predictors of progression need to be included in surveillance recommendations to increase their efficacy and cost-effectiveness [109, 110].
63 Upper Gastrointestinal Surgery: Current Trends and Recent Innovations

63.5.2 ASPECT Study

The largest prospective intervention study in patients with Barrett's metaplasia was started in April 2004 in the UK. It is organized by the National Institute for Cancer Research and sponsored by Cancer Research UK. ASPECT, a phase IIIb randomized study of aspirin and esomeprazole chemoprevention in Barrett's metaplasia, is a national, multicentre, randomized controlled trial of low- or high-dose esomeprazole with or without low-dose aspirin for 8 years [111]. Its primary aim is to determine whether intervention with aspirin decreases mortality or the rate of conversion from Barrett's metaplasia to adenocarcinoma or HGD, and whether high-dose PPI treatment can decrease the cancer risk further. Biomarker studies in this project include characterization and validation of p16, TP53 (gene expression, LOH and mutation analysis) and aneuploidy/ploidy changes (by flow cytometry), currently the best characterized biomarkers for prediction of progression. Other changes in protein expression will also be studied, including CDX2, COX-2, protein kinase C-ε (PKCε) and minichromosome maintenance protein 2 (Mcm2) expression. The aim is to address whether these markers are reproducible and whether they are clinically informative (in terms of sensitivity and specificity) at 2 and 4 years. In addition, microarray analysis is under way, as well as identification of novel DNA SNP (single nucleotide polymorphism) signatures and target regions for expression analysis and investigation of clonality.
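The sensitivity and specificity referred to above are computed from a 2×2 table of biomarker status against progression status. As a minimal illustrative sketch (the counts below are hypothetical, not trial data):

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity and specificity from 2x2 confusion-table counts."""
    sensitivity = tp / (tp + fn)  # proportion of progressors who were marker-positive
    specificity = tn / (tn + fp)  # proportion of non-progressors who were marker-negative
    return sensitivity, specificity

# Hypothetical cohort: 40 progressors (28 marker-positive),
# 200 non-progressors (170 marker-negative).
sens, spec = sensitivity_specificity(tp=28, fn=12, tn=170, fp=30)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

A marker with high specificity is what surveillance programs need most, since false positives commit low-risk patients to intensive follow-up.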
63.5.3 Tumor Markers in Esophageal Cancer Staging

Studies investigating tumor suppressor genes, cell adhesion molecules and apoptosis-related genes have yielded relatively disappointing results in terms of identifying markers useful for daily practice. Available data indicate that progression from Barrett's esophagus to HGD and adenocarcinoma is a continuum rather than a distinct stepwise process and that no single marker reliably discriminates lesions that will progress.
63.5.4 Diagnostic Laparoscopy/Thoracoscopy

The increasing number of treatment options with a curative or a palliative intent has increased the importance of proper patient selection for a specific treatment.
Diagnostic laparoscopy has been claimed to be superior to all noninvasive imaging modalities in the detection of liver metastases, intra-abdominal lymph node metastases and peritoneal tumor spread, and diagnostic gains in the range of 10–50% have been reported when diagnostic laparoscopy is added to the staging of cancer of the gastroesophageal junction (GEJ). This potential diagnostic gain needs to be balanced against the fact that diagnostic laparoscopy is costly, time-consuming and associated with a risk of serious complications.
63.5.5 Detection of Micrometastases in Bone Marrow

Cytokeratin 18-positive micrometastases have been found in the bone marrow of 80–90% of patients undergoing curative resection of node-positive or node-negative esophageal adenocarcinoma or squamous cell carcinoma (SCC) [112, 113]. These results suggest that hematogenous dissemination occurs independently of lymphatic spread and that a node-negative status does not preclude metastatic spread. In patients receiving neoadjuvant chemoradiotherapy, the detection rate of metastatic cells in bone marrow was less than 40%, but after marrow culture, viable cytokeratin-positive cells were detectable in a further 30% of patients [114]. Although some patients achieve a pathological complete response of the primary tumor after neoadjuvant therapy, micrometastases have been found in the bone marrow of the same patients, implying resistance of metastatic cells to the chemotherapeutic agents used.
63.6 Imaging and Diagnostics

63.6.1 Upper Gastrointestinal Barium Examination (UGI)

Endoscopy and the UGI barium examination are the primary modalities for the detection of gastric cancer. The double-contrast examination is the single best radiological technique for the detection of early gastric cancer; a single-contrast examination alone has an overall sensitivity of only 75% in diagnosing gastric cancer. Any lesion with a mixed pattern is not unequivocally benign and warrants biopsy.
63.6.2 Staging Modalities and Prognostic Indicators

Once the diagnosis of gastric cancer is established, further studies are directed at staging to assist with therapeutic decisions. Endoscopic ultrasound (EUS) and computed tomography (CT) are the current primary staging modalities for esophageal and gastric cancer.
63.6.2.1 Endoscopic Ultrasound

Endosonographic T staging is based on the number of visceral wall layers that are disrupted, as well as the preservation or destruction of sonographic interfaces between adjacent organs and vessels. N staging is based on the presence and location of perivisceral lymph nodes that fit certain criteria (diameter ≥10 mm, round shape, uniform hypoechoic structure, well-circumscribed margins) or that are found to harbor malignant cells by EUS-guided transvisceral fine needle aspiration (FNA). Because of its limited depth of penetration, endosonography is less useful for M staging; however, with low-frequency options on newer echoendoscopes, much of the liver can often be surveyed, and even sampled, from the stomach and duodenum. The accuracy of EUS for T staging in esophageal and gastric cancer is approximately 82%, with sensitivity and specificity of 70–100% and 87–100%, respectively. N-stage accuracy is approximately 70%, with sensitivity ranging from 69.9 to 100% and specificity from 87.5 to 100%. Addition of FNA of suspicious nodes increases the accuracy further, bringing specificity to 100%. In addition, EUS-guided FNA or Tru-Cut® (Baxter, Deerfield, IL) biopsy of the submucosa can provide a tissue diagnosis in the setting of linitis plastica. EUS can be particularly useful in early-stage esophageal and gastric cancer, where it can sometimes allow for endoscopic mucosal resection (EMR) [115].
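The layer-based logic of endosonographic T staging can be sketched as a simple lookup from the deepest wall layer disrupted to the T category. This is a deliberately simplified teaching sketch of the conventional depth criteria, not a clinical tool, and the function itself is hypothetical:

```python
# Illustrative simplification: deepest wall layer invaded -> T stage.
# The depth criteria are the conventional ones; the lookup is a teaching sketch.
T_STAGE_BY_DEEPEST_LAYER = {
    "mucosa": "T1",             # lamina propria / muscularis mucosae
    "submucosa": "T1",
    "muscularis_propria": "T2",
    "adventitia_or_serosa": "T3",
    "adjacent_structures": "T4",
}

def endosonographic_t_stage(deepest_layer: str) -> str:
    """Return the T stage for the deepest wall layer disrupted on EUS."""
    try:
        return T_STAGE_BY_DEEPEST_LAYER[deepest_layer]
    except KeyError:
        raise ValueError(f"unknown layer: {deepest_layer!r}")

print(endosonographic_t_stage("muscularis_propria"))  # → T2
```

In practice, of course, the difficulty lies not in the mapping but in resolving the layers sonographically, which is why reported T-stage accuracy sits near 82% rather than 100%.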
63.6.2.2 Computed Tomography

Computed tomography scanning provides critical information in treatment planning for esophageal and gastric cancer patients. Helical CT is the single best noninvasive means of detecting metastatic disease. Multidetector CT (MDCT) has enabled faster scanning with simultaneous acquisition of multiple thin slices. The use of thin-slice collimation decreases partial volume artifact, enabling more accurate assessment of the wall of a curved organ. The acquisition of volumetric high-resolution data made possible by MDCT also allows multiplanar reconstructions (MPR) to be performed on a workstation, depicting most segments of the gastric wall in the optimal orthogonal plane with enhanced anatomical detail. In 2005, Shimizu and colleagues published data demonstrating an 85% accuracy of CT for T staging, using MDCT with 1.25-mm multiplanar reconstructed images [116]. CT scanning is not helpful for N staging; the cutoff for normal nodal size is 8–10 mm, yet metastases are most commonly found in lymph nodes smaller than 8 mm.

63.6.2.3 Magnetic Resonance (MR)

Currently, MR is used primarily when CT iodinated contrast is contraindicated because of a significant contrast allergy or renal failure. MR may also be used to confirm the presence of equivocal liver masses seen on CT. Further developments in MR may include the use of endoluminal coils for better definition of the gastric wall layers and improved accuracy of T staging and regional nodal staging. Dynamic contrast-enhanced (DCE) MRI with low-molecular-weight gadolinium-DTPA is currently the most widespread and accurate means of imaging angiogenesis, and new methods will continue to be compared against this standard.
63.6.2.4 Positron Emission Tomography (PET)

Potential uses of PET in esophageal and gastric cancer patients are in staging, detecting recurrence, determining prognosis and measuring therapy response. The major advantages of PET are its substantially greater contrast resolution and the acquisition of functional data; PET can detect lymph node metastases before the nodes are enlarged on CT. Its limitations are lower sensitivity for small lesions and false-positive results from infectious or inflammatory processes. In addition, PET studies are relatively expensive compared with other imaging modalities. Combined PET/CT scanners have been introduced recently.

Although PET does not have a role in the primary detection of gastric carcinoma, the degree of uptake in a known gastric carcinoma has prognostic value. Moderately intense fluorodeoxyglucose (FDG) uptake in the gastric wall is a normal variant; despite this, the majority (60–96%) of primary gastric neoplasms are detected by PET. A greater degree of FDG uptake is associated with greater depth of invasion, tumor size and lymph node metastases, and the survival rate in patients with high FDG uptake is significantly lower than in patients with low FDG uptake. Although CT is more sensitive than PET for the detection of lymph node metastases in N1 and N2 disease, PET is more specific. PET may be more sensitive for the detection of non-nodal sites such as liver and lung metastases, but not for bone, peritoneal and pleural metastases. PET may also have value in the prediction of response to preoperative chemotherapy in esophagogastric cancer.
63.6.2.5 Prognostic Factors in Gastroesophageal Cancer

While many factors affect prognosis in gastric and esophageal cancer patients, the number of positive lymph nodes is the most consistent prognostic indicator. Five-year survival rates in patients with 1–6, 7–15 and more than 15 positive nodes are 43, 21 and 13%, respectively. A multicenter observational study of 477 gastric cancer patients demonstrated that the number of positive lymph nodes was of better prognostic value than the location of the involved nodal basin. Positive peritoneal cytology is also associated with decreased survival. Recent studies suggest that the sensitivity of peritoneal cytologic evaluation is enhanced by real-time reverse transcriptase-polymerase chain reaction (RT-PCR) amplifying carcinoembryonic antigen (CEA) [117]. Cytologic evaluation for malignant cells using this technique was positive in 10, 29, 66 and 81% of patients with T1, T2, T3 and T4 tumors, respectively, and positive tests were strongly correlated with eventual peritoneal carcinomatosis and survival. Detection of micrometastases of esophageal and gastric cancers may have a role in staging before neoadjuvant therapy [118].
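The node-count survival figures above can be expressed as a simple banding function. This is a toy sketch using only the percentages quoted in this section, not a validated prognostic model:

```python
def five_year_survival_pct(positive_nodes: int):
    """Reported 5-year survival (%) by positive lymph node count,
    per the bands quoted above: 1-6 -> 43%, 7-15 -> 21%, >15 -> 13%."""
    if positive_nodes < 1:
        return None  # node-negative patients fall outside these bands
    if positive_nodes <= 6:
        return 43
    if positive_nodes <= 15:
        return 21
    return 13

print(five_year_survival_pct(8))  # → 21
```

Real nomograms weight many covariates (age, sex, tumor location, T stage) rather than node count alone, but the banding illustrates why nodal burden dominates prognosis.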
63.7 Training Within Specialty

63.7.1 Surgical Skills Training

Upper gastrointestinal surgery has evolved enormously over the last 50 years with the introduction of techniques such as minimal access surgery, robot-assisted surgery and various extents of lymphadenectomy. The need for specialized training has largely been answered by the introduction of simulated operative environments. Sophisticated bench models have been designed to be positioned inside box trainers, as well as simulated patient models that can even be situated in a full "virtual theatre" environment. Validation of these methods has been achieved; recent studies have shown that simulation-based training translates into improved performance on real cases, with reduced operating time, fewer errors and decreased patient discomfort. The next generation of simulated environment is virtual reality (VR) simulation, which has the clear advantages of high-fidelity visual elements and the ability to simulate almost any operative environment. Progress has been made in improving the tactile feedback of these simulators to mimic the tissue resistance encountered when operating on live patients.

63.7.2 Hospital Volume/Outcome Relationship

The role of experience in improving surgical outcomes is well established [119, 120]. The impact of hospital volume on clinical and economic outcomes for esophagectomy has been reported [121, 122]. Hospitals were stratified into low-volume hospitals performing fewer than six esophagectomies a year and high-volume hospitals performing more than six. The study included 1,193 patients who underwent esophagectomy during an 8-year period. High-volume hospitals were associated with a 2-day decrease in median length of stay, a 3-day reduction in median intensive care unit stay, an increased rate of home discharges and a 6.7% absolute decrease in hospital mortality (9.2 vs. 2.5%) [121].

63.8 Future Directions in Research and Management Within Specialty

63.8.1 Future Directions in Drug Therapy

Future research in this area is evolving along three separate lines. The first examines the role of new chemotherapeutics, particularly oxaliplatin, irinotecan, and oral 5-FU "prodrugs" such as capecitabine and S-1, which have proven valuable in other gastrointestinal malignancies. The second examines drugs designed to inhibit the function of a particular molecular target critical to cancer cell growth, such as cetuximab, an epidermal growth factor receptor inhibitor, and bevacizumab, a vascular endothelial growth factor inhibitor, both given in conjunction with chemotherapy. Finally, there is a growing emphasis on clinical and molecular predictors of chemotherapeutic responsiveness in advanced gastric cancer. Such considerations are particularly important given the relatively modest benefit/toxicity ratios of present chemotherapy regimens for gastric cancer. Low tissue expression levels of thymidylate synthase (critical to intracellular 5-FU metabolism), p53 (related to chemotherapeutic resistance) and bcl-2 (important to cellular apoptosis) have been found to correlate with superior survival in gastric cancer [123, 124].
63.8.2 Staging of Gastric and Esophageal Cancer

With the emergence of laparoscopic ultrasound, nodal staging is now possible with laparoscopy. Finch and colleagues report that laparoscopic ultrasound is 84% accurate in TNM staging of esophageal cancers; their study compared laparoscopic ultrasound with CT and laparoscopy and showed a clear benefit for ultrasound in assessing GI cancers [125]. No single laboratory test yet exists to facilitate diagnosis and detection of recurrent gastric cancer. New techniques are emerging for the detection of individuals at increased risk for gastric cancer based on their genetic composition. These technologies include cDNA microarray, serial analysis of gene expression (SAGE), differential display and subtractive hybridization.
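Expression-profiling approaches such as those listed above ultimately reduce to comparing transcript abundance between groups. A minimal sketch of the core computation, a log2 fold change between tumor and normal expression, is shown below; the gene names and values are hypothetical:

```python
import math

def log2_fold_change(tumor_mean: float, normal_mean: float) -> float:
    """log2 ratio of mean expression in tumor vs. normal tissue."""
    return math.log2(tumor_mean / normal_mean)

# Hypothetical mean expression values (arbitrary units) for two genes.
expression = {"GENE_A": (80.0, 10.0), "GENE_B": (12.0, 48.0)}

for gene, (tumor, normal) in expression.items():
    lfc = log2_fold_change(tumor, normal)
    direction = "up" if lfc > 0 else "down"
    print(f"{gene}: log2FC = {lfc:+.1f} ({direction} in tumor)")
```

Real microarray or SAGE analyses add normalization, replicate variance and multiple-testing correction on top of this ratio, which is why candidate gene panels require independent validation before clinical use.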
63.8.3 Prediction of Survival and Guidance of Treatment

Other systems have been proposed to guide treatment and predict survival of patients with gastric cancer. Maruyama has developed computer software based on the demographic and clinical features of 4,302 gastric cancer patients at the National Cancer Center Hospital in Tokyo. The software can be used to guide surgeons in the extent of lymphadenectomy and to predict survival for a specific patient [126]. Similarly, nomograms and scoring systems have been developed that predict survival based on the weighted importance of sex, age, tumor location, Lauren histologic type, T stage, N stage and extent of lymphadenectomy.

63.8.4 Global Molecular Profiling

Emerging methods of global gene and protein profiling using bioinformatics are increasingly employed for molecular evaluation of Barrett's metaplasia. Results are preliminary and do not yet have clinical relevance. Mitas et al. developed a quantitative mathematical algorithm based on the expression levels of a panel of three genes, including TSPAN (tetraspan 1), ECGF1 (endothelial cell growth factor 1) and SPARC (secreted protein, acidic, cysteine-rich), to discriminate between Barrett's esophagus and esophageal adenocarcinoma [127]. Proteomic studies using mass spectrometry enable direct analysis of epithelial protein expression patterns and have been used in combination with microdissection by Streitz and colleagues [128]. Identification of specific biomarkers is necessary to select patients who need intensified surveillance, to better characterize populations for intervention studies including chemoprevention, and to improve outcomes and reduce care costs. To facilitate evaluation of the appropriateness and quality of further studies and to improve the ability to compare results, the NCI-EORTC developed and published the Reporting Recommendations for Tumor Marker Prognostic Studies (REMARK), available at http://www.cancerdiagnosis.nci.nih.gov/assessment/progress/clinical.html.

References
1. Ballantyne GH (2007) Telerobotic gastrointestinal surgery: phase 2 – safety and efficacy. Surg Endosc 21:1054–1062
2. Ruurda JP, Broeders IA, Simmermacher RP et al (2002) Feasibility of robot-assisted laparoscopic surgery: an evaluation of 35 robot-assisted laparoscopic cholecystectomies. Surg Laparosc Endosc Percutan Tech 12:41–45
3. Berguer R, Smith W (2006) An ergonomic comparison of robotic and laparoscopic technique: the influence of surgeon experience and task complexity. J Surg Res 134:87–92
4. Bodner JC, Zitt M, Ott H et al (2005) Robotic-assisted thoracoscopic surgery (RATS) for benign and malignant esophageal tumors. Ann Thorac Surg 80:1202–1206
5. Kalloo AN, Singh VK, Jagannath SB et al (2004) Flexible transgastric peritoneoscopy: a novel approach to diagnostic and therapeutic interventions in the peritoneal cavity. Gastrointest Endosc 60:114–117
6. Pai RD, Fong DG, Bundga ME et al (2006) Transcolonic endoscopic cholecystectomy: a NOTES survival study in a porcine model (with video). Gastrointest Endosc 64:428–434
7. Rattner D, Kalloo A (2006) ASGE/SAGES Working Group on Natural Orifice Translumenal Endoscopic Surgery. Surg Endosc 20:329–333
8. Palanivelu C, Rajan PS, Rangarajan M et al (2008) Transvaginal endoscopic appendectomy in humans: a unique approach to NOTES – world's first report. Surg Endosc 22:1343–1347
9. Bernhardt J, Gerber B, Schober HC et al (2008) NOTES – case report of a unidirectional flexible appendectomy. Int J Colorectal Dis 23:547–550
10. Rattner DW (2008) NOTES: where have we been and where are we going? Surg Endosc 22:1143–1145
11. Triadafilopoulos G (2004) Changes in GERD symptom scores correlate with improvement in esophageal acid exposure after the Stretta procedure. Surg Endosc 18:1038–1044
12. Triadafilopoulos G (2007) Endotherapy and surgery for GERD. J Clin Gastroenterol 41(Suppl 2):S87–S96
13. Cicala M, Gabbrielli A, Emerenziani S et al (2005) Effect of endoscopic augmentation of the lower oesophageal sphincter (Gatekeeper reflux repair system) on intraoesophageal dynamic characteristics of acid reflux. Gut 54:183–186
14. Fockens P, Bruno MJ, Gabbrielli A et al (2004) Endoscopic augmentation of the lower esophageal sphincter for the treatment of gastroesophageal reflux disease: multicenter study of the Gatekeeper Reflux Repair System. Endoscopy 36:682–689
15. Pleskow D, Rothstein R, Lo S et al (2005) Endoscopic full-thickness plication for the treatment of GERD: 12-month follow-up for the North American open-label trial. Gastrointest Endosc 61:643–649
16. Pleskow D, Rothstein R, Kozarek R et al (2007) Endoscopic full-thickness plication for the treatment of GERD: long-term multicenter results. Surg Endosc 21:439–444
17. Pleskow D, Rothstein R, Kozarek R et al (2008) Endoscopic full-thickness plication for the treatment of GERD: five-year long-term multicenter results. Surg Endosc 22:326–332
18. Rothstein RI (2008) Endoscopic therapy of gastroesophageal reflux disease: outcomes of the randomized-controlled trials done to date. J Clin Gastroenterol 42:594–602
19. Conio M, Cameron AJ, Chak A et al (2005) Endoscopic treatment of high-grade dysplasia and early cancer in Barrett's oesophagus. Lancet Oncol 6:311–321
20. Conio M, Repici A, Cestari R et al (2005) Endoscopic mucosal resection for high-grade dysplasia and intramucosal carcinoma in Barrett's esophagus: an Italian experience. World J Gastroenterol 11:6650–6655
21. Gross SA, Wolfsen HC (2008) The use of photodynamic therapy for diseases of the esophagus. J Environ Pathol Toxicol Oncol 27:5–21
22. Tokar JL, Haluszka O, Weinberg DS (2007) Endoscopic therapy of dysplasia and early-stage cancers of the esophagus. Semin Radiat Oncol 17:10–21
23. Wolfsen HC (2005) Present status of photodynamic therapy for high-grade dysplasia in Barrett's esophagus. J Clin Gastroenterol 39:189–202
24. Sproston P, Primatesta P (2003) Health survey for England 2002. The health of children and young people. The Stationery Office, London
25. Kopelman P, Jebb SA, Butland B (2007) Executive summary: Foresight 'Tackling Obesities: Future Choices' project. Obes Rev 8(Suppl 1):vi–ix
26. Field AE, Coakley EH, Must A et al (2001) Impact of overweight on the risk of developing common chronic diseases during a 10-year period. Arch Intern Med 161:1581–1586
27. Price GM, Uauy R, Breeze E et al (2006) Weight, shape, and mortality risk in older persons: elevated waist-hip ratio, not high body mass index, is associated with a greater risk of death. Am J Clin Nutr 84:449–460
28. Baumer JH (2007) Obesity and overweight: its prevention, identification, assessment and management. Arch Dis Child Educ Pract Ed 92(3):ep92–ep96
29. Kohn GP, Galanko JA, Overby DW, Farrell TM (2009) Recent trends in bariatric surgery case volume in the United States. Surgery 146(2):375–380
30. Schneider BE, Mun EC (2005) Surgical management of morbid obesity. Diabetes Care 28:475–480
31. Morino M, Toppino M, Bonnet G et al (2003) Laparoscopic adjustable silicone gastric banding versus vertical banded gastroplasty in morbidly obese patients: a prospective randomized controlled clinical trial. Ann Surg 238:835–841; discussion 841–842
32. Sugerman HJ, Kral JG (2005) Evidence-based medicine reports on obesity surgery: a critique. Int J Obes (Lond) 29:735–745
33. Alper D, Ramadan E, Vishne T et al (2000) Silastic ring vertical gastroplasty – long-term results and complications. Obes Surg 10:250–254
34. Suter M, Jayet C, Jayet A (2000) Vertical banded gastroplasty: long-term results comparing three different techniques. Obes Surg 10:41–46; discussion 47
35. Belachew M, Legrand MJ, Defechereux TH et al (1994) Laparoscopic adjustable silicone gastric banding in the treatment of morbid obesity. A preliminary report. Surg Endosc 8:1354–1356
36. DeMaria EJ, Jamal MK (2005) Laparoscopic adjustable gastric banding: evolving clinical experience. Surg Clin North Am 85:773–787; vii
37. Angrisani L, Furbetta F, Doldi SB et al (2003) Lap Band adjustable gastric banding system: the Italian experience with 1863 patients operated on 6 years. Surg Endosc 17:409–412
38. DeMaria EJ, Schauer P, Patterson E et al (2005) The optimal surgical management of the super-obese patient: the debate. Presented at the annual meeting of the Society of American Gastrointestinal and Endoscopic Surgeons, Hollywood, Florida, USA, April 13–16, 2005. Surg Innov 12:107–121
39. Weiner R, Blanco-Engert R, Weiner S et al (2003) Outcome after laparoscopic adjustable gastric banding – 8 years experience. Obes Surg 13:427–434
40. Dolan K, Hatzifotis M, Newbury L et al (2004) A comparison of laparoscopic adjustable gastric banding and biliopancreatic diversion in superobesity. Obes Surg 14:165–169
41. Mognol P, Chosidow D, Marmuse JP (2005) Laparoscopic gastric bypass versus laparoscopic adjustable gastric banding in the super-obese: a comparative study of 290 patients. Obes Surg 15:76–81
42. Chevallier JM, Zinzindohoue F, Douard R et al (2004) Complications after laparoscopic adjustable gastric banding for morbid obesity: experience with 1,000 patients over 7 years. Obes Surg 14:407–414
43. Fielding GA, Ren CJ (2005) Laparoscopic adjustable gastric band. Surg Clin North Am 85:129–140; x
44. Dargent J (2005) Esophageal dilatation after laparoscopic adjustable gastric banding: definition and strategy. Obes Surg 15:843–848
45. Shen R, Dugay G, Rajaram K et al (2004) Impact of patient follow-up on weight loss after bariatric surgery. Obes Surg 14:514–519
46. Moon Han S, Kim WW, Oh JH (2005) Results of laparoscopic sleeve gastrectomy (LSG) at 1 year in morbidly obese Korean patients. Obes Surg 15:1469–1475
47. Silecchia G, Boru C, Pecchia A et al (2006) Effectiveness of laparoscopic sleeve gastrectomy (first stage of biliopancreatic diversion with duodenal switch) on co-morbidities in super-obese high-risk patients. Obes Surg 16:1138–1144
48. Frezza EE (2007) Laparoscopic vertical sleeve gastrectomy for morbid obesity. The future procedure of choice? Surg Today 37:275–281
49. Deviere J, Ojeda Valdes G, Cuevas Herrera L et al (2008) Safety, feasibility and weight loss after transoral gastroplasty: first human multicenter study. Surg Endosc 22:589–598
50. Kantsevoy SV, Hu B, Jagannath SB et al (2007) Technical feasibility of endoscopic gastric reduction: a pilot study in a porcine model. Gastrointest Endosc 65:510–513
51. Genco A, Bruni T, Doldi SB et al (2005) BioEnterics intragastric balloon: the Italian experience with 2,515 patients. Obes Surg 15:1161–1164
52. Genco A, Cipriano M, Bacci V et al (2006) BioEnterics intragastric balloon (BIB): a short-term, double-blind, randomised, controlled, crossover study on weight reduction in morbidly obese patients. Int J Obes (Lond) 30:129–133
53. Melissas J, Mouzas J, Filis D et al (2006) The intragastric balloon – smoothing the path to bariatric surgery. Obes Surg 16:897–902
54. Spyropoulos C, Katsakoulis E, Mead N et al (2007) Intragastric balloon for high-risk super-obese patients: a prospective analysis of efficacy. Surg Obes Relat Dis 3:78–83
55. Scopinaro N, Marinari G, Camerini G et al (2005) Biliopancreatic diversion for obesity: state of the art. Surg Obes Relat Dis 1:317–328
56. Buchwald H, Avidor Y, Braunwald E et al (2004) Bariatric surgery: a systematic review and meta-analysis. JAMA 292:1724–1737
57. Scopinaro N (2006) Biliopancreatic diversion: mechanisms of action and long-term results. Obes Surg 16:683–689
58. Scopinaro N, Papadia F, Marinari G et al (2007) Long-term control of type 2 diabetes mellitus and the other major components of the metabolic syndrome after biliopancreatic diversion in patients with BMI