Smart Object, not Smart Environment: Cooperative Augmentation of Smart Objects Using Projector-Camera Systems

David Molyneaux
April 2008
Lancaster University

Submitted in part fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science

Abstract

Smart objects research explores embedding sensing and computing into everyday objects - augmenting objects to be a source of information on their identity, state, and context in the physical world. A major challenge for the design of smart objects is to preserve their original appearance, purpose and function. Consequently, many research projects have focussed on adding input capabilities to objects, while neglecting the requirement for an output capability that would provide a balanced interface.

This thesis presents a new approach to adding output capability: smart objects cooperating with projector-camera systems. The concept of Cooperative Augmentation enables the knowledge required for visual detection, tracking and projection on smart objects to be embedded within the object itself. This allows projector-camera systems to provide generic display services, enabling spontaneous use by any smart object to achieve non-invasive and interactive projected displays on its surfaces. Smart objects cooperate to achieve this by describing their appearance directly to the projector-camera systems and by using embedded sensing to constrain the visual detection process.

We investigate natural appearance vision-based detection methods and perform an experimental study specifically analysing the increase in detection performance achieved with movement sensing in the target object. We find that detection performance increases significantly with sensing, indicating that the combination of different sensing modalities is important, and that different objects require different appearance representations and detection methods. These studies inform the design and implementation of a system architecture which serves as the basis for three applications demonstrating visual detection, integration of sensing, projection, interaction with displays and knowledge updating.

The displays achieved with Cooperative Augmentation allow any smart object to deliver visual feedback to users from implicit and explicit interaction with information represented or sensed by the physical object, supporting objects as both input and output medium simultaneously. This contributes to the central vision of Ubiquitous Computing by enabling users to address tasks in physical space with direct manipulation and to receive feedback on the objects themselves, where it belongs in the real world.

Preface

This dissertation has not been submitted in support of an application for another degree at this or any other university. It is the result of my own work and includes nothing which is the outcome of work done in collaboration except where specifically indicated. Excerpts of this thesis have been published in conference and workshop manuscripts, most notably: [Molyneaux and Gellersen 2006], [Molyneaux, Gellersen et al. 2007] and [Molyneaux, Gellersen et al. 2008].

Acknowledgments

This work would not have been possible without the support and facilities offered by my supervisors Prof. Hans Gellersen and Dr. Gerd Kortuem. I thank them both for their never-ending encouragement, inspiration and guidance throughout my time as a PhD student.
I consider myself lucky to have been given the opportunity to work in such an interesting field and honoured to have been part of the exceptional research group they have created at Lancaster. The unique lab environment with the large international crowd of people makes it a special place, and one that I will cherish forever.

I wish to thank all those other colleagues in the Ubicomp group at Lancaster who, either through direct assistance or in discussions, have helped this work in some way: Henoc Agbota, Mohammed Alloulah, Bashar Al Takrouri, Urs Bischoff, Florian Block, Carl Fischer, Roswitha Gostner, Andrew Greaves, Yukang Guo, Robert Hardy, Mike Hazas, Henrik Jernström, Serko Katsikian, Chris Kray, Kristof Van Laerhoven, Rene Mayrhofer, Matthew Oppenheim, Faizul Abdul Ridzab, Enrico Rukzio, Dominik Schmidt, Jennifer Sheridan, Martin Strohbach, Vasughi Sundramoorthy, Nic Villar, Jamie Ward and Martyn Welsh. Plus all the visitors and other alumni: Aras Bilgen, Martin Berchtold, Andreas Bulling, Clara Fernández de Castro, Manuel García-Herranz del Olmo (especially for all the boiled eggs), Alina Hang, Pablo Haya, Paul Holleis, Matthew Jervis, Russell Johnson, Yasue Kishino, Matthias Kranz, Masood Masoodian, Christoph März, Michael Müller, Kavitha Muthukrishnan, Albrecht Schmidt, Sara Streng, Tsutomu Terada and all those other people I have encountered at Lancaster during my time here.

Special thanks go to some of the most important people at Lancaster: the secretaries and support staff who have helped me during my time here, Aimee, Ben, Cath, Chris, Gillian, Helen, Ian, Jess, Liz, Sarah, Steve and Trish. Also thanks to Gordon for being a superb head of department and to CAKES for interesting presentations and away-days.

Others outside of Lancaster have also helped greatly with this work, most notably Bernt Schiele in Darmstadt – without his support and guidance much of this work would not have been possible. I am greatly indebted to him. Thanks also to everyone from Multimodal Interactive Systems in Darmstadt for the great visits: Krystian Mikolajczyk, Bastian Leibe, Gyuri Dorko, Edgar Seemann, Mario Fritz, Andreas Zinnen, Nicky Kern, Tam Huynh, Ulrich Steinhoff, Niko Majer and Ursula Paeckel.

The FLUIDUM project, and specifically Andreas Butz and Mira Spassova, deserve a big mention, as that is where it all started. Without your initial support, Java code and spark of enthusiasm for the steerable projector work I would never have travelled as far as I did during my PhD journey. Thanks also to those whose code was incorporated in some form in the demos, most notably Mark Pupilli from Bristol for the basic Particle Filter and Rob Hess for the SIFT implementation I modified for my work.

Thanks to both my examiners – Mike Hazas and Andrew Calway from the University of Bristol – for the interesting discussions and insights during the viva that have enhanced this thesis greatly beyond its initial submission.

Most importantly, I thank my family – Mum, Dad, Jane and grandparents – for all the love and support (in every sense of the word) they have given me over the years. Rose also deserves a special mention. Without her I would still be stuck in my second year with no demo or paper. She has inspired me, kept me on track, put up with my grumpy mornings and read probably more about steerable projectors and Cooperative Augmentation in the last two years than she would otherwise want to... Thank you for being there for me Rose, I hope I can do the same for you.
Finally, I would like to thank the projects and organisations that funded this work, namely the EPSRC, the Ministry of Economic Affairs of the Netherlands through the BSIK project Smart Surroundings under contract no. 03060, and Lancaster University through the e-Campus grant.

David Molyneaux
Lancaster, September 2008

Contents

Chapter 1  Introduction
1.1  Smart Object Output
1.2  Cooperative Augmentation
1.3  Challenges
1.4  Contributions
1.5  Thesis Structure

Chapter 2  Related Work
2.1  Introduction
2.2  Ubiquitous Computing
2.2.1  Sensor Nodes
2.2.2  Smart Objects
2.2.3  Tangible User Interfaces
2.2.4  Input-Output Imbalance
2.3  Projector-based Augmented Reality
2.3.1  Projector-Camera Systems
2.3.2  Mobile, Handheld and Wearable Projector-Camera Systems
2.3.3  Multi-Projector Display Systems
2.3.4  Steerable Projector-Camera Systems
2.3.5  Interaction with Projector-Camera System Displays
2.3.6  Projected Display Geometrical Calibration
2.3.7  Projection Photometric and Colorimetric Calibration
2.3.8  Issues with Projector-based Augmented Reality
2.3.9  Mobile Objects
2.4  Detection and Tracking of Objects
2.5  Fiducial Marker Detection and Tracking
2.6  Natural Appearance Detection and Tracking
2.6.1  Tracking
2.6.2  Detection
2.6.3  Multi-Cue Detection and Tracking
2.6.4  Camera Model
2.6.5  Pose Calculation
2.6.6  Hybrid Detection
2.7  Vision-based Detection with Physical Sensing
2.7.1  Sensing for Pose and Object Motion Prediction
2.7.2  Structured Light Sensing for Location and Pose
2.8  Summary

Chapter 3  Cooperative Augmentation Conceptual Framework
3.1  Cooperative Augmentation Environment
3.2  The Object Model
3.3  The Projector-Camera System Model
3.4  Cooperative Augmentation Process
3.4.1  Detection
3.4.2  Projection
3.5  Conclusion

Chapter 4  Vision-based Object Detection
4.1  Natural Appearance Detection
4.2  Object Detection Methods
4.3  Evaluation Dataset
4.3.1  Object Appearance Library
4.4  Scale and Rotation Experiments
4.4.1  Design
4.4.2  Procedure
4.4.3  Apparatus
4.4.4  Results
4.5  Discussion
4.6  Conclusion

Chapter 5  Cooperative Detection
5.1  Video Test Library
5.2  Cooperative Detection Experiments
5.2.1  Design
5.2.2  Training
5.2.3  Procedure
5.2.4  Results
5.3  Discussion
5.3.1  Single Cue Detection
5.3.2  Multi-Cue Combination
5.3.3  Limitation of Movement Sensing
5.3.4  Use of Sensing in Smart Objects for Pose Calculation
5.4  Conclusion

Chapter 6  System Architecture
6.1  Architecture Design Overview
6.1.1  Architecture Novelty
6.2  Architecture Components
6.2.1  Detection and Tracking
6.2.2  Projection
6.2.3  Pan & Tilt Tracking Component
6.2.4  Smart Objects
6.2.5  Object Proxy
6.2.6  Database Server
6.2.7  Knowledge Updating
6.3  Implementation
6.3.1  Detection and Tracking
6.3.2  Projection
6.3.3  Pan and Tilt Control
6.3.4  Object Proxy
6.3.5  Database Server
6.3.6  Networking Protocol Implementation
6.4  Projector-Camera System Calibration
6.5  Discussion
6.6  Detection Method Memory Experiments
6.6.1  Design
6.6.2  Results
6.6.3  Discussion
6.7  Pose Calculation, Pose Jitter and Projection Accuracy Experiments
6.7.1  Design
6.7.2  Procedure
6.7.3  Apparatus
6.7.4  Results
6.7.5  Discussion
6.8  Conclusion

Chapter 7  Demonstration Applications
7.1  Smart Cooperative Chemical Containers
7.1.1  Object Model
7.1.2  Registration
7.1.3  Detection
7.1.4  Projection
7.1.5  Manipulating the Object
7.1.6  Knowledge Updating
7.1.7  Objects Departing the Environment
7.1.8  Discussion
7.2  Smart Photograph Album
7.2.1  Object Model
7.2.2  Registration and Detection
7.2.3  Projection and Manipulation
7.2.4  Discussion
7.3  Smart Cooking
7.3.1  Hard-Boiled Egg Recipe
7.3.2  Object Model
7.3.3  Task Procedure
7.3.4  Discussion
7.4  Conclusion

Chapter 8  Conclusion
8.1  Contributions and Conclusions
8.2  Benefits of Our Approach
8.3  Limitations
8.4  Future Work

Bibliography

Appendix A  Steerable Projector-Camera System Construction
A.1  Steerable Projector-Camera System Design
A.2  Steering Mechanism
A.2.1  Steering Mechanism Characteristics
A.2.2  Moving Mirror and Moving Head Comparison
A.3  Video Projector
A.4  Camera
A.5  Commercial Steerable Projectors
A.6  Construction of Steerable Projector Systems
A.7  Steerable Projector Hardware
A.7.1  Projector Selection
A.7.2  Steering Mechanism Selection
A.7.3  Camera Selection
A.8  Moving Head Steerable Projector Construction
A.8.1  Moving Head Control
A.9  Moving Mirror Steerable Projector Construction
A.10  Characterisation of Moving Head Mechanism
A.10.1  Design
A.10.2  Procedure
A.10.3  Results
A.10.4  Discussion
A.11  Characterisation of Projector
A.12  Steerable Projector System Comparison
A.13  Conclusion

Appendix B  Smart Object Programming Examples
B.1  Rough Handling Detection
B.2  Smart Book
B.3  Smart Furniture Assembly
B.4  Smart Cooking Object Model

List of Figures

Figure 1.1  Cooperative Augmentation of smart objects with Projector-Camera Systems
Figure 2.1  This work draws from 4 main areas of computing: Ubiquitous Computing, Augmented Reality, Computer Vision and Tangible User Interfaces (TUI)
Figure 2.2  The "Weiser Continuum" of ubiquitous computing [Weiser 1996] to monolithic computing, taken from [Newman, Bornik et al. 2007]
Figure 2.3  The Milgram and Kishino "Virtuality" Continuum of Reality to Virtual Reality [Milgram and Kishino 1994], taken from [Newman, Bornik et al. 2007]
Figure 2.4  The Milgram-Weiser Continuum [Newman, Bornik et al. 2007]
Figure 2.5  (left) Smart-Its design, (centre) Particle Smart-Its device main board, (right) Add-on sensor boards [Decker, Krohn et al. 2005]
Figure 2.6  (left) Crossbow Technology Motes, (right) Motive Corporation (now Sentilla Corporation) Telos Mote [Inc 2007]
Figure 2.7  (left) Mediacup Object [Gellersen, Beigl et al. 1999], (centre) Cooperative Chemical Containers, (right) Hazard warning display using 3 LEDs [Strohbach, Gellersen et al. 2004]
Figure 2.8  (left) Proactive Furniture Assembly [Antifakos, Michahelles et al. 2004], (centre) Display Cube Object [Terrenghi, Kranz et al. 2006], (right) Tuister Object [Butz, Groß et al. 2004]
Figure 2.9  Using a sensor-enabled Cubicle for interaction [Block, Schmidt et al. 2004]
Figure 2.10  (left) Prototype mock-up display cube using static front-projection, (centre) The current LED-matrix display Z-agon interactive cube, (right) The envisioned fully interactive cube with colour displays [Matsumoto, Horiguchi et al. 2006]
Figure 2.11  (left) A chessboard with memory, (centre) Interactive optics design, (right) A prototype I/O bulb projector-camera system in the Luminous Room [Underkoffler, Ullmer et al. 1999]
Figure 2.12  Spatially Augmented Reality (SAR) [Raskar, Welch et al. 1999]
Figure 2.13  The projector-camera system family decomposed by mobility and display size, compared to traditional desktop monitors
Figure 2.14  (left and centre) The iLamps project developed a handheld projector-camera system, (right) Detecting circular fiducial markers to augment a wall scene with projection
Figure 2.15  SONY's handheld spotlight projector [Rapp, Michelitsch et al. 2004], (centre left) Multi-user interaction using handheld projectors [Cao, Forlines et al. 2007], (right) Wearable projector-camera system [Karitsuka and Sato 2003]
Figure 2.16  (left) The PixelFlex reconfigurable multi-projector display [Yang, Gotz et al. 2001], (right) Ad-hoc projector clustering and geometric correction for seamless display with iLamps [Raskar, VanBaar et al. 2005]
Figure 2.17  IBM's Everywhere Display demonstration at SIGGRAPH [Pinhanez 2001]
Figure 2.18  (left) Messaging notification, (centre) dynamic navigation signage, (right) augmented filing cabinet object [Pinhanez 2001]
Figure 2.19  (left and centre) Portable Projected Display Screen [Borkowski 2006], (right) Personal Interaction Panel [Ehnes, Hirota et al. 2004]
Figure 2.20  IBM's Steerable Interface EDML Framework
Figure 2.21  The IBM Steerable Interfaces Characterisation [Pingali, Pinhanez et al. 2003]
Figure 2.22  Interaction Techniques for Projected Displays
Figure 2.23  Fingertip matching in (a) Camera image, (b) Motion mask, (c) Fingertip template match [Pinhanez, Kjeldsen et al. 2001]
Figure 2.24  (far left) Polar Motion Map, (left) the corresponding segments unrolled to a rectangle, (right) PMM in use [Kjeldsen 2005]
Figure 2.25  (left) Using a red paper slider widget [Kjeldsen 2005], (right) Increasing camera exposure to compensate for projector light [Letessier and Bérard 2004]
Figure 2.26  (left) On-axis projection gives a rectangular image, (right) Geometric distortion due to horizontal projector rotation when projecting onto a planar screen
Figure 2.27  Classification of projector-camera system geometrical calibration methods, based on [Borkowski, Riff et al. 2003]
Figure 2.28  Model-based distortion correction based on known surface geometry and pose relative to the projector [Pinhanez 2001]
Figure 2.29  Projecting dots onto a planar screen to recover the projector-camera homography
Figure 2.30  (left) Creating the projector-camera homography, (right) Using the combined projector-screen transformation to project an undistorted image [Sukthankar, Stockton et al. 2001]
Figure 2.31  (left) Gray-code structured light patterns, (centre) Geometrically corrected multi-projector display on curved display [Raskar, VanBaar et al. 2005], (right) Automatic Projector Display Surface Estimation Using Every-Day Imagery [Yang and Welch 2001]
Figure 2.32  (left) Colour correction to project on any surface [Bimber, Emmerling et al. 2005], (right) Real-time adaptation for mobile objects [Fujii, Grossberg et al. 2005]
Figure 2.33  (left and centre) Uneven illumination on a non-planar surface, (right) uncompensated and intensity compensated image [Park, Lee et al. 2006]
Figure 2.34  VRP used to overcome occluding light shadows [Summet, Flagg et al. 2007]
Figure 2.35  (left) Interactive buttons projected on a variety of everyday surfaces, (right) Dynamic Shader Lamps projecting on mobile tracked object [Bandyopadhyay, Raskar et al. 2001]
Figure 2.36  (left) Examples of different Fiducial Marker systems [Fiala 2005], (right) Using ARToolkit markers with a HMD to view a virtual character [Kato and Billinghurst 1999]
Figure 2.37  ARToolkit Recognition and Overlay method [Kato and Billinghurst 1999]
Figure 2.38  (left and centre) Occlusion of ARToolkit marker prevents tracking, (right) Occlusion of ARTag marker [Fiala 2005]
Figure 2.39  (left) ARToolkit fails to detect markers under different illumination conditions in the same frame [Fiala 2005], (right) Ehnes et al. track ARToolkit markers with a steerable projector and project on modelled surfaces in the marker area [Ehnes, Hirota et al. 2004]
Figure 2.40  (left) Drummond and Cipolla's RAPiD-like algorithm, (right) robustness to partial occlusion [Drummond and Cipolla 2002]
Figure 2.41  (left) Two dissimilar objects and their Mag-Lap histograms corresponding to a particular viewpoint, image plane rotation and scale [Schiele and Crowley 2000], (right) Shape Context 2D log-polar histogram based on relative point locations [Belongie, Malik et al. 2002]
Figure 2.42  (left) SIFT local feature based AR method, (right, top) SIFT features detected on a mug, (right, bottom) AR teapot added to camera display [Gordon and Lowe 2004]
Figure 2.43  (left) Pinhole Perspective Camera Model, (right) Checkerboard pattern for camera calibration with overlaid pattern coordinate system
Figure 2.44  (left) Projection on mobile planar surfaces with single light sensor [Summet and Sukthankar 2005], (right) Projection onto surfaces with sensors at each corner for rotation information [Lee, Hudson et al. 2005]
Figure 3.1  Detection Sequence Diagram
Figure 3.2  Projection Sequence Diagram
Figure 3.3  Knowledge flow in the detection process
Figure 4.1  Experiment Objects (left to right, top to bottom): a football, a chemical container barrel, a book, a product box, a smart cube, a chair, a cup, a notepad, a cereal box and a toaster
Figure 4.2  (left) Box Object scale images at 1m, 3m, 6m from camera, (right) Notepad object rotation images at -40°, 0°, +40°
Figure 4.3  Different numbers of corner features (yellow dots) are detected, and in different locations, on the Cereal Box object at 1m, 3m and 6m distance with the single-scale Harris algorithm [Harris and Stephens 1988]
Figure 4.4  Camera and Object Detection Coordinate System Transformations
Figure 4.5  (left) Number of features detected on Mediacube object by Harris-based algorithms, (right) Harris-based algorithm detector repeatability when trained at 4m
Figure 4.6  (left) Harris detector with multiple training images at 1.1m, 1.85m, 3m, 5m, (right) Mean repeatability with varying training distance for Harris-based detector algorithms
Figure 4.7  (left) SIFT Descriptor matching percentages with 2D rotation, (right) Number of features detected with 3D rotation
Figure 4.8  (left) Detector repeatability for 3D rotation, (right) Descriptor matching percentages for 3D rotation
Figure 4.9  (left) Detection performance over all objects for SIFT local features at 1m, 2m and 6m training distances, (right) Detection performance over all objects for Shape Context at 1m, 3m and 6m training distances
Figure 4.10  (left) Detection performance over all objects for Texture (Mag-Lap) Multidimensional Histograms at 1m, 2m and 6m training distances, (right) Detection performance over all objects for LAB Colour Histograms at 1m, 3m and 6m training distances
Figure 4.11  (left) Detection percentage repeatability of SIFT (DoG) over all objects when varying 3D rotation angle, (right) Detection performance over all objects when varying 3D rotation angle, for Texture, Colour and Shape, at 3m training and test distances
Figure 5.1  Video Test Library images of chemical container (top) and cereal box (bottom)
Figure 5.2  Graph of detection algorithm results without sensing (orange) and with movement sensing (blue), averaged over all objects. Error bars show 95% confidence level (CL) of mean
Figure 5.3  (left) SIFT Local Feature Detection Performance with and without sensing for each object in test library, (right) Mag-Lap Texture Detection Performance with and without sensing for each object in test library
Figure 5.4  (left) Lab Colour Detection Performance with and without sensing for each object in test library, (right) Shape Context Detection Performance with and without sensing for each object in test library
Figure 5.5  Detection Performance improvement using multiple cues, No Sensing
Figure 5.6  Detection Performance improvement using multiple cues and Movement Sensing
Figure 6.1  Distributed System Architecture Overview
Figure 6.2  Detection method selection based on smart object knowledge
Figure 6.3  The 4 coordinate systems: Camera, Projector and smart object Local Coordinate Systems, and the arbitrary World Coordinate System
Figure 6.4  Search Methods: (left) Creeping Line Search over the whole Pan and Tilt Field Of View, (right) Expanding-Box search from previous object location
Figure 6.5  Two views of a Lab Environment: (left) South-West Elevation, (right) North-East Elevation
Figure 6.6  The moving-head steerable projector-camera system
Figure 6.7  View down into a background model re-projected into a hemisphere, captured by rotating a moving head steerable projector through a 360°x90° FOV
Figure 6.8  (left) Image of detected Book object with overlaid 3D model and object coordinate system axes, (right) Four detected objects tracked simultaneously - the Chemical Container, the Book, the Card and the Cup (green lines denote the maximum extent of the 3D model)
Figure 6.9  (left) 3-surface projection on box object, (right) Sensed temperature projected on the non-planar smart mug surface - the blue wire at the right is the antenna of the Smart-Its device
Figure 6.10  Architecture Message Protocol and Routing
Figure 6.11  (left) Composite image of ARToolkit marker aligned with 5 far calibration locations and handheld for one of the near calibration locations, (right) Projector calibration using projected pattern on planar surface [Park, Lee et al. 2006]
Figure 6.12  The location accuracy and jitter experiment object grid locations, orthogonal to the camera. Object test locations are the green circles
Figure 6.13  Error in Z-axis location for objects at a large distance from the camera causes smaller error in the X,Y location of the projection
Figure 7.1  (left) New objects arrive in the environment, (centre) An employee walks with containers, (right) The employee places one object on the floor
Figure 7.2  (left) Detected container with green wireframe 3D Model superimposed on the camera view using calculated pose, (right) A warning message projection on two chemical containers
Figure 7.3  Partial Object Model representation of Chemical Container Demonstrator
Figure 7.4  (left) Scale and rotation invariant local features detected on chemical containers, (right) A container leaves the environment with the employee
Figure 7.5  Photograph Album smart object with projection: (left) front cover, button state 10, (centre) being opened, (right) inside, button state 5
Figure 7.6  Partial Object Model representation of Photo Album Demonstrator
Figure 7.7  Photograph Album Content to Project
Figure 7.8  (left) Recipe State 0, (right) Add Egg projection inside pan
Figure 7.9  Recipe State 8
Figure 7.10  (left) Example salt 3D model, (right) 3D model with added base projection area
Figure A.1  (left) Moving mirror display light, (right) Moving head display light [SteinigkeShowtechnicGmbH 2005]
Figure A.2  (left) The IBM Everywhere Display Steerable Projector Design, (right) Everywhere Display Projection Cone [Pinhanez 2001]
Figure A.3  (left) Single chip DLP projector optics [TexasInstruments 2005], (right) Mitsubishi Pocket Projector [MitsubishiElectricCorporation 2008]
Figure A.4  (left) A co-axial projector-camera system [Fujii, Grossberg et al. 2005], (centre) Camera Resolution Test Chart [Geoffrion 2005], (right) Bayer Pattern Colour Filter Array [Geoffrion 2005]
Figure A.5  (left) Futurelight PHS150 single arm yoke, (right) Compulite "Luna" dual fork yoke
Figure A.6  The pan and tilt yoke uncovered
Figure A.7  Steerable Projector Cabling Layout
Figure A.8  (left) The fabricated projector bracket attached to the yoke, (centre) FLUIDUM room projector mounting direct to concrete [Butz, Schneider et al. 2004], (right) Wooden bracing of the four threaded rods from which the projector is hung
Figure A.9  (left) The operational moving head steerable projector system, (centre) The tripod-mounted moving mirror steerable projector system, (right) Close-up of operational moving mirror yoke and pan-tilt camera
Figure A.10  Moving Mirror Steerable Projector Cabling
Figure A.11  (left) Image size versus distance at Zoom 0, (right) Image size versus distance at Zoom 20 (mid-zoom)
Figure A.12  (left) Display resolution versus distance to projection surface at Zoom 0, (right) Display resolution versus distance to projection surface at Zoom 20
Figure A.13  (left) Focus steps versus distance to display surface at Zoom 0, (right) Focus steps versus distance to display surface at Zoom 20
Figure B.1  State Machine Program for Detecting Rough Handling of a smart object
Figure B.2  State Machine Program for Articulated Smart Book with Force Sensors on Each Page
Figure B.3  Smart Furniture Assembly Program for 3 Individual Pieces
Figure B.4  Hard-Boiled Egg Recipe States 0-2
Figure B.5  Hard-Boiled Egg Recipe States 3-5
Figure B.6  Hard-Boiled Egg Recipe States 6-8
Figure B.7  Hard-Boiled Egg Recipe States 8-10
Figure B.8  Hard-Boiled Egg Recipe States 10-12

List of Tables

Table 5.1  Mean algorithm runtime per frame, averaged over 200 frames
Table 5.2  Mean time to detection from first entry to the environment
Table 5.3  Ranking of Cues by Detection Results for Each Object, No Movement Sensing
Table 5.4  Ranking of Cues by Detection Results for Each Object, with Movement Sensing
Table 5.5  Guidance for which method to use, based on object appearance type
Table 6.1  Appearance knowledge levels and detection methods with associated processing cost
Table 6.2  Geometric correction methods for projection based on object geometry [Bimber and Raskar 2005]
Table 6.3  Memory requirements for Appearance Knowledge and the total Object Model with single viewpoint (minimum) and full viewing sphere (maximum)
Table 6.4  Median Rotation Error results for each object, over -70° to +70° out-of-plane and 10° to 350° in-plane
Table 6.5  Median Location Calculation Error in X,Y,Z Camera Coordinate System for each object, averaged over all grid locations
Table 6.6  Median Projection Location Error on the Object X,Y front surface plane for each object, averaged over all grid locations
Table 6.7  Median 3D Object Location Jitter from Median Location for each object, averaged over all grid locations
Table 6.8  Median Projection Location Jitter from Median Location on the Object front surface plane for each object, averaged over all grid locations
Table A.1  Four commercial steerable projectors
Table A.2  Projector Specifications
Table A.3  Yoke Selection Specifications
Table A.4  Moving Head Steerable Projector Characterisation
Table A.5  Steerable Projector Steering Mechanism Comparison
Table A.6  Steerable Projector Video Projector Comparison
Table A.7  Steerable Projector Camera Comparison

Chapter 1
Introduction

One of the central visions of Ubiquitous Computing is that the environment itself becomes the user interface [Ishii and Ullmer 1997]. Interaction will be significantly different from interaction with traditional computing – physical objects, surfaces and spaces themselves will allow us to perform our tasks, while the technology itself becomes transparent, disappearing into the background [DisappearingComputer 2002]. Interaction will no longer be device centric, but information centric, allowing both implicit and explicit interaction with the information represented and sensed by physical objects [Schmidt, Kranz et al. 2005].
Smart objects research explores embedding sensing and computing into everyday objects - augmenting objects to be a source of information on their identity, state, and context in the physical world. While many research projects have focussed on adding input capabilities to objects, less attention has been paid to adding output, creating an imbalance in the interface. Giving an object output capability allows users to address tasks in physical space with direct feedback on the objects themselves. This thesis investigates how we can support physical objects simultaneously as input and output medium, redressing the imbalance in the interface. This first chapter presents a short motivation for the work, identifies the specific problems we address, and describes our approach and the contributions we make.

1.1 Smart Object Output

As the interest in creating smart objects grows, they are expected to bridge the gap between the physical and digital world, becoming part of our lives in economically important areas such as retail, supply chain or asset management [Lampe and Strassner 2003; Siegemund and Flörkemeier 2003; Decker, Beigl et al. 2004] and health and safety monitoring in work places [Strohbach, Gellersen et al. 2004; Kortuem, Alford et al. 2007]. A major challenge for the design of smart objects is to preserve their original appearance, purpose and function, thus exploiting natural interaction and a user's familiarity with the object [Schmidt, Strohbach et al. 2002]. In many cases, however, we do want to augment an object with digital information, for example to reveal otherwise hidden information stored inside the object or to give direct visual feedback to a user from object manipulation. An ability for objects to function as both input and output medium simultaneously enables scenarios such as objects that monitor their physical condition and visually display warnings if it becomes critical [Strohbach, Gellersen et al. 2004]. However, delivering this visual information to the user by adding output capability to objects is a challenge.

One approach is to routinely embed displays in objects. For example, thin, flexible displays such as e-paper are expected to become common in a few years' time. However, this approach is expensive, especially if the object is disposable, and inflexible, as the product designer must choose at design time which surfaces are to be augmented. An additional problem is that embedded displays change an object's appearance and can change the way an object is used. For example, adding a display to a smart cup would prevent it from being put in the dishwasher unless the display were removable. Either way, we have changed the object's natural appearance and function.

Another way we could visualise this information is by using Augmented Reality display devices, such as head-mounted displays (HMDs), tablet PCs, PDAs or mobile phones. However, carrying special-purpose devices is encumbering for a user and limits interaction to a single person, whereas there are many situations in which it would be beneficial for a group of users to see the same display simultaneously.

In contrast, the recent availability of small, cheap and bright video projectors makes them practical for augmenting objects with non-invasive projected displays. By adding a camera and using computer vision techniques, a projector-camera system can also dynamically detect and track objects [Borkowski, Riff et al. 2003; Ehnes, Hirota et al. 2004], correct for object surface geometry [Pinhanez 2001; Borkowski, Riff et al. 2003; Ehnes, Hirota et al. 2004; Bimber and Raskar 2005] and for varying surface colour and texture [Fujii, Grossberg et al. 2005], and allow the user to interact directly with the projected image [Kjeldsen, Pinhanez et al. 2002]. The use of a projector-camera system does not rely on adding to or modifying the hardware of the object itself, instead creating temporary displays on objects in the environment without permanently changing their appearance or function.
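To make the geometric side of this correction concrete, the sketch below shows the standard planar case: a homography estimated from four projector-camera point correspondences is used to pre-warp an image so that, once projected, it appears undistorted on a flat surface observed by the camera. This is a minimal illustration using OpenCV with placeholder corner coordinates, not the calibration procedure used in this thesis (projector-camera calibration is discussed in Chapters 2 and 6).

```python
import cv2
import numpy as np

# Four corners of a projected calibration pattern as seen in the camera image
# (placeholder values), and the corresponding corners in projector pixel space.
camera_pts = np.float32([[102, 87], [583, 120], [560, 455], [95, 430]])
projector_pts = np.float32([[0, 0], [1024, 0], [1024, 768], [0, 768]])

# Homography mapping camera coordinates to projector coordinates.
H_cam_to_proj, _ = cv2.findHomography(camera_pts, projector_pts)

def prewarp(image, target_quad_cam):
    """Warp 'image' so that the projection lands undistorted on the planar
    region whose corners were observed at 'target_quad_cam' in the camera."""
    h, w = image.shape[:2]
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    # Where the display should land, expressed in projector pixels.
    dst = cv2.perspectiveTransform(
        target_quad_cam.reshape(-1, 1, 2), H_cam_to_proj).reshape(-1, 2)
    H_img_to_proj = cv2.getPerspectiveTransform(src, np.float32(dst))
    # Render into a full projector-resolution frame buffer (1024x768 assumed).
    return cv2.warpPerspective(image, H_img_to_proj, (1024, 768))
```

In practice the correspondences would be obtained automatically, for example by detecting a projected pattern in the camera image; the sketch only illustrates the planar warping step itself.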
The traditional way of augmenting objects with projector-camera systems (such as that taken by Pinhanez [Pinhanez 2001] and Raskar et al. [Bandyopadhyay, Raskar et al. 2001; Raskar, Beardsley et al. 2006]) is to store all information about the object in the projector system itself. Usually such systems are installed as infrastructure in the environment, creating a smart environment. This approach reduces flexibility, as it requires a priori knowledge about all possible objects which can enter the environment. This results in large databases of objects and consequently a higher possibility of inter-object confusion during detection.

1.2 Cooperative Augmentation

The core contribution of this thesis is the development of a new approach called Cooperative Augmentation to support physical objects simultaneously as input and output medium, as illustrated in Figure 1.1. Instead of storing knowledge about objects in a smart environment, the Cooperative Augmentation concept distributes knowledge into smart objects. Hence, the intelligence becomes embodied in the smart objects inhabiting a space, not the space itself. By moving the knowledge and intelligence from the environment to the smart object, a projector-camera system no longer needs a priori knowledge of all objects. This allows us to make projector-camera systems ubiquitous, as they merely offer a generic display service to all smart objects.

In the Cooperative Augmentation approach, cooperation between smart object and projector-camera system plays a central role in detection, tracking and projection. The objects themselves become pro-active clients of the environment, allowing spontaneous use of projection for output capability by requesting use of the projector-camera display service. Continued cooperation allows smart objects to describe information to the projector-camera system that is vital to the visual detection and projection process, such as knowledge of their appearance. This knowledge is used to dynamically tailor the projector-camera services to each object.

Many objects possess embedded sensors designed for specific purposes. In the Cooperative Augmentation approach, sensor information from the object can be serendipitously integrated in the detection and tracking process. The additional information from sensing allows projector-camera systems to dynamically constrain the detection process and increase visual detection performance.

After detection, the smart object controls the interaction with projector-camera systems. The smart object issues projection requests to the projector-camera system, controlling how the projected output on its surfaces changes and allowing direct visual feedback to interaction.

Figure 1.1  Cooperative Augmentation of smart objects with Projector-Camera Systems
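Chapter 3 defines the Object Model that carries this self-description. Purely as an illustration of the kind of knowledge an object might hand over when it requests the display service, the hypothetical structure below bundles identity, appearance knowledge, sensing capabilities and a projection request. All field names and file names here are invented for the sketch and do not reflect the exact schema used in the implementation.

```python
from dataclasses import dataclass, field

@dataclass
class AppearanceDescription:
    # References to appearance knowledge the projector-camera system can use
    # for natural appearance detection (e.g. local features, colour histograms).
    sift_keypoint_file: str = "book_cover.sift"
    colour_histogram_file: str = "book_cover_lab.hist"
    model_3d_file: str = "book.obj"            # surface geometry for projection

@dataclass
class SensingCapabilities:
    has_accelerometer: bool = True             # movement events can constrain detection
    has_light_sensor: bool = False

@dataclass
class ProjectionRequest:
    surface_id: str = "front_cover"            # region of the 3D model to display on
    content_uri: str = "http://object.local/warning.png"   # hypothetical content source

@dataclass
class ObjectSelfDescription:
    object_id: str = "book-042"
    appearance: AppearanceDescription = field(default_factory=AppearanceDescription)
    sensing: SensingCapabilities = field(default_factory=SensingCapabilities)
    request: ProjectionRequest = field(default_factory=ProjectionRequest)

# A smart object would serialise such a description and send it over its
# wireless link when it first requests the generic display service.
```

The key design point this sketch is meant to convey is that everything the projector-camera system needs, from appearance knowledge to the requested content, travels with the object rather than being stored in the environment.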
The additional information from sensing allows projector-camera systems to dynamically constrain the detection process and increase visual detection performance. After detection the smart object controls the interaction with projector-camera systems. The smart object issues projection requests to the projector-camera system, controlling how the projected output on its surfaces changes and allowing direct visual feedback in response to interaction.
Figure 1.1 Cooperative Augmentation of smart objects with Projector-Camera Systems
1.3 Challenges
The key concepts of Cooperative Augmentation discussed above raise questions which we address in this thesis. The initial question is how we can model the object and projector-camera system so that objects can cooperate by describing relevant aspects of their knowledge. This modelling must allow applications using the Cooperative Augmentation approach to be written independently of a particular system, yet adapt themselves to it. A central problem in achieving displays on everyday objects (smart or not) is their detection and tracking. Only when an object is detected can the projector-camera system align its projection so the image is registered with the object’s surfaces. In this thesis we use vision-based detection as it does not require adding dedicated location system hardware to every object. In experimental prototypes, vision-based detection is commonly achieved with fiducial markers [Ehnes, Hirota et al. 2004]. However, with a view to ubiquitous augmentation of objects using the Cooperative Augmentation approach, it is more realistic to base detection on the natural appearance of objects. Such detection is a significant challenge in real-world environments, as objects naturally vary in their appearance. For example, one book could have a red cover, while a second book has a blue cover. Similarly, one book could be open and one closed. Here the book objects are fundamentally identical in concept and use, but vary in both appearance and form. Another challenge is that objects can appear at any distance and orientation with respect to a tracking camera. For example, we can see detail on a book cover when it is close to the camera, but this detail disappears when it is far away. Similarly, when we change our viewpoint from looking at the cover of the book to looking at its side we now see white pages instead of a colourful cover. To understand how best to detect and track smart objects using natural appearance detection methods we must therefore study the impact of scale and rotation. One of the limitations of vision-based detection is that it has many failure modes. Some common reasons for detection failure include changes in object appearance, changing illumination, occlusion of the object, distractions in the image and fast movement of the camera or object. The Cooperative Augmentation approach enables integration of sensor information in the detection process, but to understand how best to integrate this information we must study how such sensing can be exploited to increase detection performance.
1.4 Contributions
Giving an object output capability is valuable as it allows users to address tasks in physical space and receive direct feedback on the objects themselves, where it belongs in the real world. However, adding output capability to smart objects while preserving their original appearance and functions is challenging.
To address this challenge we present the Cooperative Augmentation approach, contributing the following to the area of ubiquitous computing:
1. The novel concept of Cooperative Augmentation, which enables:
a. Generic projection services in the environment.
b. Spontaneous use of projection capability by smart objects.
c. Detection and tracking of mobile objects in the environment.
d. Non-invasive output capability on smart objects.
e. Use of smart objects as input and output interface simultaneously.
2. Validation of the Cooperative Augmentation concept through:
a. A system architecture specifying the different cooperating components.
b. Implementation of the system architecture.
c. Three demonstration applications.
3. An investigation of natural appearance vision-based detection, providing:
a. A study to understand the impact of object scale and rotation on different natural appearance detection methods.
b. Insight into training requirements of different detection approaches.
4. An investigation of the integration of embedded sensing in the detection process, providing:
a. A study to analyse the increase in detection performance achieved with movement sensing in the target object.
b. Insight into the requirement for multiple detection approaches.
1.5 Thesis Structure
This thesis explores the cooperative augmentation of smart objects with displays using projector-camera systems:
Chapter 2 provides an overview of related work relevant for understanding the thesis in the four areas of Ubiquitous Computing, Augmented Reality, Computer Vision and Tangible User Interfaces. Here we aim to place our work relative to other research and explain some of the reasoning behind our approach.
Chapter 3 presents the Cooperative Augmentation conceptual framework in more detail, specifically the Object Model, the projector-camera system model and the cooperation process to achieve interactive displays on smart object surfaces.
Chapter 4 investigates four natural appearance computer vision object detection approaches and presents an experimental study exploring the issues of scale and rotation to enable detection of objects at any distance from the camera and in any pose. We also create an object appearance library for use in our studies.
Chapter 5 presents an experimental study exploring cooperative detection between the vision detection system and smart object, specifically analysing the increase in detection performance achieved with basic movement sensing in the target object. We also create a video test library for use in our studies.
Chapter 6 presents the architectural design and implementation of the conceptual framework developed in Chapter 3, and evaluates the implementation as a whole.
Chapter 7 presents three demonstration applications created using the architecture. These applications are developed to demonstrate three different areas of the architecture, specifically scenarios presenting the whole system from object entry to exit, interaction methods, and a scenario with multiple projectors and multiple objects.
Finally, Chapter 8 presents a summary of the thesis, the limitations of our approach and implementation, our conclusions and our future work plans.
Appendix A additionally describes the design and assembly of two steerable projector-camera systems, constructed for our experimental studies and demonstration applications. We identify characteristics of typical systems and present recommendations for building similar equipment.
We characterise our system and compare the performance with other related research projects. Appendix B provides additional examples of programming smart object state models for use with the Cooperative Augmentation framework. 5 2.1 INTRODUCTION Chapter 2 Related Work This chapter does not provide an exhaustive list of related research projects; instead it aims to cover the most relevant related work for understanding the thesis and to outline this thesis’ contribution in terms of other projects and similar research. 2.1 Introduction As shown in Figure 2.1, this work draws on four areas of computing: Ubiquitous Computing, Augmented Reality, Computer Vision and Tangible User Interfaces (TUI). The main area is Ubiquitous Computing, as seen on the far left in Figure 2.2, which was defined by Mark Weiser as the third wave of computing after monolithic and personal computing. This third wave involves the technology receding into the background and many computing devices being unobtrusively embedded in the environment to enrich our lives [Weiser 1996]. Figure 2.1 This work draws from 4 main areas of computing: Ubiquitous Computing, Augmented Reality, Computer Vision and Tangible User Interfaces (TUI) Figure 2.2 The "Weiser Continuum" of ubiquitous computing [Weiser 1996] to monolithic computing, taken from [Newman, Bornik et al. 2007] 6 2.1 INTRODUCTION The second area is augmented reality, which is a type of the Mixed-Reality (MR) visual display. Milgram and Kishino describe MR as a general set of technologies which involve the merging of real and virtual worlds [Milgram and Kishino 1994]. These technologies can be decomposed with respect to the amount of reality present in the interface to create the “virtuality” continuum, as shown in Figure 2.3. The continuum ranges from a zero virtuality interface in the real world (such as a pen and paper interface) to fully virtual environments where a user is immersed in surrounding virtual world (such as CAVEs). In contrast, Augmented Reality (AR) refers to cases where either a display of the real environment, or objects in the real world themselves are augmented by means of virtual objects, using computer graphics [Milgram and Kishino 1994]. This allows a shift in focus for human-computer interaction from a static interaction with a user sat at a desktop display, to one where the surrounding environment and physical objects in this environment become the interface. Typical applications for AR include navigation, visualisation and annotation for medical, assembly, maintenance and repair tasks, entertainment, education, mediated reality, collaboration of distributed users and simulation. In this thesis we concentrate on projector-based AR, however, a good general introduction to all aspects of AR can be found in [Azuma 1997; Azuma, Baillot et al. 2001]. Figure 2.3 The Milgram and Kishino “Virtuality” Continuum of Reality to Virtual Reality [Milgram and Kishino 1994], taken from [Newman, Bornik et al. 2007] To achieve an AR display so it appears correct for a user requires that the registration between the real and virtual world is exact. Consequently, for mobile users or dynamic environments the relative positions of the real and virtual must be continuously determined. This can be accomplished either by detecting and tracking objects, users, or both, depending on the scenario. Different approaches to tracking are possible, including using dedicated hardware such as mechanical, electro-magnetic and optical tracking systems [Rolland, Baillot et al. 2001]. 
In contrast, vision-based tracking using cameras allows a relatively low-cost and non-invasive approach, either using fiducial markers or markerless computer vision techniques [Lepetit and Fua 2005]. Traditionally, head mounted displays (HMD) have been used to view augmented reality displays [Kato and Billinghurst 1999]. However, HMD typically suffer from a number of drawbacks, such as limited Field Of View (FOV) due to their optics, low resolution due to the miniature displays and heavy weight. More recently, PDAs, tablet PCs and mobile phones have been used [Wagner and Schmalstieg 2006; Schmalstieg and Wagner 2007]. In addition to the encumbrance of the physical devices, these technologies share the problem of accommodation, as the augmentation appears at a different location and depth than the objects in real world. These technological and ergonomic drawbacks prevent them from being used effectively for many applications [Bimber and Raskar 2005]. In contrast, projector-camera systems and projector based AR offers a possible solution, enabling displays directly on the surfaces of objects in the real world. 7 2.2 UBIQUITOUS COMPUTING Figure 2.4 The Milgram-Weiser Continuum [Newman, Bornik et al. 2007] Although Weiser believed ubiquitous computing to be the opposite of Virtual Reality (VR), Newman et al. point out that VR is merely at one extreme of the Milgram continuum, and propose merging the two continuums to create the Milgram-Weiser Continuum (as shown in Figure 2.4). In this thesis we use projection to augment objects with interactive interfaces. The objects themselves are smart and cooperate with projector-camera systems to help with their detection and projection task. This places the work firmly in the Ubiquitous AR category of the Milgram-Weiser Continuum. To achieve the augmentation of the smart objects we use markerless computer vision techniques to detect and track the objects in the environment. Through our cooperative augmentation framework the objects themselves can become Tangible User-Interfaces (TUI), where location, orientation, object geometry, projected interfaces and sensors become methods for interacting directly with the projected displays, and hence, otherwise hidden knowledge in the object and environment. The following sections of this chapter investigate these areas in more detail. 2.2 Ubiquitous Computing This section describes the closely related fields of ubiquitous computing (responsible for the drive to make everyday objects smarter) and tangible user interfaces. In recent years the ubiquitous computing field has become broader; however, in this thesis we concentrate on augmenting physical-tangible smart objects with projected displays. Here we examine what smart objects are, investigate some typical uses and concentrate on understanding how the potential for output capability can benefit objects and users. Recent technological advances have allowed computing devices to be miniaturised enough to be embedded into everyday objects. The Smart-Its project envisioned these 8 2.2 UBIQUITOUS COMPUTING could be attached to objects to augment them with a "digital self" [Holmquist, Gellersen et al. 2004]. So called “smart objects” would have the ability to perceive their environment through sensors and provide resources to nearby users and objects via peer-to-peer communication and customisable behaviour. 
However, the computing itself is secondary to the object, with the computer placed in the background of a user’s interaction with the physical and social environment [Beigl and Gellersen 2003]. Mark Weiser proposed that “the real power of the concept comes not from any one of these devices: it emerges from the interaction of all of them. The hundreds of processors and displays are not a ‘user interface’ like a mouse and windows, just a pleasant and effective ‘place’ to get things done” [Weiser 1991]; hence, collections of smart objects become a collaborative interactive experience. Sensors providing objects with awareness of physical context, together with communication, allow objects to promote a digital presence and to become part of networked applications.
2.2.1 Sensor Nodes
There are many experimental sensor node platforms in use for augmenting everyday objects with sensing and computation; however, here we concentrate on two typical devices – Smart-Its and Motes. The Smart-Its platform for embedded computing has developed over the years, with early DIY versions [Strohbach 2004] and more recent versions such as the Particle computer [Decker, Krohn et al. 2005]. A basic objective of the Smart-Its platform is to be generic, to enable operation in mobile settings and have ad-hoc peer-to-peer interoperation, while allowing customisation of sensors, perception and context-awareness methods. This is achieved using a flexible system with hardware consisting of main-boards with a PIC18F6720 microcontroller for processing, a TR1001 transceiver for communication on 868MHz and extendable pluggable sensor boards, as shown in Figure 2.5. Power is provided by a 1.2V AAA rechargeable battery.
Figure 2.5 (left) Smart-Its design, (centre) Particle Smart-Its device main board, (right) Add-on sensor boards [Decker, Krohn et al. 2005]
The Particle device has a ball switch on the main board, which can be used for applications such as movement detection or power saving. Based on an analysis of typical sensor node applications the Smart-Its design developed two sensor boards: Sensor board (a), shown in Figure 2.5 (right), and Sensor board (b), which adds on-board processing with a PIC18F452 microcontroller, enabling applications that require more processing power. Both boards have a range of sensors available, including accelerometer sensors (2D or 3D), temperature sensor, light sensor (visible and IR), microphone, and force sensor. For actuation, the main board carries 2 controllable LEDs and a speaker for audio notification. More information on Smart-Its can be found in related publications [Beigl and Gellersen 2003; Holmquist, Gellersen et al. 2004; Decker, Krohn et al. 2005].
Figure 2.6 (left) Crossbow Technology Motes, (right) Moteiv Corporation (now Sentilla Corporation) Telos Mote [Inc 2007]
Motes were initially developed in a collaboration between the University of California, Berkeley and Intel Research [Nachman, Kling et al. 2005]. Motes are small, self-contained battery-powered devices with embedded processing and communication, enabling them to exchange data with one another and self-organise into ad-hoc networks. Hence, similar to Smart-Its, the Motes form the building blocks of wireless sensor networks. The Motes typically run a free, open source component-based operating system called TinyOS. Motes are manufactured by Crossbow Technology, Berkeley and Sentilla Corporation. The Mote family has a large number of designs, with an equally large number of pluggable daughter-boards.
They support many of the same sensors as Smart-Its devices such as the integrated temperature, light and humidity sensors shown in Figure 2.6 (right). More information can be found on the related company and university websites. 2.2.2 Smart Objects The Mediacup was one of the first projects to demonstrate sensing and computation embedded unobtrusively in an everyday object, as shown in Figure 2.7 (left). Here a coffee-cup was made smart by means of an attachable rubber foot containing computing to sense context with movement and temperature sensors, for example, sensing if the user drinks, or plays with the cup [Gellersen, Beigl et al. 1999]. It had the ability to communicate with other devices, such as smart watches and smart door plates to share its context. In this case, if a smart door-plate detected several Mediacups inside the room it displayed a “Meeting” sign, warning others. Through this embedded computing the system gave an added value in the backend, allowing applications such as a coffee cup that provided the location of the user and a map which visualised building activities. This was accomplished without changing the appearance or function of the cup itself or the behaviour of the user. For example, rather than adding a display to the cup (which would change its appearance and prevent it being put in the dishwasher), a smart watch displayed the cup temperature on its LCD. 10 2.2 UBIQUITOUS COMPUTING Figure 2.7 (left) Mediacup Object [Gellersen, Beigl et al. 1999], (centre) Cooperative Chemical Containers, (right) Hazard warning display using 3 LEDs [Strohbach, Gellersen et al. 2004] Strohbach et al. present a scenario for embedded computing in industrial environments [Strohbach, Gellersen et al. 2004]. In this scenario large industrial storage facilities potentially contain thousands of chemical containers with different contents. Health and Safety rules apply to where and for how long these containers can be stored. Instead of augmenting the environment to track all the objects, Strohbach et al. propose embedding the rules directly into the containers and using embedded sensing to detect their state and their location relative to nearby containers. The containers can now cooperate, sharing their context to determine whether they comply with the rules, and detect hazard conditions, such as potentially reactive chemicals stored together, critical mass of containers stored together and when containers are stored outside of approved areas [Strohbach, Gellersen et al. 2004]. Such potentially risky situations require action from the facility employees to avert any danger, but the employees are faced with potential problems – such as how to find one container that has been placed in the wrong place out of thousands, and where to return this particular container to. Strohbach et al. provided visual feedback to employees as three LEDs on the top of the container barrel – green for no hazard, yellow for a warning (e.g. container stored outside the approved area) and red for critical hazards (e.g. reactive chemicals in proximity), as shown in Figure 2.7 (right). Figure 2.8 (left) Proactive Furniture Assembly [Antifakos, Michahelles et al. 2004], (centre) Display Cube Object [Terrenghi, Kranz et al. 2006], (right) Tuister Object [Butz, Groß et al. 2004] The Smart Furniture project developed furniture with embedded computing, sensing and actuation to proactively guide the purchaser with the task of assembly. 
By attaching computing devices and sensors onto each individual piece of the furniture, the system can both recognise a user’s actions and determine the current state of the assembly [Antifakos, Michahelles et al. 2002; Antifakos, Michahelles et al. 2004; Holmquist, Gellersen et al. 2004]. The task of furniture assembly is inherently complex for an embedded system, both because the task itself can be accomplished in different ways and because the system must cater for different levels of user experience. For example, beginners may need a full walk-through, while experts may only need to be rescued if they assemble a piece incorrectly. The system augments both tools and different pieces of the furniture with different sensors, for example, a screwdriver with a gyroscope to sense rotation, a horizontal board with an accelerometer for orientation sensing, and a side board equipped with a force sensing resistor (FSR) to measure when the two boards are joined. Hence, the system monitors its state and suggests the next most appropriate action using strips of LEDs for visual feedback, as shown in Figure 2.8 (left). This feedback was a green pattern on both edges of the boards when correctly assembled, a red pattern when a mistake was made, or a flashing green when there were multiple alternatives. After alignment, individual green LEDs direct the user to tighten screws with the screwdriver. From their user study it was determined that augmented furniture with even simple LED notification allowed both a measurable decrease in assembly time and a reduction in assembly errors [Antifakos, Michahelles et al. 2004].
2.2.3 Tangible User Interfaces
While differing from ubiquitous computing, Tangible User Interfaces (TUI) research holds a common concern for physically contextualised interaction. In contrast to ubiquitous computing, TUI research is rarely interested in augmenting everyday objects, but is instead interested in producing new artificial objects that fuse together input and output within a single device. TUI research proposes using these devices to provide an explicit mapping between the physical and virtual worlds, enabling tactile sense exploration and spatial reasoning which exploits the human senses of touch and kinaesthesia, and allows familiar physical actions to be used as input [Ishii and Ullmer 1997]. Central to TUI is the concept of embodying information in tangible objects, where objects serve as tokens or containers for digital information (as demonstrated by Underkoffler et al. in the Luminous Room project, described in section 2.3). Similarly, physical objects can serve as input primitives or tools to manipulate digital information, creating the possibility of representing abstract entities and concepts with physical ones, potentially enabling more efficient cognition of the relationships involved [Valli 2005]. Physical objects can also be used to gather and infer information about the context of the user and their intentions – for example, if a user picks up a tennis racket, they may also be interested in the location of their tennis balls. Lamming and Bohm propose such a system using embedded computing to detect relative proximity between objects and warn a user if they forget items from their bag [Lamming and Bohm 2003]. Such context information can be useful even when an object is not being grasped by a user, for example, as objects may be placed in a particular spatial arrangement for an activity.
Here, adding more objects allows more degrees of freedom, and hence gives the user more control over the representation of complex spatial relationships. The physical object properties and affordances (which Norman defines as the qualities of an object that allow an individual to perform an action [Norman 1988]) themselves can also be used as part of the user interface – for example, a graspable object such as a cube can provide more than six-degree-of-freedom input by sensing interactions such as squeezing or twisting in addition to movement and rotation [Sheridan, Short et al. 2003]. The use of such affordances was demonstrated in the Cubicle project, where a tangible cube was augmented with embedded computing and sensors [VanLaerhoven, Villar et al. 2003; Block, Schmidt et al. 2004]. 3D accelerometers detected orientation and allowed detection of gestures such as shaking or placing the object down on a surface to control a home entertainment centre, as shown in Figure 2.9.
Figure 2.9 Using a sensor-enabled Cubicle for interaction [Block, Schmidt et al. 2004]
The Display cube, shown in Figure 2.8 (centre), was designed by Terrenghi et al. as a child’s learning appliance that exploited the familiar physical affordances of a cube, while augmenting it with embedded sensing, computing and displays [Terrenghi, Kranz et al. 2006]. The internal cube hardware uses a Smart-Its platform, 3D accelerometers to sense orientation, an LCD display on each face and a speaker for audio feedback. The user interface was designed for games and quizzes, for example, a matching task where a letter was shown on the top face and children had to select the matching character from the other faces. Applications had to be specially designed to cope with the low-resolution display and the lack of electronic compass sensors, which prevented sensing of the rotation around the gravity vector. However, in the study Terrenghi et al. found that users could still easily read letters and numbers in any orientation, suggesting that lack of orientation sensing is not a major issue for very simple text and graphics. The user study also illustrated additional benefits of a tangible cube, with children quickly engaging with the games due to the shaking and turning, and cooperative play with children demonstrating to each other the solutions to the tasks.
Figure 2.10 (left) Prototype mock-up display cube using static front-projection, (centre) The current LED-matrix display Z-agon interactive cube, (right) The envisioned fully interactive cube with colour displays [Matsumoto, Horiguchi et al. 2006]
Matsumoto et al. envision a device similar to the Display cube, but with seamless displays covering each side of their Z-agon cube. They believe that users will be able to interact more easily with digital information when presented with an intuitive tangible interface [Matsumoto, Horiguchi et al. 2006]. For their initial prototype they front-projected onto a large-scale mock-up, as shown in Figure 2.10 (left). Their current prototype is a 2.5-inch cube using low-resolution LED-matrix displays and 3D accelerometers; however, they envision a fully interactive high-resolution video-player and communication device with colour displays, as shown in Figure 2.10 (right).
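The orientation sensing that the Cubicle and Display cube rely on can be recovered from a 3-axis accelerometer alone by testing which axis gravity dominates. The following is a minimal illustrative sketch of this idea, not the actual Cubicle or Display cube firmware; the axis-to-face mapping and the 0.8 g threshold are assumptions:

    # Illustrative sketch: infer which face of a cube points up from one
    # 3-axis accelerometer sample (in units of g). Axis-to-face mapping and
    # threshold are assumptions for illustration only.
    def face_up(ax, ay, az, threshold=0.8):
        """Return the face currently pointing up, or 'unknown' while moving/tilted."""
        readings = {"+x": ax, "-x": -ax, "+y": ay, "-y": -ay, "+z": az, "-z": -az}
        face, value = max(readings.items(), key=lambda item: item[1])
        # When the cube rests on a surface gravity dominates one axis; while it
        # is shaken or strongly tilted no axis is near 1 g, so report 'unknown'.
        return face if value > threshold else "unknown"

    print(face_up(0.02, -0.05, 0.98))   # cube lying flat -> '+z'
    print(face_up(0.40, 0.30, 0.45))    # cube being handled -> 'unknown'

Such a rule cannot, however, distinguish rotation about the gravity vector, which is exactly the limitation of the Display cube noted above.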
The Tuister was designed as a one-dimensional input and output device based on discussions with cognitive psychologists, who indicated that cubes may have too many degrees of freedom for users to locate and remember a display position (Sheridan et al. identified 17 non-verbal input affordances [Sheridan, Short et al. 2003]). Instead, the psychologists believed people were likely to lose track of their movements in a complex series of motions [Butz, Groß et al. 2004]. Consequently, Butz et al. designed the Tuister for browsing hierarchical nested menu structures with hardware consisting of two parts – a handle and a display part which rotates to scroll through the menus, as shown in Figure 2.8 (right). Embedded in the display part are six organic LED displays, each with 64x16 pixel resolution, to display small symbols or short text. Sensors in a Smart-Its device detect the rotation of the display part with respect to the handle. Butz et al. envision Tuister as a personal multi-purpose device, carried to interact directly with pervasive environments and serving as a universal remote control.
2.2.4 Input-Output Imbalance
The increasing amount of technology embedded into the environment enables new interaction metaphors beyond the traditional GUI paradigm. In the ubiquitous computing field there has been much research on input to smart objects, for example, on embedding sensing, location systems and methods for accumulating and distributing knowledge. However, there is little research on output methods to allow the user to visualise this knowledge and interact with it. Similarly, tangible user interface work allows physical manipulation and spatial arrangement of objects, but visual feedback to the user is mostly still provided by displays in the environment instead of the device itself [Butz, Groß et al. 2004]. This leads to an indirection in the interaction and an input-output imbalance. As we have seen from sections 2.2.2 and 2.2.3, there is a real potential for objects and users to benefit from the delivery of complex and variable visual information on the object itself. This potential can be seen in diverse situations such as hazard monitoring in safety-critical workplaces, assembly instruction, or visual feedback to object manipulation in tangible user interfaces. The objects we examined attempted to redress the input-output imbalance either by using other displays in the environment or by embedding displays in the physical objects themselves. These embedded displays took different forms, such as simple LEDs (e.g. the Chemical Containers and Smart Furniture) or multiple graphical displays (e.g. the Display cube, Z-agon and Tuister). However, the display technology currently used has many limitations. For example, LED displays can only convey a small amount of information to users (such as flashing to indicate an error condition), while graphical displays are expensive, have high power requirements (so are difficult to integrate in low-power mobile objects) and change the appearance and function of an object. This motivates our approach in this thesis of using non-invasive projected displays.
2.3 Projector-based Augmented Reality
The approach we investigate in this thesis uses projector-based augmented reality techniques to dynamically annotate physical and tangible smart objects with interactive projected information, solving their output problem.
In this section we aim to 14 2.3 PROJECTOR-BASED AUGMENTED REALITY understand how we can create interactive, undistorted and visible projected displays on objects of varying shapes and appearances. Projector-based Augmented Reality uses front-projection to project images directly onto physical object’s surfaces, not just in a user’s visual field. [Raskar, Welch et al. 1999]. Unlike physical displays such as computer monitors, projection-based displays allow integration with the existing appearance of an object to create seamless displays [Pinhanez and Podlaseck 2005]. The displays they create are non-invasive, as they do not require any hardware in the object being augmented. This feature allows projected displays to be used almost anywhere, in situations where physical displays would not be used, for example, due to cost, vandalism concern or hazardous environments. Projection can be used to change or supplement the functionality of the object - most commonly by transforming a non-augmented object “into an access point to the virtual information space” [Pinhanez 2001]. Projectors are able to create images much larger than the display device itself, allowing even small portable projectors to augment large objects. The traditional AR display technologies (Head Mounted Displays (HMD), PDA, tablets or mobile phones) are inherently encumbering and limited to a single user. In contrast, projected displays have scalable resolution, improved ergonomics, easier eye accommodation (as the graphics appear at the same distance in the real world as the objects) and a wide field of view, so are visible to multiple people. These properties enable a greater sense of immersion and have the potential to increase the effectiveness of multi-user interaction or co-located group-working in an object’s physical space [Bimber and Raskar 2005]. Some of the earliest work in Projector-based Augmented Reality was Underkoffler, Ullmer and Ishii’s Luminous Room project [Underkoffler, Ullmer et al. 1999]. This project developed a concept for providing graphical display and interaction on each surface of an environment. Co-located two-way optical transducers –called I/O bulbs– that consist of projector and camera pairs capture the user interactions and display the corresponding output. The Luminous Room also demonstrated the possibility of embodying information in tangible objects. Here, objects could “save” associated digital information, such as the chess board state, allowing the user to remove the object from the projected surface. The projected interface would then re-appear in the saved state whenever the object was placed back on the surface. Figure 2.11 (left) A chessboard with memory (centre) Interactive optics-design, (right) A prototype I/O bulb projectorcamera system in the Luminous Room [Underkoffler, Ullmer et al. 1999] Underkoffler et al. present several design principles based on their experience in the luminous room. For example, people were often unable to distinguish between the real object and virtual projection. Consequently, Underkoffler et al. recommended subtle 15 2.3 PROJECTOR-BASED AUGMENTED REALITY animation was applied to the projections to make people aware of the virtual nature of the projections. This reduced ability to distinguish between real and virtual was used deliberately in the Spatially Augmented Reality project [Raskar, Welch et al. 
1999], which captured a physical object’s inherent colour, texture and material properties (such as reflectance) with a camera, replaced them with a white object and projected imagery exactly reproducing the original appearance of the object. Raskar et al. showed that visually there was no difference between the original object illuminated with white light and the white object illuminated with the original appearance, indicating the ability to separate an object’s appearance from its form. Using this separation Raskar et al. demonstrated alternate appearances, lighting effects and animation by simply changing the projection. This was also one of the first projects to demonstrate that projection is not restricted to a planar surface or a single projector, by augmenting complex 3D objects such as a model representation of the Taj Mahal mausoleum in India with multiple static projectors.
Figure 2.12 Spatially Augmented Reality (SAR) [Raskar, Welch et al. 1999]
As demonstrated by the I/O bulb and Spatially Augmented Reality project, the addition of a camera to the projector creates a feedback loop and allows the display to become interactive. Projector-camera systems enable un-encumbered and un-tethered interactive displays on any surface, at any orientation. No special hardware is required in the display surface itself to support this interaction.
2.3.1 Projector-Camera Systems
The projection-based AR technology discussed above forms part of a larger family of projector-camera systems. Here we decompose the family into three categories with respect to display mobility:
1. Static projector-camera systems (such as multi-projector display walls)
2. Mobile, handheld and wearable projector-camera systems
3. Steerable projection from a static system with pan and tilt hardware
These categories can be seen more clearly in Figure 2.13, when compared to a traditional desktop monitor, which is a static, single-user technology with a small display size. All types of projector-camera system have been used for augmenting objects with projection; however, static [Bandyopadhyay, Raskar et al. 2001], mobile, handheld [Raskar, Beardsley et al. 2006] or wearable [Karitsuka and Sato 2003] (shown in Figure 2.15) projector-camera systems can only opportunistically detect and project on objects passing through the field of view of the projector and camera.
Figure 2.13 The Projector-Camera system family decomposed by mobility and display size, compared to traditional desktop monitors
In contrast, projector-camera systems in the third category, with computer-controlled steerable mirrors or pan and tilt platforms [Pinhanez 2001; Borkowski, Riff et al. 2003; Butz, Schneider et al. 2004; Ehnes, Hirota et al. 2004], allow a much larger system field of view and the ability to track objects moving in the environment. We use the generic term “Steerable Projectors” for these systems.
2.3.2 Mobile, Handheld and Wearable Projector-Camera Systems
Raskar et al. initially developed the concept of handheld projector-camera systems in the “intelligent Locale-Aware Mobile Projector” (iLamps) project [Raskar, VanBaar et al. 2005]. As seen in Figure 2.14, the handheld projector-camera system included on-board computing, a tilt-sensor and network access. The projector-camera system was initially calibrated to determine the intrinsic (optical) parameters and extrinsic pose (relative locations and orientations) of projector and camera.
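Intrinsic calibration of the camera half of such a system is commonly performed from images of a planar checkerboard. The sketch below uses OpenCV to illustrate the general procedure only; it is not the iLamps calibration code, and the board size and file names are assumptions:

    # Illustrative sketch of camera intrinsic calibration from checkerboard
    # images with OpenCV. Board size and file names are assumptions.
    import glob
    import cv2
    import numpy as np

    BOARD = (9, 6)  # inner corners per row and column (assumed)
    objp = np.zeros((BOARD[0] * BOARD[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:BOARD[0], 0:BOARD[1]].T.reshape(-1, 2)  # board plane, z = 0

    obj_points, img_points, image_size = [], [], None
    for path in glob.glob("calib_*.png"):  # hypothetical set of captured views
        gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
        image_size = gray.shape[::-1]
        found, corners = cv2.findChessboardCorners(gray, BOARD)
        if found:
            obj_points.append(objp)
            img_points.append(corners)

    # K holds the intrinsic (optical) parameters; rvecs and tvecs give the
    # extrinsic pose of the board relative to the camera for each view.
    rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
        obj_points, img_points, image_size, None, None)
    print("reprojection error:", rms)

Calibrating the projector side typically reuses the same mathematics by treating the projector as an inverse camera, for example by projecting known patterns onto the calibration plane; the exact procedure varies between systems.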
Projection-based object adaptive displays were then demonstrated using circular fiducial markers to allow the system to calculate the camera pose (hence the projector pose). With projector pose known relative to an object (in this case the black rectangle with a fiducial marker in Figure 2.14), a projection can be registered with the object so it is overlaid on its surfaces. Due to the fixed field of view of a projector, mobile and handheld projectors can only be used to reveal information in the environment in a similar way to a flashlight being used to illuminate a surface. This “peephole” metaphor [Butz and Krüger 2003] allows the user to see the windows and interfaces by “painting them with light”, onto an area of the environment. Sony first presented a prototype handheld projector (shown in Figure 2.15) containing accelerometers to measure the hand movement and paint a small world stabilised image on surfaces [Rapp, Michelitsch et al. 2004]. This concept was extended by Raskar et al. to project onto photo-sensing objects in the environment with their Radio-Frequency Id and Geometry (RFIG) project [Raskar, Beardsley et al. 2006]. 17 2.3 PROJECTOR-BASED AUGMENTED REALITY Figure 2.14 (left and centre) The iLamps Project developed a handheld projector-camera system, (right) Detecting circular fiducial markers to augment a wall scene with projection Handheld projectors are now widely researched due to recent technology developments in micro displays and cheap, long-life LED and LASER light sources. These developments have caused a drastic reduction in the size of projectors, which will soon enable projectors to be carried in a pocket or embedded in mobile devices. Some recent research concentrates on calibration of handheld projectors [Dao, Hosoi et al. 2007], while other projects make use of the ability of projectors to create displays larger than the device itself, focussing on collaborative interaction techniques using multiple handheld projectors [Cao and Balakrishnan 2006; Cao, Forlines et al. 2007]. Figure 2.15 SONY’s Handheld spotlight projector [Rapp, Michelitsch et al. 2004], (centre left) Multi-user interaction using handheld projectors [Cao, Forlines et al. 2007], (right) Wearable Projector-camera system [Karitsuka and Sato 2003] 2.3.3 Multi-Projector Display Systems Traditionally, multi-projector displays were created by time-consuming mechanical alignment of the individual projectors to abut the images. Periodic re-calibration was often necessary, due to factors such as vibration which caused increasing alignment inaccuracies over time. As part of the iLamps project [Raskar, VanBaar et al. 2005], Raskar et al. demonstrate the first ad-hoc clustering of overlapping projectors, using camera-based registration and image blending to create a single geometrically seamless display (see Figure 2.16, right). The project also developed shape adaptive displays, where the display was geometrically corrected to appear undistorted on multi-planar and curved quadric surfaces using the least-squares conformal mapping approach proposed by Levy et al. [Levy, Petitjean et al. 2002]. For perceptually seamless displays, multi-projector displays require photometric and colourmetric calibration in addition to geometric calibration to ensure uniformity of brightness and colour across the display. This is discussed further in section 2.3.7. 18 2.3 PROJECTOR-BASED AUGMENTED REALITY Figure 2.16 (left) The PixelFlex reconfigurable multi-projector display [Yang, Gotz et al. 
2001], (right) Ad-hoc projector clustering and geometric correction for seamless display with iLamps [Raskar, VanBaar et al. 2005] 2.3.4 Steerable Projector-Camera Systems Pinhanez at IBM proposed creating “interactive displays everywhere in an environment by transforming any surface into a projected touch screen” [Pinhanez 2001]. The system, which Pinhanez named the Everywhere Display (ED), uses a video projector and steering mechanism together with a camera to enable vision-based interaction with the projected imagery. The steering mechanism increases the area in which the display can be used, while the combination of projector and camera forms a powerful closed-loop visual control system, allowing display adjustment and unencumbered user interaction directly with the display. Figure 2.17 IBM’s Everywhere Display demonstration at SIGGRAPH [Pinhanez 2001] Pinhanez proposed steerable projector-camera systems for two classes of applications: interactive computer displays and spatial augmented reality. The steering mechanism enables ubiquitous interactive displays, as they can be situated on any surface or object within the field of view of the projector [Pingali, Pinhanez et al. 2003]. IBM first demonstrated simple augmented objects in their Everywhere Display demonstration at SIGGRAPH, seen in Figure 2.17. Here a table was augmented with a simple image display onto which the user could place individual coloured M&M candy “pixels” to build up a picture. To place the correct colour candy the steerable projector directed the user to a series of paint tins containing M&Ms, onto which were projected virtual buttons. The camera system detected user touch of this virtual button, transforming the paint cans into interactive interfaces. With steerable projection, the information required in any situation can be brought to the location it is required. Pinhanez demonstrates the concept of projected information that is “attached” to a spatial location [Pinhanez 2001], allowing phone books to appear at the location of the phone when the user picked up the handset, however, this location 19 2.3 PROJECTOR-BASED AUGMENTED REALITY was static and had to be programmed into the system. Pingali et al. approach the idea of location based interaction from a different direction, by creating a display that followed the user, allowing important information to be constantly visible [Pingali, Pinhanez et al. 2002]. In this project a series of static display areas were pre-calibrated around the room, then a camera used to track the user. The steerable projector calculated the nearest visible display area to the detected user, and used this as the display. As shown in Figure 2.18, Pinhanez et al. envision other uses, such as ambient displays in a user’s periphery for messaging notification, dynamic navigation signage and augmenting objects for ubiquitous computer access. Figure 2.18 (left) Messaging notification, (centre) dynamic navigation signage, (right) augmented filing cabinet object [Pinhanez 2001] The steerable projector concept has been developed further, with the FLUIDUM intelligent environment project using fiducial markers (see section 2.5) tagged onto objects such as books to locate them [Butz, Schneider et al. 2004]. Following a prescan of the environment, the system could successfully find and highlight objects that had not moved. Dynamic displays can also be created on mobile objects, such as the Portable Display Screen demonstrated by Borkowski et al. [Borkowski, Riff et al. 2003]. 
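Returning to the user-following display described above, once display areas are pre-calibrated the behaviour reduces to a simple selection problem: choose the nearest area currently visible to the tracked user. The following is a minimal sketch under assumed data structures, not the actual Everywhere Display or EDML code:

    # Minimal sketch of selecting the pre-calibrated display area nearest to a
    # tracked user, in the spirit of the user-following display described above.
    # The data structures and the visibility flag are assumptions.
    from dataclasses import dataclass
    import math

    @dataclass
    class DisplayArea:
        name: str
        centre: tuple          # (x, y, z) of the pre-calibrated surface, in metres
        visible: bool = True   # e.g. set to False when the area is occluded

    def nearest_display(user_pos, areas):
        """Return the closest visible display area to the user, or None."""
        candidates = [a for a in areas if a.visible]
        if not candidates:
            return None
        return min(candidates, key=lambda a: math.dist(user_pos, a.centre))

    areas = [DisplayArea("wall-north", (0.0, 1.5, 4.0)),
             DisplayArea("table", (2.0, 0.8, 1.0)),
             DisplayArea("door", (4.0, 1.2, 0.0), visible=False)]
    print(nearest_display((1.5, 1.7, 1.2), areas).name)   # -> 'table'

A real system would also reason about surface visibility and occlusion, which the visible flag merely stands in for here.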
The display screen was a sheet of white card, modified by adding a black border to allow tracking with computer vision. The system allowed the user to carry a fully interactive display without any power or infrastructure requirements, apart from the projector-camera system. Ehnes et al. demonstrate a similar system [Ehnes, Hirota et al. 2004], where the white Personal Interaction Panel (PIP) was tracked using a fiducial marker. Figure 2.19 (left and centre) Portable Projected Display Screen [Borkowski 2006], (right) Personal Interaction Panel [Ehnes, Hirota et al. 2004] 2.3.4.1 Existing Steerable Projector Frameworks Levas et al. presented a framework for steerable projector-camera systems to project onto objects and surfaces in their Everywhere Display EDML framework [Levas, Pinhanez et al. 2003]. As shown in Figure 2.20, the EDML architecture comprised three 20 2.3 PROJECTOR-BASED AUGMENTED REALITY layers – the lower services level containing the actual hardware dependant implementation, the middle integration layer which abstracts and synchronises the hardware, and the high level application layer. The lower levels provided applications for explicit user modelling of displays using a 3D world modelling tool and provided user localisation and geometric reasoning for use in applications such as the user following display [Pingali, Pinhanez et al. 2002]. The integration layer provided the main classes of the API to build applications with the framework, event management, geometric distortion correction and handling of interactive content to be rendered on the virtual displays. While supporting a distributed architecture, the framework has a number of limitations. It is reliant on the user to explicitly model their world and pre-calibrate displays, hence is limited to known, static environments. Although dynamic occlusion detection of the pre-calibrated displays is possible, this is implemented by vision-based head tracking. Hence, the system cannot detect any non-human occlusion of precalibrated display locations (for example, by furniture). The framework also does not address any methods for multiple projectors to work cooperatively, assuming only a single projector in any environment. Figure 2.20 IBM’s Steerable Interface EDML Framework Pingali et al. [Pingali, Pinhanez et al. 2003] build on the EDML framework design to characterise steerable interfaces as exhibiting the six following qualities: 1. Moveable output interface - the ability to move video and audio around spatially. 2. Moveable input interface – such as steerable cameras and directional microphones. 3. Adaptation to user context - for example, the user’s location and orientation. 4. Adaptation to environment - reasoning about the geometry and properties of surfaces, adapting to dynamic conditions such as occlusions and ambient noise. 5. Device-free interaction - using multi-modal input techniques for interaction. 21 2.3 PROJECTOR-BASED AUGMENTED REALITY 6. Natural interaction - Intuitive and useable interaction, sensitive to user context. Figure 2.21 The IBM Steerable Interfaces Characterisation [Pingali, Pinhanez et al. 2003] Although the characteristics Pingali et al. propose are a good way to describe steerable interfaces, they do not directly address the Ubiquitous Computing vision of the future, which proposes computing embedded into everyday objects. 
It is conceded that “special purpose” devices for interaction could be accommodated; however, there is no support for projection on mobile smart objects in their interaction paradigm. We construct a steerable projector-camera system to enable vision-based detection, tracking and projection onto smart objects in our work. This steerable projector-camera system is discussed further in Appendix A.
2.3.5 Interaction with Projector-Camera System Displays
Without input techniques, the ability to create displays on objects is analogous to having a set of portable televisions which can be moved around at will, but only display one channel. Enabling the most functionality for users requires a balanced interface; hence, in this section we look at how we can interact with projected displays.
Figure 2.22 Interaction Techniques for Projected Displays
Many interaction techniques can be used with projected displays. Some of the main techniques in use are shown in Figure 2.22. Interaction can be primarily decomposed into on-surface techniques (where the user has physical contact with the projection surface) and off-surface techniques (where the user is remote). We also decompose the interface technology itself into vision-based and non-vision-based sensing. The challenge with all interaction techniques is to distinguish between pointing or hovering and activating [Buxton 1990]. For control activation one issue is that the interaction techniques discussed here all provide little tactile feedback. Users cannot feel the edges of keys or the force when pressed, so performance will likely be much lower than with normal keyboards or mice. Consequently, users will not be able to achieve eyes-free interaction [Lewis and Potosnack 1997]. However, common solutions to increase performance can be used, such as generating audio or visual feedback for the user whenever a control is activated (e.g. simulating a key press click).
2.3.5.1 Off-Surface Interaction
Off-surface interaction techniques primarily use pointing interfaces (LASER pointing [Kirstein and Müller 1998] or finger pointing) with vision detection. Pointer interaction techniques are similar to mouse interaction in that they allow target selection and gestures. However, Oh and Stuerzlinger’s experiments reported that a laser pointer is only approximately 75% as fast as a mouse at selection [Oh and Stuerzlinger 2002]. Similarly, Laberge and Lapointe report that by using a homography-based camera-to-screen calibration (see section 2.3.6), a typical accuracy of only ± 2 pixels could be achieved [Laberge and Lapointe 2003]. Cheng and Pulo [Cheng and Pulo 2003] also state that laser pointer interaction suffers from three additional problems:
1. Pointer instability due to jitter in hand movement.
2. On-screen pointer latency relative to the LASER spot due to low camera frame rates.
3. Selection of objects is restricted to gestures, pointer on/off events or dwell times (a simple dwell-based activation scheme is sketched below).
They propose hiding the hand jitter and latency issues by either not showing an on-screen cursor (relying on the laser pointer spot itself as visual position feedback), or by using an invisible Infra Red laser pointer and showing only an on-screen cursor.
2.3.5.2 On-Surface Interaction
On-surface interaction techniques based on non-vision techniques use a variety of mechanical sensing methods to detect the location of fingers in real-time.
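The pointing-versus-activating problem noted at the start of this section applies to both the off-surface techniques above and the on-surface techniques that follow. A common software-only workaround is dwell-time selection combined with simple smoothing of the pointer position. The following is a minimal illustrative sketch; the radius, dwell time and smoothing factor are assumed values and this is not the implementation of any system discussed here:

    # Minimal sketch of dwell-time activation for a pointer-style interface
    # (laser spot, tracked fingertip, ...). All parameters are illustrative.
    import math
    import time

    class DwellClicker:
        def __init__(self, radius_px=15.0, dwell_s=0.8, smoothing=0.3):
            self.radius = radius_px   # how still the pointer must stay
            self.dwell = dwell_s      # how long it must stay there
            self.alpha = smoothing    # exponential smoothing of raw positions
            self.filtered = None
            self.anchor = None
            self.since = None

        def update(self, x, y, now=None):
            """Feed one raw pointer sample; returns (x, y) when a dwell 'click' fires."""
            now = time.monotonic() if now is None else now
            if self.filtered is None:
                self.filtered = (x, y)
            else:  # smooth out hand jitter before testing for dwell
                fx, fy = self.filtered
                self.filtered = (fx + self.alpha * (x - fx), fy + self.alpha * (y - fy))
            if self.anchor is None or math.dist(self.filtered, self.anchor) > self.radius:
                self.anchor, self.since = self.filtered, now   # pointer moved: restart timer
                return None
            if now - self.since >= self.dwell:
                self.since = now                               # fire, then re-arm
                return self.anchor
            return None

As suggested above, audio or visual feedback on each fired selection would normally accompany such a scheme to compensate for the lack of tactile feedback.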
The main drawback with all these techniques is the requirement for sensing equipment to be embedded in the display surface and the need to then calibrate the display-to-surface relationship to achieve correct sensor-projector alignment. Capacitive and resistive touch sensors are widely used in products such as smart whiteboards, laptop trackpads and touch-screen overlays. However, this technology is typically restricted to a single user and a maximum of one or two points of contact. Systems such as the Mitsubishi DiamondTouch system [Dietz and Leigh 2001] and Rekimoto’s SmartSkin [Rekimoto 2002] are able to support multiple simultaneous users with multiple points of touch, however, only DiamondTouch can differentiate between users. This relies on capacitive coupling and requires each user to remain in contact with a signal receiver to be visible to the system, limiting the interaction serendipity. Weight surfaces triangulate the location of objects and fingers placed on their surfaces using weight sensors, allowing them to be used as generic pointing devices [Schmidt, Strohbach et al. 2002]. This approach has several benefits – it is robust, 23 2.3 PROJECTOR-BASED AUGMENTED REALITY reliable and can preserve the original functionality of the surface, allowing many surfaces to be augmented cheaply. The technique can also successfully determine active and passive loads, for example, allowing it to differentiate between a static mug on a table and a user’s finger. However, it cannot differentiate between multiple users. Virtual Keyboard devices are designed to be used with mobile phones and PDAs to allow fast, convenient text input without the bulk and portability issues of a physical keyboard. The devices project an image of a keyboard on a planar surface and use optical sensing (typically LASER or infra red break-beams) to detect when a user’s finger enters the area of a key, generating a key press event. The reader is referred to [Kölsch and Turk 2002] for a full survey of virtual keyboards. Examples of tangible interfaces are discussed separately in section 2.2.3. Vision-based detection is attractive as it is non-invasive, requiring nothing more than a remote camera to detect interaction. Hence, hand and finger tracking is a large area of its own in Computer Vision research. Researchers typically employ a number of approaches, such as correlation tracking, model-based contour tracking, foreground segmentation and shape filtering [Borkowski, Letessier et al. 2004]. However, visionbased interfaces must contend with many challenges, such as variable and inconsistent ambient lighting, reflections, strong shadows and camera occlusion. Pinhanez et al. [Pinhanez, Kjeldsen et al. 2002] proposed that the two goals for a vision-based finger detection system are to detect when a user touches a projected “button” and to track where a user is pointing on a projected screen. For some rearprojection screens, vision can be used directly through the surface of the screen itself, such as with Frustrated Total Internal Reflection (FTIR) sensing [Han 2005] or stereovision for detecting finger location and distance from the surface [Wilson 2004]. However, for interaction with front-projected displays typical vision approaches perform poorly due to the potential for projected light to radically change the appearance and colour of a hand, as was shown in Figure 2.17 (right). This can render traditional techniques that track the hand’s shape or skin colour unreliable. 
Background subtraction also will not work if dynamic imagery is being projected significantly brighter than the background. As a result, some researchers have used Infra Red (IR) for detection. However, this approach typically requires removing IR-blocking filters from consumer cameras, then physically fitting an IR-pass filter (preventing detection of visible light, and hence, colour) and separate IR illumination sources. Some projects make use of shadows from separate IR illuminators or the projection light itself, using off-axis cameras to detect a finger’s shadow on the projection surface. The location of the shadow relative to the detected finger allows calculation of the distance to the surface and the location when on the display [Kale, Kwan et al. 2004; Wilson 2005]. However, for on-axis cameras (such as a camera attached to a projector) Pinhanez et al. propose using a motion-based approach to overcome the majority of the problems associated with front-projection displays [Pinhanez, Kjeldsen et al. 2001]. The motion-based approach involves subtracting each frame from the frame before to create a motion mask. This approach can create a large amount of noise on the image, due to changes in the projected image or movement of objects and surfaces. Hence, morphological erode and dilate operations are performed to reduce noise and analyse neighbourhoods rather than individual pixels (this motion-mask step is sketched below). Pinhanez et al. first tried a fingertip template matching approach in their Everywhere Display interfaces; however, this approach was easily fooled as any occlusion of the projected interactive area generated false positives [Kjeldsen 2005]. The small fingertip template used could also easily match at multiple locations on an image, so a concept of user location had to be introduced and the match furthest from a user was assumed to be the correct fingertip (requiring a tracked user).
Figure 2.23 Fingertip matching in (a) Camera image, (b) Motion mask, (c) Finger tip template match [Pinhanez, Kjeldsen et al. 2001]
Kjeldsen proposes using a new technique called “Polar Motion Maps” [Kjeldsen 2005], which splits the circular area surrounding an interactive button (Rw in Figure 2.24) into segments like a cartwheel and analyses each segment for the direction of movement of the energy in the motion mask.
Figure 2.24 (far left) Polar Motion Map, (left) the corresponding segments unrolled to a rectangle, (right) PMM in use [Kjeldsen 2005]
The PMM segments can be converted from polar to Cartesian coordinates, and the graph analysed for long narrow objects that approach from a consistent direction, do not pass through the target (Rw) and then retract in the direction from which they approached. This “lightning strike” movement was found by Kjeldsen to be indicative of touching movements, allowing events such as movement of a hand through the target or occlusion (which would not exhibit narrow profiles that retract in the same direction of arrival) to be disregarded. Kjeldsen also proposes allowing users to define an interface arrangement themselves using tangible paper-based representations of user interface widgets [Kjeldsen 2005]. Here, the location of the paper objects defines the target for interaction and the identity of the object determines what interaction is possible.
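The motion-mask step referred to above – frame differencing followed by morphological erosion and dilation – can be illustrated in a few lines of OpenCV. This is a sketch of the general technique only, not the Everywhere Display implementation; the threshold, kernel size and camera index are assumptions:

    # Sketch of the motion-mask step: difference consecutive frames, threshold,
    # then erode and dilate to suppress pixel noise. Parameters are assumptions.
    import cv2
    import numpy as np

    def motion_mask(prev_gray, curr_gray, thresh=25, kernel_size=3, iterations=2):
        diff = cv2.absdiff(curr_gray, prev_gray)                  # per-pixel change
        _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
        kernel = np.ones((kernel_size, kernel_size), np.uint8)
        mask = cv2.erode(mask, kernel, iterations=iterations)     # remove speckle noise
        mask = cv2.dilate(mask, kernel, iterations=iterations)    # restore blob size
        return mask

    cap = cv2.VideoCapture(0)                                     # hypothetical camera
    ok, frame = cap.read()
    prev = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    while ok:
        ok, frame = cap.read()
        if not ok:
            break
        curr = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        mask = motion_mask(prev, curr)
        # The energy of this mask inside a projected button's region of interest
        # is what techniques such as the Polar Motion Map then analyse.
        prev = curr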
The Everywhere Display [Pinhanez 2001] vision system was then used to automatically generate an interactive interface, based on the spatial arrangement and identity of widgets it detects, as shown in Figure 2.25 (left). Letessier and Bérard proposed another, simpler solution [Letessier and Bérard 2004]. By manually increasing the exposure of the camera until the projected light appears overexposed, the hand gains the correct exposure, as shown in Figure 2.25 (right). Using this method traditional finger tracking techniques can be used, but require the camera exposure to be set manually to optimise the finger visibility and assumes that the light levels in the whole environment remain constant. 25 2.3 PROJECTOR-BASED AUGMENTED REALITY Figure 2.25 (left) Using a red paper slider widget [Kjeldsen 2005], (right) Increasing camera exposure to compensate for projector light [Letessier and Bérard 2004] Letessier and Bérard’s method is not useful in our work; when we increase the exposure of the camera to remove the projection, this will over-expose the view of the objects, preventing us from tracking them with the camera. Consequently, in our work we use the method proposed by Pinhanez et al. [Pinhanez, Kjeldsen et al. 2002] to implement a non-invasive vision-based finger interaction system. 2.3.6 Projected Display Geometrical Calibration Video projectors are designed to project light in a direction orthogonal to a planar projection surface [Pinhanez, Kjeldsen et al. 2002]. This location produces the best image on the screen, with no geometrical distortions. Projector or surface rotation away from this ideal orientation causes geometrical distortion of the projected image, often called “keystoning”. Similarly, non-planar display surfaces also distort the projection. Many projectors allow compensation for small amounts of geometric distortion by digitally warping the image before projection so that it appears rectangular when displayed on the projection surface. A few projectors include an inclinometer sensor to perform automatic vertical distortion correction when a projector is rotated vertically [Raskar, VanBaar et al. 2005], however, an inclinometer on its own cannot correct for distortion caused by rotation in the horizontal plane or distortion due to rotation of the projection surface. Consequently, the correction must typically be specified manually by the user, using the remote control. Figure 2.26 (left) On-axis projection gives a rectangular image, (right) Geometric distortion due to horizontal projector rotation when projecting onto a planar screen It is possible to implement automatic geometric distortion correction in software by dynamically pre-warping the image to project. It is always possible to correct for oblique projection on simple geometries, as long as the projection surface does not have a shape that occludes the projected light [Pinhanez 2001]. This enables projectors to be mounted anywhere in the environment with respect to the projection surface. For example, in a meeting scenario the projector could be located at the side of the room, where a presenter is less likely to cast shadows on the screen and where the projector does not occlude the audience’s view [Sukthankar, Stockton et al. 2001]. 26 2.3 PROJECTOR-BASED AUGMENTED REALITY Many different calibration strategies have been proposed to calculate the geometric correction required to allow an image to look un-distorted to an observer. 
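As an illustration of software pre-warping, the following sketch (Python with OpenCV) warps an image so that, once projected obliquely, its corners land at positions that make it appear rectangular on the surface. The corner coordinates and the 1024x768 projector resolution are assumed values; in practice the target corners are obtained from one of the calibration methods discussed next.

import cv2
import numpy as np

# Corner correspondences between the source image and where those corners must
# land in the projector frame buffer so the display appears rectangular on the
# surface after oblique projection (illustrative values only).
image_corners = np.float32([[0, 0], [1023, 0], [1023, 767], [0, 767]])
prewarped_corners = np.float32([[80, 40], [990, 10], [950, 700], [60, 760]])

M = cv2.getPerspectiveTransform(image_corners, prewarped_corners)

def prewarp(frame):
    """Pre-warp a frame before sending it to the projector (1024x768 assumed)."""
    return cv2.warpPerspective(frame, M, (1024, 768))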
Yang and Welch proposed an initial classification of calibration methods [Yang and Welch 2001], which was extended by Borkowski et al. [Borkowski, Riff et al. 2003]. The extended classification separates the methods based on whether they are on-line (real-time) calibration techniques performed while a system is operational, whether the system actively emits light or makes other active emissions to detect the surface, and whether the calibration is only valid for planar 2D surfaces or can be used for three dimensional surfaces. However, Borkowski et al. incorrectly classify the Laser Scanner, Structured Light and Stereo Vision calibrations and omit distortion correction based on projective texture mapping techniques. Our corrected classification is shown in Figure 2.27. Figure 2.27 Classification of projector-camera system geometrical calibration methods, based on [Borkowski, Riff et al. 2003] Of the geometrical calibration methods, the homography based and structured light techniques only require a projector and camera to be performed, however, the homography method is limited to planar surfaces. The projective texture mapping requires a-priori knowledge of the projection surface geometry and projector-camera pose (but can be used on-line when this is known), while the remaining techniques require additional hardware to enable calibration. The most common geometrical correction methods are discussed in more detail below: 2.3.6.1 Projective Texture Mapping Geometrically, projection is the inverse of camera viewing [Pinhanez 2001]. Hence, it is possible to model the projection by creating a three-dimensional virtual representation of the projection surface and locating a virtual camera at the same position and orientation as the projector in the real world. By creating the virtual camera with identical optical parameters to the real projector, the image seen by the camera will exactly replicate what the projector sees in the real world. Distortion correction is then performed automatically by projective texture-mapping of an image onto a virtual surface with the orientation and size that it would appear in the real world. The virtual camera now views an image that will appear geometrically correct when projected. 27 2.3 PROJECTOR-BASED AUGMENTED REALITY Figure 2.28 Model-based distortion correction based on known surface geometry and pose relative to the projector [Pinhanez 2001] This approach is conceptually simple, and can be easily implemented using 3D computer graphics hardware for real time performance when dynamically changing projector and display surface pose. However, the main weaknesses with this approach are that it requires an exact model of the environment to be created, and the exact pose of the projector relative to the display surface to be detected. This restricts its use to environments where the display geometry is known and tracked. 2.3.6.2 Planar Homography Calibration Pinhanez et al. first developed a mathematical method for geometric calibration of projector-camera displays on planar screens using homographies [Pinhanez, Nielsen et al. 1999]. A homography is projective transform that links two planar surfaces in threedimensional space. It is an exact mathematical description of the rotation, translation and scaling that maps each point between the two planes. To perform the calibration Sukthankar et al. 
first project a series of dots onto the screen and detect the location of the dots in the camera image, calculating the projector-to-camera transformation homography (T) [Sukthankar, Stockton et al. 2001]. By then detecting the location of the screen in the camera image (either using fiducial markers at the corners or the screen border edges) Sukthankar et al. calculate the camera-to-screen transformation (C). This allows the full projector-image-to-screen transformation (P) to be recovered ($P = C^{-1}T$). These transformations were used to pre-warp an image before projection to align correctly with the screen, as shown in Figure 2.30.

Figure 2.29 Projecting dots onto a planar screen to recover the projector-camera homography

For example, to calculate the projector-camera homography we would first establish correspondences between the projector and camera (such as projecting and detecting the centre of the dots in Figure 2.29), then use the following formulas to calculate the conversion between coordinates in the projector image (u, v) and camera image coordinates (u', v'):

\[ u = \frac{a u' + b v' + c}{g u' + h v' + 1}, \qquad v = \frac{d u' + e v' + f}{g u' + h v' + 1} \tag{2.1} \]

These equations can be rewritten in homogeneous coordinates (with equality up to scale) as:

\[ \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} \cong \begin{pmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & 1 \end{pmatrix} \begin{pmatrix} u' \\ v' \\ 1 \end{pmatrix} \tag{2.2} \]

Given n non-collinear corresponding points on both the image and display surface (where n ≥ 4), the correspondences can be reformatted as the simultaneous equations in equation 2.3. By assuming the constraint $h_{33} = 1$, the coefficients ($h_{11}$ to $h_{32}$) can be determined using Gaussian elimination if n = 4, or linear least-squares with Singular Value Decomposition (SVD) when n > 4.

\[
\begin{pmatrix}
u_1' & v_1' & 1 & 0 & 0 & 0 & -u_1' u_1 & -v_1' u_1 & -u_1 \\
0 & 0 & 0 & u_1' & v_1' & 1 & -u_1' v_1 & -v_1' v_1 & -v_1 \\
\vdots & & & & & & & & \vdots \\
u_n' & v_n' & 1 & 0 & 0 & 0 & -u_n' u_n & -v_n' u_n & -u_n \\
0 & 0 & 0 & u_n' & v_n' & 1 & -u_n' v_n & -v_n' v_n & -v_n
\end{pmatrix}
\begin{pmatrix} h_{11} \\ h_{12} \\ h_{13} \\ h_{21} \\ h_{22} \\ h_{23} \\ h_{31} \\ h_{32} \\ 1 \end{pmatrix} = 0 \tag{2.3}
\]

Figure 2.30 (left) Creating the projector-camera homography, (right) Using the combined projector-screen transformation to project an undistorted image [Sukthankar, Stockton et al. 2001]

The calibration performed by Sukthankar et al. was for a static system. However, assuming the projector-camera relationship is fixed (e.g. the camera is attached to the projector) and the camera can detect and track a minimum of 4 non-collinear points on the target planar surface, this method works in real time. Borkowski et al. use this method to demonstrate real-time projection geometrically aligned with their portable display screen, by determining the 4 corners of its black borders [Borkowski, Letessier et al. 2004]. In static environments where projection is unconstrained and no visually defined screen area exists, Borkowski et al. also proposed an off-line pre-calibration of the environment [Borkowski, Letessier et al. 2004]. This calibration enables a projector-camera system and off-axis camera to detect planar surfaces and store them for future use. A world model is built by scanning the room while dynamically projecting a series of straight lines. Lines which have a curved appearance in the camera image indicate a non-planar surface, while a step-like discontinuity is assumed to indicate the edge of a surface. Once all possible planar areas are detected, the projector can establish planar homographies and calculate the distance to the display surface.
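The least-squares solution of equation (2.3) can be sketched as follows in Python with NumPy. This uses the common SVD formulation over all nine homography entries, with a final normalisation so that h33 = 1; up to scale this is equivalent to the formulation above, though it is a sketch rather than the original implementation.

import numpy as np

def estimate_homography(camera_pts, projector_pts):
    """Estimate the 3x3 homography H mapping camera points (u', v') to
    projector points (u, v) from n >= 4 correspondences, via SVD."""
    assert len(camera_pts) == len(projector_pts) >= 4
    A = []
    for (up, vp), (u, v) in zip(camera_pts, projector_pts):
        A.append([up, vp, 1, 0, 0, 0, -up * u, -vp * u, -u])
        A.append([0, 0, 0, up, vp, 1, -up * v, -vp * v, -v])
    A = np.asarray(A, dtype=float)
    _, _, Vt = np.linalg.svd(A)    # solution is the right singular vector of the smallest singular value
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]             # normalise so that h33 = 1

def apply_homography(H, pt):
    """Map a camera point through H and divide by the homogeneous coordinate."""
    x = H @ np.array([pt[0], pt[1], 1.0])
    return x[:2] / x[2]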
The homography approach has been extended to support automatic geometric calibration of multi-projector display walls [Chen, Sukthankar et al. 2002; Brown, Majumder et al. 2005; Bhasker, Sinha et al. 2006] and multi-planar surfaces [Ashdown, Flagg et al. 2004]. 2.3.6.3 Curved and Unknown Surface Calibration with Structured Light Curved or spherical projection surfaces can be corrected using the off-line triangulation technique and quadric calibration method proposed by Raskar et al. [Raskar, VanBaar et al. 2005]. This approach projects a sequence of structured light patterns (see Figure 2.31, left) and detects left-right correspondences using a calibrated stereo camera system to triangulate the display surface geometry. A Quadric curve fitting approach is used with a minimum of 9 correspondences (compared to 4 for a homography) to enable geometric correction, as shown in Figure 2.31 (centre). Figure 2.31 (left) Gray-code structured light patterns (centre) Geometrically corrected multi-projector display on curved display [Raskar, VanBaar et al. 2005], (right) Automatic Projector Display Surface Estimation Using Every-Day Imagery [Yang and Welch 2001] As discussed by Raskar et al., the geometric correction of non-planar geometry requires a tracked observer to provide the correct undistorted view. It is always possible to generate a geometric distortion correction for an arbitrary viewpoint given known surface geometry and calibrated cameras and projectors [Park, Lee et al. 2006]. Hence, if the observer’s viewpoint is not tracked it is usually assumed their viewpoint is perpendicular to the projection surface. To calibrate completely unknown geometry Park et al. [Park, Lee et al. 2006] propose an off-line calibration technique which projects a series of structured light images while using triangulation between the projector and a single camera (by assuming the projector is equivalent to another camera). This approach relies on both a calibrated projector and camera, but can recover a grid of points on an arbitrary surface. The surface is modelled from the recovered 3D points by assuming it is piece-wise planar in the small (i.e. between neighbouring points). The recovered geometry mesh is then used to warp the projected image using homographies to correct for geometric distortion. Yang et al. use an iterative on-line approach to recovering display surface geometry by making use of the closed-loop of a projector-camera system [Yang and Welch 2001]. 30 2.3 PROJECTOR-BASED AUGMENTED REALITY The system initially assumes an orthogonal planar screen, then detects features in each image before projection, observes these features when projected and continually refines an estimate of the display surface geometry based on the difference in location using a Kalman filter. Over a number of frames this will converge to approximate the real surface, as shown in Figure 2.31 (right). This approach is in effect a structured light approach; however, the only structure comes from the projected content. Johnson and Fuchs present a similar on-line approach for planar surfaces with realtime feature detection and tracking [Johnson and Fuchs 2007]. Their approach additionally predicts the pose of the projector at each frame, allowing use of a mobile projector with a static off-axis camera. Another on-line structured light approach using explicit patterns was proposed by Park et al. [Park, Lee et al. 2007]. 
Here the patterns were embedded into the display imagery in a way that is imperceptible to observers, by projecting a pattern in one frame, then its inverse in the next frame. The pattern is adapted to the colour distribution and spatial brightness variation of the original projected content to make it further imperceptible. This approach requires a camera synchronised to the projector for the pattern to be visible to the detection system, and creates some visible flicker for humans with a typical 60Hz projector refresh rate. Grundhöfer et al. report the patterns are still visible with fast eye movements above 75Hz and propose a new approach which creates patterns that are below the human-perceivable threshold, but still detectable by the camera. This method estimates the threshold based on a human contrast sensitivity function and modifies the pattern contrast for each frame, based on the brightness and spatial frequencies of the projected image and original pattern [Grundhöfer, Seeger et al. 2007]. The temporal-coded structured light approaches are not useful in dynamic environments, as they require several projected frames to recover the geometry of the target display surface and hence cannot easily be used with mobile display surfaces. With spatially-coded approaches it is possible to recover geometry with a single projected pattern; however, these approaches create lower-resolution geometry models and tend to be much less robust [Salvi, Pages et al. 2004]. Additionally, while the structured light approaches can detect and model all geometry inside the projector’s field of view, they cannot robustly identify and segment an arbitrary 3D object from the background on their own. Hence, they would have to be used with other computer vision methods when we want to detect a particular object or project at a specific location on an object. Instead, in our work we assume each object carries knowledge of its 3D model, which removes the need for on-line geometry recovery and allows projection at specific locations on the object. Once objects are detected and their pose calculated, the known surface geometry is used by the projector system to dynamically configure which calibration method is used. For example, with planar surfaces we could use the homography or projective texture mapping methods, and for curved surfaces the quadric calibration proposed by Raskar et al. [Raskar, VanBaar et al. 2005].

2.3.7 Projection Photometric and Colorimetric Calibration

We do not limit the type, form or appearance of objects that can be used with our Cooperative Augmentation approach. Hence, it is likely that many surfaces of everyday objects we would like to augment with displays are coloured, textured or non-planar, which is not ideal for projection. Similarly, we may want to make use of multiple projectors to create seamless displays on objects. In this section we look at different ways to address these cases by calibrating the projection to make the image more visible. Photometric and colorimetric calibration is used in two scenarios: the first is when creating a perceptually seamless display using multiple projectors, generally with a normal white projection screen. With ad-hoc multi-projector displays, the main problem encountered is that the characteristics of the individual projectors vary. For example, projectors from different manufacturers differ in terms of their photometric response, colorimetric response and uniformity.
Even among projectors of the same make and manufacturer the projector lamps vary in both brightness and colour as they age, as can be seen in the variable brightness and hue of the blue backgrounds in Figure 2.16 (left). To appear as a seamless, uniform display this necessitates calibration, however, the complexity of geometric alignment and colour calibration increases with the number of projectors. Several approaches have been presented for automatic calibration of multi-projector systems using a single camera [Raij and Pollefeys 2004; Brown, Majumder et al. 2005; Bhasker, Sinha et al. 2006]. Typically these involve measuring the luminance and chrominance properties within each projector’s image and the differences between the projectors. These intra-projector and inter-projector variations are used to calculate individual mappings to a shared brightness and colour range which can be used by all projectors to achieve a seamless display. Brown et al. found this approach provides a poor image, as the common range for all projectors can be very narrow, limiting contrast and colour gamut. Humans can detect luminance variations more easily than chrominance variations, hence Brown et al. propose trading-off chrominance uniformity for increased display quality by approximating perceptual uniformity [Brown, Majumder et al. 2005]. Figure 2.32 (left) Colour correction to project on any surface [Bimber, Emmerling et al. 2005], (right) Real-time adaptation for mobile objects [Fujii, Grossberg et al. 2005] The second scenario where photometric and colorimetric calibration has been demonstrated is to compensate for the display surface itself in the case where the surface is non-white, non-uniform or has non-lambertian reflectance. Again, a range of techniques has been proposed [Nayar, Peri et al. 2003; Grossberg, Peri et al. 2004; Bimber, Emmerling et al. 2005]. The calibration process typically includes capturing the natural surface appearance and it’s response to projected colour calibration images with a camera. The calibration techniques then modify the projected image by attempting to invert the natural colour of the object surface based on the recovered reflectance responses, making it more visible to an observer (as shown in Figure 2.32). 32 2.3 PROJECTOR-BASED AUGMENTED REALITY The methods cannot completely correct very dark or saturated surfaces, as the dynamic range of typical projectors is not sufficient to invert the natural surface appearance. While these calibration techniques assume a static environment, Fujii et al. extend [Nayar, Peri et al. 2003] by proposing a real-time adaptation approach [Fujii, Grossberg et al. 2005] by making use of the projector-camera system closed loop feedback. They use a 3x3 matrix for each pixel to encode both the colour mixing between the channels of projector and camera and the surface reflectance. The values of the matrices are determined by projecting a series of uniform colour images and capturing the images of the illuminated surfaces with the camera. The compensation image is computed by multiplying the inverse of the colour mixing matrices with the RGB colour of the corresponding pixels in the input image. Following initial calibration, the method corrects the projection image based just on the image captured by the camera, optimising the projected image to look as much like the input image as possible, when seen by the camera. 
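The core of the per-pixel compensation just described can be sketched as follows (Python with NumPy), assuming the 3x3 colour mixing matrices have already been estimated by projecting uniform colour images and capturing the surface response. This is an illustrative sketch of the principle, not the published implementation.

import numpy as np

def compensation_image(desired_rgb, mixing):
    """Per-pixel radiometric compensation sketch.

    desired_rgb : H x W x 3 float image we want the camera to observe.
    mixing      : H x W x 3 x 3 per-pixel colour mixing matrices (assumed
                  to have been estimated in a prior calibration step).
    Returns the image to send to the projector, clipped to the displayable range.
    """
    inv = np.linalg.inv(mixing)                        # invert each 3x3 matrix (batched)
    # Multiply each pixel's inverse mixing matrix with its desired RGB value.
    proj = np.einsum('hwij,hwj->hwi', inv, desired_rgb)
    # Very dark or saturated surfaces push values outside [0, 1]; the projector
    # cannot reproduce these, which is the dynamic-range limit noted above.
    return np.clip(proj, 0.0, 1.0)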
Although requiring a camera and projector with calibrated colour response, this approach allows dynamic changes in illumination or colour of the projection surface and a mobile object or projector-camera system, as shown in Figure 2.32 (right). Similarly, Grundhöfer and Bimber extend their work to achieve real-time correction by using the GPU for calculation, while simultaneously adjusting the content of the input images before correction to reduce the perceived visual artefacts from limited projector dynamic range [Grundhöfer and Bimber 2008]. The two scenarios presented using the calibration methods generally assume the display surfaces are planar and orthogonal to the projector. In our work this is often not the case, as objects are mobile and need not be planar. The amount of light that arrives at the projection surface depends both on the distance and angle of the surface with respect to the projector, with oblique surfaces receiving less illumination and appearing darker. Hence we also need to incorporate intensity compensation based on the orientation of the object for non-planar objects, before the radiometric and colorimetric calibration is applied. Figure 2.33 (left and centre) Uneven illumination on a non-planar surface, (right) uncompensated and intensity compensated image [Park, Lee et al. 2006] Bimber et al. propose a formalisation of this compensation based on square distance to the surface and the angle of the projected light [Bimber, Coriand et al. 2005], however, generally we do not care about the absolute brightness as long as the projection is uniform. Hence, in our work, a simplified method proposed by Park et al. [Park, Lee et al. 2006] can be used as we know the surface geometry and pose of an object relative to the projector. 2.3.8 Issues with Projector-based Augmented Reality Projector-based AR faces issues which do not occur with other display technologies, for example, the display brightness must overcome the ambient illumination, and that 33 2.3 PROJECTOR-BASED AUGMENTED REALITY image warping for geometric correction reduces the final achieved display resolution. However, there are three main problems which occur when using projection: 2.3.8.1 Display Occlusion The most obvious problem with front-projection is occlusion of the projection, either by other objects in the environment, or by an interacting user. It was observed both by Pinhanez et al. and Summet et al. [Pinhanez, Kjeldsen et al. 2002; Summet, Flagg et al. 2007] that users had to develop coping strategies for occlusions with front projection, moving their bodies and limbs to minimise occlusion. Summet et al. propose a solution using multiple redundant projectors in the environment and an off-axis camera to detect occlusions of the display. The projectors are separated spatially but their frustums overlap so that the projected image from one projector is able to “fill-in” areas occluded in the second projector. In the active system [Summet, Flagg et al. 2007], the “Virtual Rear Projection” (VRP) method can also suppress the projected light on dynamically occluding users and objects, removing the blinding effect when users look towards the projector. While this approach works well when front projection is required, experiments reported that users still preferred rear projection systems. The solution also requires multiple projector systems, doubling the equipment costs for each display. Real-time correction was also demonstrated using the GPU for the calculations [Flagg, Summet et al. 2005]. 
Figure 2.34 VRP used to overcome occluding light shadows [Summet, Flagg et al. 2007] Both the VRP shadow removal techniques and multi-projector blending algorithms to blend of overlapped projections are only suited to static projectors and display surfaces. When projecting on mobile objects, any mis-registration between the projections (for example, due to error in object pose calculation or differing detection, rendering and projection lag times in multiple projector-camera systems) creates a blurred or unreadable projection. Ehnes et al. propose another approach when using multiple steerable projectors and mobile objects [Ehnes, Hirota et al. 2005]. Similar to the VRP approach, multiple projectors have objects within their field of view simultaneously. However, instead of overlapped projection Ehnes et al. propose using a central application server, which assigns display rights based on the first projector-camera system to detect the object. A camera-reported quality measure (such as the distance and angle of the object surface with respect to the camera) is then used to determine when to change the display to another system [Ehnes and Hirose 2006]. This approach does not model occlusion directly, however, as the steerable projectors use an on-axis camera, any occlusion of the projection would also occlude the object in the camera, causing a low quality measurement. The display would then automatically switch to another projector with the object in view. 34 2.3 PROJECTOR-BASED AUGMENTED REALITY 2.3.8.2 Projection Focus Distance Typical projectors can only focus on a single focal plane located at a constant distance from the projector. Hence, projecting images onto non-planar surfaces, highly oblique surfaces or multiple surfaces at different distances causes blurring. If projectors possess computer controlled powered focus it is possible to calibrate the focus-distance relationship (as discussed in Appendix A.11) and dynamically focus on objects when their distance to the projector is known. Recent work by Bimber et al. proposes using multiple projectors focused at different distances in static systems [Bimber and Emmerling 2006], however, future LASER projectors will not have this problem at all, appearing in focus at all distances. 2.3.8.3 Acceptability of Display Although it is possible to project interactive interfaces onto objects in an environment, if users do not realise the objects and displays are interactive, then there is no benefit beyond a traditional display. At a demonstration of the Everywhere Display steerable projector Kjeldsen et al. found that users failed to recognise a paint can had an interactive projected button, despite having just successfully touched two similar buttons on a wall and table [Kjeldsen, Pinhanez et al. 2002]. To investigate this further, Podlaseck et al. performed a series of user studies to investigate the acceptability of projected interfaces on everyday objects [Podlaseck, Pinhanez et al. 2003]. These studies propose the possibility that people have difficulty cognitively perceiving high functional-fixedness objects as interfaces. One illustrative example was a red virtual button projected onto the surface of a glass of milk, as seen in Figure 2.35 (left). At the end of the first experiment 10% of the subjects could not even remember seeing a glass of milk, which Podlaseck et al. believes was because they had mentally subtracted it from task at hand. 
Similar results were presented in a different context by Rensink as inattentional blindness (where observers closely watching a particular object fail to see other unexpected objects) [Rensink 2000]. Podlaseck et al. theorise that users may effectively be “blind” to interfaces on some objects. Consequently, some mechanism must be employed to increase their visibility and highlight an application’s connection to the object. This presents a challenge to the direct integration of information with objects in the real world, suggesting that the connection of an interface to an object may need to be made more salient and explicit.

Figure 2.35 (left) Interactive buttons projected on a variety of everyday surfaces, (right) Dynamic Shader Lamps projecting on mobile tracked object [Bandyopadhyay, Raskar et al. 2001]

Weiser and Brown write: “an affordance is a relationship between an object in the world and the intentions, perceptions and capabilities of a person. The side of a door that only pushes out affords this action by offering a flat pushplate.” [Weiser and Brown 1996] Similarly, hints to the user could be woven into the display to suggest interaction modalities, such as a “touch me” label on a touch-sensitive button. However, such explicit messages are at odds with the design goal of many interfaces aiming to be more natural and intuitive. Instead, subtle cues such as animation, or interactive elements positioned specifically on objects close to a user, could be used.

2.3.9 Mobile Objects

There has been much work on displaying projected content onto static areas of the environment [Pinhanez 2001], or static objects [Raskar, Welch et al. 1999; Butz, Schneider et al. 2004; Bimber, Coriand et al. 2005]. Similarly, there has been work on augmenting planar objects in very constrained scenarios such as “digital desk” scenarios [Wellner 1993; Robertson and Robinson 1999; Koike, Sato et al. 2001]. However, there has been relatively little work on augmenting mobile objects with projected displays in more unconstrained scenarios. Morishima et al. first augmented mobile objects with projected displays [Morishima, Yotsukura et al. 2000]. Here pre-recorded videos of faces speaking were projected onto a planar mask worn by a live actor. The mask was tracked by a camera and a homography calibration used to warp the video to correct for geometrical distortion. The projector was located in a shopping trolley pushed by the actor; hence the possible range of movement of the actor was limited. Increased working ranges have been demonstrated by using steerable projector-camera systems to track mobile objects, such as the Personal Interaction Panel discussed in section 2.3.4. Similarly, Ehnes et al. use visual markers to track objects [Ehnes, Hirota et al. 2004]. The use of a steerable projector allowed tracking to occur anywhere within its field of view and information or graphics to be projected at locations relative to the marker, as shown in Figure 2.39 (right). Bandyopadhyay et al. investigated augmenting mobile objects with projected displays [Bandyopadhyay, Raskar et al. 2001]. Here 3D objects with planar surfaces were equipped with a magnetic and infra-red tracking system. Static projectors were used to augment the objects in real-time, as shown in Figure 2.35 (right).
These projectors each required an initial calibration to match 3D points in the real world to 2D pixels in the projector image and hence, calculate the transformations between projector and tracking system’s coordinate system. A virtual 3D painting application was used to demonstrate the concepts of annotation and visualisation with projected displays. However, this work suffered from two key problems due to the tracking systems used. The first problem was that latency was around 110ms, which led to a 1cm distance lag even when the object was moved at very slow speeds. The Polhemus Fastrack magnetic tracker caused a second problem by limiting working range to 50-75cm. In our approach we use a projector-camera system with a vision-based object detection system, allowing augmentation of objects anywhere within the field of view of the system at camera frame-rates, without relying on separate tracking hardware. 36 2.4 DETECTION AND TRACKING OF OBJECTS 2.4 Detection and Tracking of Objects To augment objects with projection, we must first calculate their location and orientation (i.e. their pose) with respect to the projector-camera system. There are many approaches to detecting and tracking camera or object pose in AR literature, for example, mechanical, magnetic, ultrasound, inertial, vision-based solutions, and hybrid systems combining different approaches [Azuma 1997; Rolland, Baillot et al. 2001]. Each approach has both advantages and disadvantages. For example, mechanical trackers are accurate but require a direct attachment, hence limit working volume. Magnetic trackers again have limited working volume and are additionally affected by metal objects in the environment. Ultrasound is susceptible to noise (for example, from jangling keys) and drift from temperature variations. Similarly, inertial systems (gyros and accelerometers) are expensive and again subject to drift over time [Lepetit and Fua 2005]. In contrast, while vision-based detection and tracking has a higher processing cost, it generally provides more accurate AR, as the pose is calculated directly from features in the image to be augmented. Vision-based detection and tracking in AR is split into fiducial marker-based and marker-less approaches. The marker-based approaches present a robust, low-cost solution to 3D pose estimation in real-time. They require no initialisation by hand, are robust to marker occlusion and provide a pose accurate enough for AR in good illumination conditions. However, they require changing the appearance of objects or engineering the scene to achieve this detection. In contrast, while markerless approaches use the natural appearance of an object, they typically require an off-line training phase to learn the object appearance. The processing cost is also generally higher than marker-based approaches, as detection algorithms tend to be more complex due to the difficulty in achieving a robust detection. 2.5 Fiducial Marker Detection and Tracking Fiducial marker based visual tracking approaches are widely used in the AR community, especially in experimental prototypes due to the availability of open-source tracking toolkits (such as ARToolkit [Kato and Billinghurst 1999] and ARTag [Fiala 2005]) allowing easy development. Many toolkits exist, typically each with their own marker appearance (such as rectangles, circles, semi-circles, and chessboard-like patterns), as shown in Figure 2.36 (left). Readers are referred to [Zhang, Fronz et al. 2002] for an in-depth comparison. 
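As a brief illustration of how such toolkits are typically driven, the following sketch uses the ArUco module from OpenCV's contrib package (pre-4.7 API) as a stand-in; ARToolkit and ARTag expose conceptually similar detect-and-decode pipelines, but this is not their API.

import cv2

# Dictionary of 50 distinct 4x4 binary markers (an assumed, illustrative choice).
dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)

def detect_markers(gray):
    """Return marker corner coordinates and decoded IDs found in a greyscale frame."""
    corners, ids, rejected = cv2.aruco.detectMarkers(gray, dictionary)
    return corners, ids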
The fiducial markers themselves are printed planar patterns (similar to 2D barcodes) that a computer vision system can easily detect, track and decode, enabling information to be associated with the marker. Augmented Reality systems make use of this functionality to overlay computer-generated information on top of detected markers, using the detected marker pose for registration of the information with the real world. Traditionally a user wears a head-mounted display (HMD), or carries a portable viewer, to see the overlaid computer graphics, as shown in Figure 2.36 (right).

Figure 2.36 (left) Examples of different Fiducial Marker systems [Fiala 2005], (right) Using ARToolkit Markers with a HMD to view a virtual character [Kato and Billinghurst 1999]

The augmentation process for ARToolkit is shown in Figure 2.37. Other marker systems employ a conceptually similar process, but may detect the marker outline using edge-based techniques for robustness, and decode information encoded in the marker rather than using cross-correlation to identify the marker [Fiala 2005].

Figure 2.37 ARToolkit Recognition and Overlay method [Kato and Billinghurst 1999]

Unfortunately, fiducial marker approaches exhibit six main limitations:

1. Minimum Marker Size in Camera Images. The distance from the camera at which a marker can be detected decreases with both decreasing marker size and decreasing camera angular resolution. Hence, small markers, low-resolution cameras or wide Field-Of-View (FOV) cameras will typically have a small tracking range. Fiala stated that for ARToolkit markers, a 75% marker recognition accuracy required a minimum marker edge size of 13 pixels in the image with a greyscale VGA camera and 18 pixels with a colour camera [Fiala 2005].

2. Poor Location Accuracy. Malbezin et al. characterised the accuracy of ARToolkit marker tracking with a calibrated camera in room-sized environments, between 1m and 2.5m distance from the camera [Malbezin, Piekarski et al. 2002]. It was found that the reported position error increased with distance, from 6 to 12% in the marker X axis and 9 to 18% in the marker Y axis, with the accuracy being further affected by the angle of the marker to the camera. They conclude that once calibrated, any systematic inaccuracy could be corrected by applying a filter to the detection results; however, the filter would require explicit calibration for every camera and lens combination used.

3. Limited Number of Useful Markers. Some fiducial marker toolkits encode an ID to differentiate between different markers (e.g. the ARTag toolkit [Fiala 2005]). The limit on the number of markers that can be detected is the number of visually distinct markers that can exist with the encoding pattern used. For example, ARTag has a library of 2002 rectangular markers with an encoded 11-bit unique numerical identity (ID). To increase the number of unique objects that can exist in the world, the number of bits allocated to the ID (and hence the physical marker size) must be increased. Similarly, circular marker systems such as TRIP [Ipiña, Mendonça et al. 2002] are generally limited only by their physical size – extra rings with more bits can always be added to the outside of the marker. ARToolkit takes a different approach, with no fixed encoding scheme. This allows users to generate unique and visually distinct patterns inside the rectangular black border used for initial detection.
However, the user must then pre-calibrate the system and explicitly specify which patterns the system should load and search for at runtime. 4. Marker Occlusion. Different marker toolkits exhibit different robustness to partial occlusion of the markers [Fiala 2005]. For example, as shown in Figure 2.38 (left and centre), ARToolkit detection will fail if even a small portion of the external marker border is obscured. In contrast, ARTag detection remains robust as long as three sides remain visible. This issue with ARToolkit can be used as a feature in multi-marker tracking scenarios, allowing failure in detection for one marker to be inferred as deliberate user interaction, for example, finger activation of a virtual button. Figure 2.38 (left and centre) Occlusion of ARToolkit marker prevents tracking, (right) Occlusion of ARTag marker [Fiala 2005] 5. Marker Illumination. As shown in Figure 2.39 (left), some AR toolkits lack robustness under varying illumination conditions due to use of a global image threshold [Naimark and Foxlin 2002]. Solutions to this include local or adaptive thresholding, or the use of more robust marker systems, such as ARTag, which use edge detection methods [Fiala 2005]. This is a serious issue for our work, as any projector light overlapping a marker may prevent detection. 6. Visual Intrusiveness. As adding fiducial markers on objects is visually intrusive, Park and Park propose using Infra Red (IR) absorbing markers [Park and Park 2004]. These markers appear black to an IR camera in environments illuminated with IR light, but are invisible to the human eye. Similarly, Nakazato et al. and Santos et. al demonstrate IR retro-reflective (but visually translucent) markers [Nakazato, Kanabara et al. 2004; Santos, Stork et al. 2006]. While such techniques remove visual intrusion, they still share the limitations of visual markers and additionally rely on the presence of IR light sources for detection. 39 2.6 NATURAL APPEARANCE DETECTION AND TRACKING Figure 2.39 (left) AR Toolkit fails to detect markers under different illumination conditions in the same frame [Fiala 2005], (right) Ehnes et al. track ARToolkit markers with a steerable projector and project on modelled surfaces in the marker area [Ehnes, Hirota et al. 2004] With a view to augmentation in ubiquitous computing environments the marker based approaches are not ideal, as they require physical modification or engineering an objects external appearance to enable detection. To achieve robust detection at distance requires a large marker, which in turn reduces our available projection area, as illumination change on the marker due to projection may prevent detection. Additionally, use of markers is problematic for objects with non-planar surfaces, and 3D objects typically require multiple markers (e.g. a cube requires at least one marker per surface = 6) to achieve detection in all poses. Hence, our approach in this work uses the natural appearance of an object for markerless detection, allowing augmentation with displays without permanently changing an object’s appearance. 2.6 Natural Appearance Detection and Tracking The use of natural appearance for marker-less detection in AR is generally a widebaseline matching problem, where natural features in an unknown scene have to be matched to a model of the object pre-built from training images, despite changing viewpoint and illumination conditions. In this work we consider only monocular (i.e. single camera) model-based detection and tracking approaches. 
We also distinguish between detection and tracking. Detection is identifying a known object in an unknown scene, whereas tracking is typically limited to following a previously detected object in an unknown scene. Tracking algorithms either require an initial detection step, or require that the object to be tracked is manually manipulated into a pose close to the starting pose assumed by the tracker. In this thesis we concentrate on object detection, however to understand the difference more clearly we first look briefly at tracking approaches before investigating detection in more detail. The reader is directed to [Lepetit and Fua 2005] for a full survey of 3D model-based detection and tracking. 2.6.1 Tracking Recursive tracking algorithms only face a narrow-baseline matching problem, as they typically just consider the last few camera frames to predict the current pose. Over such small timescales any object or camera movement is generally small and object appearance is not likely to change. However, object or camera occlusion, fast motion, or appearance and illumination changes between consecutive frames can cause loss of tracking, requiring re-initialisation either by another detection step, or manual realignment. Recursive tracking in general is also susceptible to error accumulation 40 2.6 NATURAL APPEARANCE DETECTION AND TRACKING [Lepetit and Fua 2005]. The advantages of tracking approaches are that a predictorcorrector framework constrains the search for matching features between frames based on their recent location, reducing image processing requirements and hence allowing faster, more real-time performance. The two most common frameworks on which trackers are based are the Kalman filter [Kalman 1960] and Particle filter [Isard and Blake 1998]. Both these frameworks use a Bayesian formulation to estimate the probability density of successive poses in the space of all possible camera or object poses. Kalman filters only consider a Gaussian distribution, whereas Particle Filters use a more general representation with a set of weighted pose hypotheses. Each particle is an individual hypothesis, allowing multiple hypotheses to be supported simultaneously. However, to describe complex motion a large set of particles is required, with consequent high processing cost. Particle filters are also more prone to jitter in the predicted pose, which can be smoothed using a filter [Isard and Blake 1998]. Objects with strong edge contours also allow use of edge tracking methods such as the popular RAPiD algorithm [Harris 1993]. RAPiD based algorithms generally project lines from the object’s known 3D model into camera image coordinates, based on an estimate of the object’s pose, as shown in Figure 2.40 (a). At intervals along visible lines a number of control points are generated (b). A 1D search is performed at each control point for edges in the camera image, perpendicular to the projected lines (c). The 3D motion of the object between consecutive frames can be calculated from the 2D displacement of the control points and used to predict both the pose in the next frame and hence, which control points will be visible [Drummond and Cipolla 2002; Klein and Drummond 2003]. 
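The 1D control-point search at the heart of RAPiD-style trackers can be sketched as follows (Python with NumPy); the search range and the simple gradient measure are illustrative choices, not those of a specific published tracker.

import numpy as np

def search_edge_along_normal(gray, point, normal, search_range=10):
    """1D search for the strongest image edge perpendicular to a projected
    model line, in the spirit of the RAPiD control-point step described above.
    `gray` is a float greyscale image, `point` the control point in pixels,
    `normal` a unit vector perpendicular to the projected line."""
    offsets, intensities = [], []
    for d in range(-search_range, search_range + 1):
        x = int(round(point[0] + d * normal[0]))
        y = int(round(point[1] + d * normal[1]))
        if 0 <= y < gray.shape[0] and 0 <= x < gray.shape[1]:
            offsets.append(d)
            intensities.append(gray[y, x])
    # Edge strength approximated by the difference between neighbouring samples.
    grad = np.abs(np.diff(np.asarray(intensities)))
    best = int(np.argmax(grad))
    return offsets[best]   # approximate 1D displacement of the control point to the edge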
Figure 2.40 (left) Drummond and Cipolla’s RAPiD-like algorithm, (right) robustness to partial occlusion [Drummond and Cipolla 2002] The main problem with algorithms based on RAPiD is their lack of robustness; as incorrectly matched edges from occlusion, shadow, object texture or background clutter cause incorrect pose computation. Hence, a number of extensions have been proposed to make RAPiD more robust, such as grouping control points to form primitives (such as a line or quadrilateral), the use of robust estimators such as RANSAC to detect outliers or integration of RAPiD into a Kalman filter [Lepetit and Fua 2005]. Approaches based on extracting and matching line segments to a 3D model are also possible, trading-off generality for robustness [Deriche and Faugeras 1990; Lowe 1992; Koller, Danilidis et al. 1993]. However, the edge based methods are generally fast, simple, robust to changing illumination and object scale. Their main problem is failure due to mismatching when backgrounds become cluttered or when rapid change in pose occurs. 41 2.6 NATURAL APPEARANCE DETECTION AND TRACKING 2.6.2 Detection There are several challenges when detecting objects using their natural appearance. For example, objects themselves have widely differing appearances, and we need to detect them over a large scale range and with the object in any pose. Many approaches have been proposed using cues of an objects appearance, such as colour, texture, shape and features on the object. A selection of popular approaches to detecting these cues are described below: 2.6.2.1 Colour Colour is a powerful cue for humans, used in many ways everyday in the real-world; for example, in traffic lights, as identifying features in man-made products, in advertisements or warning signs. In computer vision, colour histograms, first proposed by Swain and Ballard [Swain and Ballard 1991], have been shown to be invariant to rotation and robust to appearance changes such as viewpoint changes, scale and partial occlusion and even shape. Hence, a 3D object can be represented using a small number of histograms, corresponding to a set of canonical views (Swain and Ballard recommend 6 views). However, in their original work Swain and Ballard reported the need for a sparse colour distribution in the histogram to distinguish different objects, which can be achieved using a high dimensional histogram. Similarly, colour histograms are illumination variant, so illumination intensity, temperature and colour will all affect the final histogram. In contrast, histogram matching techniques are generally robust, as the histogram representation uses the entire appearance of the object rather than just a small number of interest points. Histograms are created by dividing a colour-space into discrete units (bins) and filling those bins with each pixel of that colour from the source image. The result will be a set of bins which represent the approximate colour distribution in the image. Histograms can be multi-dimensional (e.g. 3D Red-Green-Blue histograms) and bins can be larger than a single value (e.g. for an 8-bit RGB colour image each bin could be 16 colour values wide, giving a 16x16x16 bin histogram). As most objects have surfaces composed of regions of similar colour, peaks will be formed in the histogram. 
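A minimal sketch of building such a colour histogram, together with the matching measures introduced next as equations (2.4) and (2.5), is shown below in Python with NumPy; the 16-bins-per-channel quantisation follows the example above.

import numpy as np

def rgb_histogram(image, bins_per_channel=16):
    """Build a normalised 16x16x16 RGB colour histogram from an H x W x 3 uint8 image."""
    pixels = image.reshape(-1, 3)
    hist, _ = np.histogramdd(pixels,
                             bins=(bins_per_channel,) * 3,
                             range=((0, 256),) * 3)
    return hist / hist.sum()

def histogram_intersection(q, v):
    """Histogram intersection (equation 2.4); higher values mean a better match."""
    return np.sum(np.minimum(q, v))

def chi_square(q, v, eps=1e-10):
    """Chi-square divergence (equation 2.5); lower values mean a better match.
    A small epsilon avoids division by zero for empty bins."""
    return np.sum((q - v) ** 2 / (q + v + eps))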
Objects can be detected by matching a colour histogram from a camera image region to a histogram from a training sample of the object, using either the histogram intersection measurement (for two histograms V and Q):

\[ \cap(Q, V) = \sum_i \min(q_i, v_i) \tag{2.4} \]

or statistical divergence measurements, such as chi-square (χ²):

\[ \chi^2(Q, V) = \sum_i \frac{(q_i - v_i)^2}{q_i + v_i} \tag{2.5} \]

With histograms a high match value will still be obtained, even if individual pixel colour matches are not exact, because the regions are well matched. Swain and Ballard also propose histogram back-projection as a way of locating an object in an image. Here the colour values in the source image are replaced by matching values from a ratio histogram calculated between the object model and image histogram. Swain and Ballard then propose convolving the result image with a mask of the estimated size of the object and extracting maxima which correspond to an expected location.

2.6.2.2 Texture

Many objects cannot be described by colour alone (for example, black objects), hence objects with visible texture on their surfaces allow a wider range of techniques to be employed. For example, template matching approaches are fast and can be easily used to track 3D objects with planar surfaces (assuming a calibrated camera), although this method is susceptible to failure with occlusion and must be explicitly trained with varying illumination to become robust to illumination change [Jurie and Dhome 2002]. The histogram approach of Swain and Ballard was generalised to multidimensional histograms of receptive fields by Schiele and Crowley to detect object texture [Schiele and Crowley 2000]. The histograms encode a statistical representation of the appearance of objects based on vectors of joint statistics of local neighbourhood operators such as image intensity Gaussian derivatives (Dx, Dy) or gradient magnitude and the local response of the Laplacian operator (Mag-Lap), as shown in Figure 2.41. Multidimensional histograms are used to provide a reliable estimate of the probability density function without being computationally expensive. Experimental results show the histograms are robust to partial occlusion of the object and are able to recognise multiple objects in cluttered scenes in real-time using the probabilistic local-appearance hashing approach proposed by Schiele and Crowley. When matching, Schiele and Crowley found the chi-square divergence function provides better detection results than intersection, with respect to appearance changes, additive Gaussian noise and blur [Schiele and Crowley 2000]. Another approach, presented by Schaffalitzky and Zisserman, develops a region descriptor for texture detection, using a class of statistical descriptors which is invariant to affine viewpoint and photometric transformations [Schaffalitzky and Zisserman 2001]. Their method was demonstrated directly in wide-baseline matching to calculate the epipolar geometry between two views despite significant changes in viewpoint; however, their work makes the assumption that all regions are planar.

Figure 2.41 (left) Two dissimilar objects and their Mag-Lap histograms corresponding to a particular viewpoint, image plane rotation and scale [Schiele and Crowley 2000], (right) Shape Context 2D log-polar histogram based on relative point locations [Belongie, Malik et al. 2002]

2.6.2.3 Shape

Object shape can be described using either a global or local shape description.
For global shape, typical approaches use PCA-based methods [Turk and Pentland 1991; Murase and Nayar 1995]. Here a global eigenspace is built from the training images, which are then projected into it. PCA reduces the dimensionality of the training image set, leaving only those features in the images that are critical for detection. To train the representation, typically a set of images of each object in different poses with varied lighting (to make it illumination invariant) is used. This image set is compressed to a low-dimensional sub-space by computing the eigenvectors and eigenvalues of the covariance matrix of the training images, and the eigenvectors with the largest eigenvalues are kept. These eigenvectors represent a manifold. When detecting an object in an unknown image, the image is projected into the eigenspace and recognised based on the manifold it lies on. The pose is determined by the location of the projection on the manifold [Murase and Nayar 1995]. PCA-based methods require the image projected to the eigenspace to be the same size as the training images, hence a scale-space must be created for scale-invariant detection [Lindeberg 1990]. Turk and Pentland also propose using a 2D Gaussian centred on the training and test images to reduce image intensity at the borders, as cluttered backgrounds negatively affect the detection [Turk and Pentland 1991]. Local shape can be detected using many methods. Here we look at methods that describe the silhouette contours of an object, as these can be directly matched to a pre-computed database of object appearances with the object in different poses. For objects with known exact 3D models this database of object appearances can be calculated directly by rendering the model in different poses and extracting the silhouette contour using an edge detection algorithm, such as the Canny algorithm [Canny 1986]. The Curvature Scale Space (CSS) algorithm is part of the MPEG-7 standard for use in 2D shape detection [F. Mokhtarian 1995]. CSS is a “multi-scale organization of the inflection points of a 2D contour in an image”, i.e. zero-crossing points on the contour are detected when the contour changes direction. The multi-scale segmentation renders the system robust to edge noise and local shape differences. This algorithm has been demonstrated in 3D object detection [F. Mokhtarian, Khalili et al. 2001] and, because the silhouettes of objects will be very similar with only small changes in pose, Mokhtarian and Abbasi also proposed a way of determining how many unique training views are required in an appearance database to maximise detection rates [F. Mokhtarian and Abbasi 2005]. This allows the pose of objects with complex and non-rotationally-symmetric geometries to be estimated by direct CSS matching against the appearance database. The Shape Context method described by Belongie et al. [Belongie, Malik et al. 2001; Belongie, Malik et al. 2002] is related to deformable templates [Aixut, Meneses et al. 2003; Felzenszwalb 2005] and has achieved success in character recognition. The contours of an object are first extracted using an edge detection algorithm and a set of points detected on the contours, either equally or randomly spaced. The shape context algorithm then captures the distribution of contour points relative to each point on the shape. As shown in Figure 2.41, a histogram of the number of contour points in each “bin” of the log-polar coordinates can be created to describe each point.
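A sketch of computing such a log-polar histogram for a single contour point is given below (Python with NumPy); the bin counts and radial limits are illustrative, and distances are simply normalised by their mean rather than by the full mean pairwise distance used by Belongie et al.

import numpy as np

def shape_context(point, contour_points, r_bins=5, theta_bins=12,
                  r_min=0.125, r_max=2.0):
    """Log-polar shape context histogram for one contour point (a sketch)."""
    d = contour_points - point                    # vectors to all other contour points
    dist = np.hypot(d[:, 0], d[:, 1])
    keep = dist > 0                               # exclude the point itself
    d, dist = d[keep], dist[keep]
    dist = dist / dist.mean()                     # crude scale normalisation
    angle = np.arctan2(d[:, 1], d[:, 0]) % (2 * np.pi)
    r_edges = np.logspace(np.log10(r_min), np.log10(r_max), r_bins + 1)
    # Points outside the radial range are lumped into the innermost/outermost bins.
    r_idx = np.clip(np.digitize(dist, r_edges) - 1, 0, r_bins - 1)
    t_idx = np.minimum((angle / (2 * np.pi) * theta_bins).astype(int), theta_bins - 1)
    hist = np.zeros((r_bins, theta_bins))
    np.add.at(hist, (r_idx, t_idx), 1)            # count points per log-polar bin
    return hist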
Descriptors are similar for homologous (corresponding) points and dissimilar for non-homologous points, hence objects can be detected by matching the log-polar histograms and generating point correspondences. As the number of points in training and test images is identical, the matching process is a linear assignment problem, with the goal of assigning the best match to each point. Here, the histograms are used to calculate and minimise the cost for each match. Belongie et al. then align training and test images using a thin-plate-spline method of point-set alignment to model the deformation required to align the two images.

2.6.2.4 Features

Another approach using the natural appearance of an object is local features. Local feature based detection algorithms aim to uniquely describe (and hence detect) an object using just a few key points. By extracting a set of interest points such as corners or blobs from training images of an object, we can use the local image area immediately surrounding each interest point to calculate a feature vector which we assume serves to uniquely describe and identify that point. A database of local features registered to a 3D model of the object is constructed off-line. At runtime, interest points are detected in the camera image. Object detection now becomes a problem of matching features between the training set and camera image by comparing feature descriptors. Once 2D image to 3D model correspondences are established they can be directly used to calculate the object’s 3D pose. Readers are referred to [Mikolajczyk and Schmid 2005; Mikolajczyk, Tuytelaars et al. 2005] for an in-depth comparison of different feature detection and descriptor algorithms.

Figure 2.42 (left) SIFT local feature based AR method, (right, top) SIFT Features detected on a mug, (right, bottom) AR teapot added to camera display [Gordon and Lowe 2004]

Local features have been demonstrated for marker-less real-time object detection and tracking [Gordon and Lowe 2004; Gordon and Lowe 2006]. Here Gordon and Lowe use an off-line metric model-building phase to initially acquire scene geometry with SIFT scale- and rotation-invariant features [Lowe 2004]. The 3D feature model can then be used on-line for near real-time detection and tracking with local features for AR, as shown in Figure 2.42. Affine viewpoint-transform invariant features which deform their shape to the local region orientation have also been proposed [Matas, Chum et al. 2002; Mikolajczyk and Schmid 2002]. The geometry and intensity based affine region trackers proposed are also invariant to linear changes in the illumination, increasing robustness for real-world environments. However, fewer features are detected as only invariant features are kept, and (due to their invariance) these features are also less discriminative, hence the possibility of mis-matches increases.

2.6.3 Multi-Cue Detection and Tracking

As objects vary significantly in their appearance, approaches based on a single cue such as just colour or shape can perform poorly in real-world environments. No single visual cue is both general enough and robust enough to cope with all possible combinations of object appearance and environment; hence, multi-cue detection systems have been proposed. The goal of multi-cue approaches is to increase robustness for detection and tracking in dynamic environments.
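As a rough illustration of the weighted, democratic integration of cues discussed below, the following sketch combines independent per-cue match scores for a single pose hypothesis; the cue names, scores and reliability weights are purely illustrative and not taken from a specific system.

import numpy as np

def fused_score(cue_scores, reliabilities):
    """Reliability-weighted sum of independent cue scores in [0, 1], so the
    failure of one cue does not veto the others."""
    w = np.asarray(reliabilities, dtype=float)
    w = w / w.sum()
    return float(np.dot(w, np.asarray(cue_scores, dtype=float)))

# Example: colour agrees strongly, the edge cue fails on a cluttered background.
score = fused_score(cue_scores=[0.9, 0.1, 0.7], reliabilities=[1.0, 1.0, 0.5])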
In theory, a combination of complementary cues leads to an enlarged working domain, while a combination of redundant cues leads to an increased reliability in detection [Spengler and Schiele 2001]. Popular cue combinations are edges and vertices of the 3D model [Hirose and Saito 2005], edges and texture [Vacchetti, Lepetit et al. 2004], colour and edges [Li and Chaumette 2004], colour and texture [Brasnett, Mihaylova et al. 2005], intensity, shape and colour [Spengler and Schiele 2001], shape, texture and depth [Giebel, Gavrila et al. 2004]. However, these combinations are fixed at runtime. Extended or unscented Kalman filters have been widely used for multi-sensor fusion [Welch and Bishop 1997; You and Neumann 2001; Foxlin and Naimark 2003]; however, Kalman filters require a good measurement model. As objects and cameras can be mobile or handheld in AR, it is difficult to define a model suitable for such unpredictable motion [VanRhijn and Mulder 2005]. Typically, multiple detection cues are fused in particle filter tracking frameworks, allowing multiple target hypotheses to exist simultaneously [Spengler and Schiele 2001; Giebel, Gavrila et al. 2004; Li and Chaumette 2004; Brasnett, Mihaylova et al. 2005]. In this case the detection of each cue is often assumed to be independent and they are democratically integrated to contribute to the overall measurement density. Hence, when different image cues are fused simultaneously, the failure of one cue will not affect the others [Li and Chaumette 2004]. Several other methods have been proposed in multi-cue tracking for AR, for example using edges and vertices of the 3D model [Hirose and Saito 2005]. Similarly, Vachetti et al., use edges of the 3D model and texture [Vacchetti, Lepetit et al. 2004]. Here, they extend the RAPiD tracker to consider multiple contour candidates instead of just the closest, to solve edge ambiguities. The correct match is then selected during pose optimization using a robust estimator. Texture is detected as Harris corners [Harris and Stephens 1988] on the surface of the object and both are fused using the method described [Vacchetti and Lepetit 2004]. This fusion makes both the registration accuracy more stable and allows the system to handle textured and un-textured objects, however, it does not take into account rapid camera or object movement. In our approach we also use multiple detection algorithms to detect smart objects, but cues are chosen at runtime by a selection step based both on the appearance knowledge contained within a smart object and the object’s context (such as the background). 2.6.4 Camera Model Understanding how cameras and projectors are modelled is important to understand how an object’s pose is calculated and how projectors and cameras are calibrated in this thesis. In this work we use a standard pinhole camera model for both projectors and cameras, formulating the projection of light rays between 3D space and the image plane of the camera or projector. We assume this projection is a perspective projection. As can be seen in Figure 2.43 (left), the camera image plane (where the projection of 3D points is formed) is modelled as a 2D plane in the X,Y axes with the Z axis towards the object. P(x,y) is the principle point, where the Z axis (representing the optical axis of the camera) intersects the image plane. 
Figure 2.43 (left) Pinhole perspective camera model, (right) Checkerboard pattern for camera calibration with overlaid pattern coordinate system
The 3D coordinates of object point X are defined as [X, Y, Z]^T and the corresponding projection on the camera image plane (x) is [x_c, y_c]^T. These are related by the equation sx = PX, where s is a scale factor and P is a 3x4 projection matrix defined up to scale. The projection matrix P can be decomposed as:

P = K [R | t]   (2.6)

where K is a 3x3 matrix of the intrinsic optical parameters of the camera (the camera calibration matrix) and [R | t] is a 3x4 matrix of extrinsic parameters (R represents a 3x3 rotation matrix and t a translation) defining the transformation of the object (or camera) from the world coordinate system to the camera coordinate system [Lepetit and Fua 2005].

K = \begin{bmatrix} f_x & \text{skew} & P_x \\ 0 & f_y & P_y \\ 0 & 0 & 1 \end{bmatrix}   (2.7)

The intrinsic parameter matrix (K) is composed of f_x and f_y, the focal lengths of the lens in the X and Y axes respectively, determined in terms of pixels per unit distance in the respective axes. P_x, P_y are the pixel coordinates of the principal point in the camera image. We assume rectangular pixels, hence skew = 0. In this work we use Zhang's camera calibration method [Zhang 2000] to recover the camera intrinsic parameter matrix (K) from images captured of a planar checkerboard pattern in different orientations and at different distances to the camera, as shown in Figure 2.43 (right). Additionally, as a camera lens is not optically perfect, we assume a second-order lens distortion model with radial and tangential distortion [Hartley and Zisserman 2003]. Lens distortion is estimated at the same time as calibration with Zhang's method. All camera images captured in this work are corrected for lens distortion before use. Readers are referred to [Hartley and Zisserman 2003] for a full discussion of camera calibration.
2.6.5 Pose Calculation
As an object moves around an environment its appearance changes depending on its relative orientation and distance to the camera. As we are interested in detection of objects in unconstrained real-world scenarios, we aim to recover the full 6 degree-of-freedom 3D location and orientation of the object relative to the camera when it is in any pose, or at any scale. Pose calculation involves calculating the transformation which allows the best fit of an object's 3D model to features detected in a 2D camera image. This is equivalent to calculating [R | t], the extrinsic parameters of the object (or camera). Many different approaches have been proposed, but most require correspondences to already have been established between the image and 3D model before the pose calculation step. Some of the most common methods are described below: A 3D object pose can be calculated directly from correspondences using a Direct Linear Transform (DLT). This method is similar to the homography projector calibration process described in section 2.3.6; however, the formulation is now a linear system of equations for 3D-to-2D projection. A unique solution to the equations can be obtained using Singular Value Decomposition (SVD) when the intrinsic parameters are known and there are 6 or more correspondences. However, as the DLT algorithm minimises algebraic error in its solution, the calculated pose may not be the best geometric fit of the model.
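The projection model of equations (2.6) and (2.7) can be written as a few lines of code. The following self-contained C++ sketch is illustrative only (skew is assumed zero, as above): it transforms a world point into the camera frame and applies the intrinsic parameters, making the role of the scale factor s explicit as the division by depth.

#include <array>

// Minimal pinhole projection: x = K [R|t] X, followed by perspective division.
// K, R and t correspond to equations (2.6) and (2.7); skew is assumed zero.
struct Pose { std::array<std::array<double,3>,3> R; std::array<double,3> t; };
struct Intrinsics { double fx, fy, px, py; };

// Project a 3D world point into pixel coordinates.
std::array<double,2> project(const Intrinsics& K, const Pose& P,
                             const std::array<double,3>& X)
{
    // Transform the point into the camera coordinate system: Xc = R*X + t.
    std::array<double,3> Xc{};
    for (int i = 0; i < 3; ++i)
        Xc[i] = P.R[i][0]*X[0] + P.R[i][1]*X[1] + P.R[i][2]*X[2] + P.t[i];

    // Apply the intrinsic parameters and divide by depth (the scale factor s).
    double u = K.fx * Xc[0] / Xc[2] + K.px;
    double v = K.fy * Xc[1] / Xc[2] + K.py;
    return {u, v};
}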
Instead, we can re-formulate the equations to give us the minimum re-projection error between the 3D points and their 2D coordinates:

[R | t] = \arg\min_{[R|t]} \sum_i \text{dist}^2(P X_i, x_i)   (2.8)

This can now be solved using linear least squares minimisation techniques or non-linear iterative approaches [Lepetit and Fua 2005]. Iterative approaches such as Gauss-Newton and Levenberg-Marquardt can also be used to improve accuracy following an initial pose calculation such as DLT, or when we have an initial estimate of the pose (for example, from other non-vision-based sensors). As incorrect correspondences in the pose calculation process cause errors in the calculated pose, a robust estimator such as RANSAC [Fischler and Bolles 1981] or an M-estimator is typically used in model-based tracking. The two approaches are complementary, with M-estimators producing accurate solutions but requiring an initial estimate, while RANSAC does not require an estimate but typically uses only a subset of all correspondences. RANSAC itself is a simple iterative algorithm, randomly extracting the smallest set of correspondences required to calculate a pose (Fischler and Bolles use 3 correspondences). After pose calculation the algorithm projects all 3D model points and measures the re-projection error to the corresponding 2D points detected in the camera image. If the points are projected close enough, the corresponding points are treated as inliers. RANSAC finally returns the pose with the largest number of inliers, theoretically eliminating incorrectly matched correspondences. DeMenthon and Davis' POSIT algorithm first calculates an approximate pose by assuming the camera model is a scaled orthographic projection [DeMenthon and L.S. Davis 1995]. With a simpler camera model there are fewer unknown components of the projection, hence with more than 4 corresponding points a pose can be calculated by solving a set of linear equations. After the initial estimate each point is weighted based on the re-projection error (scaling the coordinates) and the process is iterated until it converges. The POSIT algorithm does not work when the model is planar; hence, Oberkampf et al. extend the original algorithm for the coplanar case [Oberkampf, DeMenthon et al. 1996]. As incorrect correspondences in the pose calculation process cause an incorrect pose to be calculated, in more recent work David et al. propose combining the POSIT algorithm with a softassign approach [Gold, Rangarajan et al. 1998]. This allows simultaneous determination of pose and correspondence with the softPOSIT method for points [David, DeMenthon et al. 2002], lines [David, DeMenthon et al. 2003] and line features when there are high-clutter backgrounds [David and DeMenthon 2005]. In contrast, voting approaches such as the Generalised Hough Transform (GHT) only represent object structure implicitly for efficient matching, avoiding the need for costly methods to group features for robust pose calculation (such as RANSAC). Instead, matching feature pairs (such as 3D model lines and image edges) are converted into votes for a rigid transformation which would align the corresponding features, assuming the matches are correct. Votes from multiple matches for a consistent transformation form peaks in a voting histogram; hence the transform is robust to outliers.
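The RANSAC loop described above can be sketched generically. The following C++ template is an illustration, not the estimator used in this thesis: the fit callback stands in for a minimal pose solver (for example a 3-point method) and the error callback for the point re-projection error, both of which are assumed to be supplied by the caller, and the model type is assumed to be default-constructible.

#include <cstddef>
#include <cstdlib>
#include <functional>
#include <vector>

// Generic RANSAC loop: repeatedly fit a model to a random minimal subset of
// correspondences, count inliers by re-projection error, and keep the model
// with the most inliers.
template <typename Model, typename Sample>
Model ransac(const std::vector<Sample>& data,
             std::size_t minimalSetSize, int iterations, double inlierThreshold,
             std::function<Model(const std::vector<Sample>&)> fit,
             std::function<double(const Model&, const Sample&)> error)
{
    Model best{};
    std::size_t bestInliers = 0;
    for (int it = 0; it < iterations; ++it) {
        // Draw a random minimal subset of correspondences.
        std::vector<Sample> subset;
        for (std::size_t i = 0; i < minimalSetSize; ++i)
            subset.push_back(data[std::rand() % data.size()]);

        Model candidate = fit(subset);

        // Count correspondences whose re-projection error is below the threshold.
        std::size_t inliers = 0;
        for (const Sample& s : data)
            if (error(candidate, s) < inlierThreshold) ++inliers;

        if (inliers > bestInliers) { bestInliers = inliers; best = candidate; }
    }
    return best;   // the pose supported by the largest inlier set
}

In practice the iteration count is chosen from the expected outlier ratio, and the winning pose is usually refined afterwards on all of its inliers.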
This approach is used with Lowe’s SIFT local features as an initial clustering step to reduce outliers before RANSAC [Brown and Lowe 2002; Lowe 2004]. Geometric hashing approaches use a similar approach, but use a hash table instead of a multi-dimensional voting space for performance reasons [Wolfson 1990; Lamdan and Wolfson 1998]. Hence, while the GHT quantises all possible transformations between a model and object into its bins, geometric hashing quantises only a set of discrete transformations represented by the hash basis. In our work we first establish correspondences between features such as interest points or lines in the camera image and the object’s 3D model, then use a robust approach combining DLT and RANSAC for initial pose estimation and a small number of Gauss-Newton iterations to refine the pose and increase accuracy. Readers are referred to Hartley and Zisserman for a complete introduction to pose calculation [Hartley and Zisserman 2003]. 2.6.5.1 Pose Jitter One of the major problems with pose calculation is jitter, which can arise even with a static camera and object due to two main factors: The first is that mismatched features or noise in the image affects the calculation, so when calculating pose from a few correspondences even small differences in the location of detected features in the camera image can have a large effect on the overall pose calculation. Robust matching techniques such as RANSAC can be used to reject outliers with large errors when the solutions are over-constrained with many correspondences, however, they can still include features with small reprojection errors. The second factor is that numerical methods calculating a Perspective transformation from n Points (PnP) have multiple valid solutions for small numbers of correspondences. For example, the original RANSAC algorithm uses 3 points, which has up to 4 possible solutions. The results can usually be constrained, for example, by ensuring the calculated pose places the object in front of the camera rather than behind. However, this problem cannot be entirely eliminated in the case where ambiguous data 49 2.7 VISION-BASED DETECTION WITH PHYSICAL SENSING causes a model to fit in multiple valid ways. One example for humans is the popular Necker Cube optical illusion, where with certain line or corner arrangements a simple cube appears inverted. In vision systems this can lead to flipping between two poses. Both these factors lead to jitter, which can either be smoothed using a motion model, minimised either using an approach such as keyframes discussed in section 2.6.6, or by employing stronger constraints. One solution using stronger constraints is registration with planar structures such as polygons or circles in the scene. This approach can be useful as many planar surfaces exist in everyday environments. For example, Ferrari et al. track a planar surface under affine transformations and overlay virtual textures [Ferrari, Tuytelaars et al. 2001]. 2.6.6 Hybrid Detection To trade-off the advantages and disadvantages of both the maker-based and markerless approaches hybrid detection approaches have been proposed. Genc et al. first proposed a learning-based approach, where markers are initially used to train a detection system while corners are extracted from the scene [Genc, Riedel et al. 2002]. A bundle adjustment is then performed off-line to reconstruct object geometry. This model is then used for tracking with a robust corner matching algorithm. Similarly, Bougeois et al. 
use pose information from initial marker detection to remove the requirements for hand initialisation of a 3D model-based tracker for successive frames [Bourgeois, Martinsson et al. 2005]. Recently, as computer processing power has increased a real-time tracking by detection approach has become feasible in markerless tracking approaches [Lepetit and Fua 2005]. However, as illustrated by Lepetit and Fua, detection in each frame independently has three main problems: reduced accuracy, increased jitter in the recovered pose and increased processing requirements over narrow-baseline matching. Consequently, imposing temporal continuity constraints across frames can help increase the robustness and quality of the results. For example, Vacchetti et al. propose a hybrid algorithm, combining a small number of key-frame models (generated from training images of the object in known pose) together with a real-time bundle adjustment algorithm [Vacchetti and Lepetit 2004]. This formulates the tracking as both widebaseline matching to the key-frames and narrow-baseline matching to previous frames. The hybrid detection-tracking approach copes with large viewpoint changes due to the key-frames, while reducing pose jitter and drift. Hybrid methods using vision detection with physical sensors have also been proposed. These are discussed further in section 2.7. 2.7 Vision-Based Detection with Physical Sensing There are many failure modes for vision-based detection: giving false positives where no object exists, incorrect classification, or failing to detect an object. Common reasons for failure can be classified as: environment-related failures (such as significant changes in the illumination), motion-related failures (such as fast movement of the object or camera causing blurring of the image), distraction related failures (for example, where multiple objects with identical appearance are present), or occlusion-related failures where objects or environment partly or fully occlude object we want to detect. 50 2.7 VISION-BASED DETECTION WITH PHYSICAL SENSING To address these problems, researchers have used sensing in combination with visionbased detection. Most related work falls into two categories: 1. Sensing for Pose and Object Motion Prediction 2. Structured Light Sensing for Location and Pose 2.7.1 Sensing for Pose and Object Motion Prediction Vision-based detection and pose calculation faces the problem that any rapid motion of the camera or object during the camera exposure generally causes blur. This motion blur changes an object’s appearance; hence, can easily cause a loss of tracking. Hybrid vision-inertial detection systems are often used to overcome this problem. For example, Klein and Drummond [Klein and Drummond 2003], You et al. [You, Neumann et al. 1999] and Chandraker et al. [Chandraker, Stock et al. 2003] all use inertial sensing to predict object motion when an object is not detected visually. Here, the vision and sensing approaches attempt to compensate for the limitations of the other technology, with vision to correct drift in the inertial system, and the inertial system to compensate for blur or occlusion in the camera image. Aron et al. use a similar approach [Aron, Simon et al. 2004], but here inertial sensing is used directly for guided local feature matching. In contrast, Kotake et al. present a different approach, using only an inclinometer sensor in the camera to constrain the detection and pose calculation process [Kotake, Satoh et al. 2005; Kotake, Satoh et al. 2007]. 
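The division of labour in these hybrid systems can be illustrated with a small sketch. The following C++ fragment is a simplified illustration of the general idea rather than any of the cited systems: when visual detection fails for a frame, the last known pose is propagated by integrating inertial measurements; when vision succeeds, its pose replaces the drifting inertial estimate. Types and field names are assumptions for clarity.

#include <array>

// Illustrative hybrid tracking step: vision corrects drift, the inertial
// prediction bridges frames lost to motion blur or occlusion.
struct SimplePose     { std::array<double,3> position;    std::array<double,3> orientation; };
struct InertialSample { std::array<double,3> angularRate; std::array<double,3> velocity; };

SimplePose propagateWithInertial(SimplePose pose, const InertialSample& imu, double dt)
{
    for (int i = 0; i < 3; ++i) {
        pose.orientation[i] += imu.angularRate[i] * dt;  // integrate gyro rates
        pose.position[i]    += imu.velocity[i] * dt;     // velocity estimate, e.g. from integrated acceleration
    }
    return pose;
}

SimplePose updateTrack(bool visionDetected, const SimplePose& visionPose,
                       const SimplePose& lastPose, const InertialSample& imu, double dt)
{
    return visionDetected ? visionPose
                          : propagateWithInertial(lastPose, imu, dt);
}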
Combined vision-based and optical sensing methods have also been used. Klein et al. track a tablet using the edges in its 3D model and IR LED markers [Klein and Drummond 2004]. The LEDs are detected with a separate high-speed fixed camera, allowing detection and pose calculation even under large or fast movements which cause the edge features to blur and vision-based tracking to fail. However, in common with many hybrid systems using outside-in trackers, their system requires an additional off-line calibration procedure to calculate the transformation between the optical sensing and vision-based camera coordinate systems. Everyday objects face many problems when integrating sensing as part of their detection process. For example, inertial sensors are expensive; hence they cannot be routinely integrated with all objects. Separate outside-in trackers (such as optical or magnetic trackers) require additional hardware in the object or environment and explicit calibration before use.
2.7.2 Structured Light Sensing for Location and Pose
Lee et al. [Lee, Dietz et al. 2004; Lee, Hudson et al. 2005] propose using light sensors in the objects to help address some of the problems associated with vision-based detection approaches, such as the problems of figure-ground separation (identifying what is the object and what is the background), variable lighting conditions, material reflectance properties (as reflective or transparent objects are challenging to track) and non-planar or non-continuous surfaces. Lee et al. embedded light sensors in the corners of an augmented portable display screen and projected a series of structured light gray codes towards the object. The sensors detect and transmit observed light values back to the projector and the projector directly locates the display screen in its frame of reference based on the detected changes in brightness over time. Unfortunately, due to the temporal coding used, this approach suffers from two problems: firstly, the user sees a distracting set of flickering patterns on their object; secondly, the gray code localisation takes up to a second, so it cannot be used to augment mobile objects in real-time.
Figure 2.44 (left) Projection on mobile planar surfaces with single light sensor [Summet and Sukthankar 2005], (right) Projection onto surfaces with sensors at each corner for rotation information [Lee, Hudson et al. 2005]
Summet and Sukthankar extended this to real-time interaction on mobile screens [Summet and Sukthankar 2005], by restricting the size and location of projected patterns to the immediate area around the screen corners (as shown in Figure 2.44). This implementation reduced the flickering effect and allowed the remainder of the projection area to be used for display; however, the update rate still limited movement speeds to slow hand motions before tracking was lost. Raskar et al. demonstrated similar objects which sense projected structured light [Raskar, Beardsley et al. 2004]; however, their implementation reverses the display paradigm by assuming a dynamic handheld projector used like a virtual flashlight and static objects. In this case, smart tags attached to objects use active RFID instead of a wireless RF link to return observed light values to the handheld projector.
A gyroscope embedded in the projector also allows the object’s relative 3D location to be calculated directly based on movement of the handheld projector coupled with the changes in the object’s gray code value. More recent work by Raskar et al. on their Prakash system [Raskar, Nii et al. 2007] presented a high-speed system that detects 3D location and orientation of photo-sensing tags using multiple Infra-Red (IR) projectors. Cheap Light Emitting Diodes (LED) were used as the light source, replacing the expensive video projectors used previously. The demonstration system achieved over a 500Hz update rate, based on fast switching of the LEDs and the use of sequentially illuminated dedicated binary code masks in the projector. Structured light techniques require a minimum of one un-occluded light sensor in view of the projector to enable detection, or three light sensors for calculation of 3D location and orientation. The Prakash system [Raskar, Nii et al. 2007] reverses the paradigm, in that it requires only a single sensor, but a minimum of three LED projectors to calculate 3D location and pose. However, both approaches require many more sensors to guarantee correct pose calculation when used with 3D or self-occluding objects. For example, cubical objects would require either 18 light sensors (3 per face), or 6 for the Prakash system to order to detect all poses. In contrast to the approaches that use magnetic, inertial or light sensors directly, in our work any movement sensor information available in a smart object is used to constrain the detection task. This allows indirect and opportunistic use of sensing to 52 2.8 SUMMARY increase detection robustness, without requiring expensive sensors, obtrusive sensors, or a minimum number of sensors to operate. 2.8 Summary As we have seen in section 2.2.4, there is potential for output capability in smart objects to benefit the user by redressing the input-output imbalance in physical tangible interfaces. We reviewed projector-based augmented reality approaches as a suitable means to create non-invasive displays on objects and projector-camera system hardware used for the display itself. Smart objects can then achieve a display on their surfaces by cooperating with projector-camera systems. Of the approaches surveyed we integrate geometric, photometric and colorimetric correction in our work to enable displays that are undistorted and visible for an observer. Of the hardware surveyed we construct a steerable projector-camera system to enable vision-based detection, tracking and projection onto objects. This system is discussed further in Appendix A. We reviewed computer vision approaches suitable for achieving vision-based detection and tracking of smart objects. Of the approaches surveyed in our work we integrate the markerless natural appearance detection cues of colour, texture, shape and features of objects described in section 2.6. We use multiple natural appearance cues as a single cue is not robust enough to detect all objects in every situation; however, we need to understand how to combine the cues to achieve the best detection performance. A first step towards this understanding is to investigate how the natural appearance detection methods perform with the object in different detection conditions, such as with scaling and rotation. To increase robustness to some of the failure modes associated with pure visionbased detection approaches we integrate a hybrid vision and physical sensing detection method in our work. 
We propose using the sensing capabilities of a smart object in cooperation with vision-based detection in the projector-camera system, but we need to understand how sensing helps detection and what sensors are best suited for use. This cooperative approach between smart object and projector-camera system is formalised in the Cooperative Augmentation Framework presented in Chapter 3.
Chapter 3 Cooperative Augmentation Conceptual Framework
In this chapter we present Cooperative Augmentation, a framework enabling smart objects and projector-camera systems to cooperate to achieve non-invasive projected displays on the smart object's surfaces. There are eight defining characteristics of the Cooperative Augmentation concept:
1. Generic, ubiquitous projector-camera systems offering a display service. All knowledge required to detect, track and project on objects was traditionally held by the projector-camera system in smart environments. In contrast, this knowledge is distributed among the smart objects in our approach, so that each smart object now contains the knowledge required to achieve a display. This reduces the projector-camera systems to providing a generic display service, allowing us to assume they are ubiquitous in the environment.
2. Spontaneous cooperation between smart objects and projector-camera systems. Projector-camera systems in the environment are able to support spontaneous interaction with any type of smart object. An object simply registers for use of the generic display service to obtain an output capability on its surfaces.
3. Smart objects embodying self-description knowledge. We assume smart objects are both real world objects with an inherent use and autonomous computational nodes. Objects cooperate with the projector-camera systems to achieve a display by describing knowledge they carry which is vital to the visual detection and projection process, such as knowledge of their appearance. We call this information the "Object Model", as it is a representation of the object's appearance, form and capabilities.
4. Dynamic tailoring of projector-camera system services to smart objects. The projector-camera system uses the Object Model to dynamically tailor its services to the object. The Cooperative Augmentation approach is flexible, as a dynamic configuration process caters for varying amounts of knowledge in the object. All configuration occurs automatically in response to the knowledge embodied by the Object Model.
5. Using smart object capabilities to constrain detection and tracking. When sensor information is available from the object, this can be integrated in the detection and tracking process, allowing us to dynamically constrain the process and increase visual detection performance.
6. Smart objects control interaction with projector-camera systems. After detection the smart object controls the interaction with projector-camera systems. The smart object issues projection requests to the projector-camera system, controlling how the projected output on its surfaces changes and allowing direct visual feedback to interaction.
7. Displays on the objects themselves, without modifying object appearance or function. Unlike physical embedded displays, the projected display is a temporary display which does not permanently modify the object's appearance or function.
8. Projector-camera systems dynamically update knowledge held by the object.
Over time, the camera system extracts additional knowledge about the object’s appearance, and re-embeds this within the object, enriching the original Object Model and enabling increased detection performance. This chapter expands the Cooperative Augmentation concept by explaining the four key areas of: the Cooperative Augmentation environment, the Object Model representation of the Smart Objects, the projector-camera system model and the actual Cooperative Augmentation process itself. 3.1 Cooperative Augmentation Environment We assume all smart objects, projectors and cameras exist in a shared threedimensional space, which we call the “environment”. This allows us to locate each object in a shared frame of reference and easily model the relationships between devices. We term the shared frame of reference the world coordinate system, which is modelled as a three-dimensional Cartesian system. This can have an arbitrary origin in the physical world. 3.2 The Object Model The “Object Model” is a description of a smart object and its capabilities, allowing the projection system to dynamically configure its detection and projection services for each object at runtime. We assume the Object Model knowledge is initially embedded within the object during manufacture. However, the knowledge can also be extended and added to by projector-camera systems. The model consists of five components: 1. Unique Object Identifier This allows an object to be uniquely identified on the network as a source and recipient of event messages and data streams. For example, by the IP address of the object’s hardware. 2. Appearance Knowledge This knowledge describes the visual appearance of the smart object. The description is specific information extracted by computational methods from camera images of the object. For example, knowledge about the object’s colour, or locations and descriptions of features detected on the object. 3. 3D Model A 3D model of the object is required to both allow a projector-camera system to 55 3.3 THE PROJECTOR-CAMERA SYSTEM MODEL compute the object’s pose and enable the framework to refer to individual surfaces. 4. Sensor Knowledge The sensor model is a description of the data delivered by the object’s sensors. The data type is classified into three groups with regard to the originating sensor: movement sensor data, light sensor data and others. The data is further classified into streaming or event-based, depending on the way sensor data is output from the smart object. The model contains associated sensor resolutions, and sensor range information to allow the framework to interpret sensor values. 5. Location and Orientation of the Object When an object enters an environment, it does not know its location and orientation in the world coordinate system. A projector-camera system provides this information on detection of the object, to complete the Object Model. 3.3 The Projector-Camera System Model A projector-camera system consists of a projector, camera and their controlling systems. While many projector-camera systems are typically co-located devices (such as steerable projector-camera systems), we model the physical projector and camera as independent objects. However, each requires knowledge of its current location and orientation in the world coordinate system. This knowledge can be obtained by 5 methods: 1. Direct measurement for static devices. 2. Self-calibration and location methods for static devices (i.e. 
combining techniques discussed in sections 2.3.3, 2.3.6 and 6.4, or multi-camera calibration techniques [Sinha and Pollefeys 2006]). 3. Vision-based Simultaneous Location And Mapping (SLAM) methods (e.g.[Davison and Murray 2002; Davison 2003; Chekhlov, Gee et al. 2007; Klein and Murray 2007]) for mobile devices. 4. By calculation, where steerable hardware is installed in a known orientation [Spassova 2004]. 5. From 3D location and orientation sensing systems attached to the device. This approach allows virtualisation of a projector-camera system pairing across multiple projectors and cameras. The framework can now use any projector or camera hardware distributed in the environment in addition to static, steerable, mobile and handheld projector-camera systems. For example, in an environment with many distributed fixed cameras and a handheld projector, the camera used as part of a projector-camera system pair could vary depending on the location of the projection. In this case, as each device has knowledge of its location and orientation we can calculate the closest camera or the camera with the best view of the projection. We assume projector-camera pairs only exist when the respective viewing and projection frustums overlap, allowing objects detected by the camera to be projected on by the projector. 56 3.4 COOPERATIVE AUGMENTATION PROCESS A projector-camera system has five capabilities in our framework: 1. To provide a service allowing smart objects to register for detection and projection. 2. To detect smart objects in the camera images and calculate their location and orientation based on the knowledge and sensing embedded in the object, as explained in section 3.4. 3. To project an image onto an object in an area specified by the smart object, or choose the area most visible to the projector. 4. To perform geometry correction to a projected image so that the image appears to be attached to the object’s surface and is undistorted. 5. To perform photometric correction to a projected image, compensating for variation in an object’s surface colour and texture so the image appears more visible. In addition, the framework allows explicit modelling of steerable projector-camera system pairs. In this case we assume the projector and camera are co-located and their respective viewing and projection frustums constantly overlap. A steerable projector-camera system has two additional capabilities: 1. To search an environment for smart objects by automatically rotating the pan and tilt hardware. 2. To track detected objects by automatically rotating the pan and tilt hardware to centre the detected object in the camera and projector frustums. 3.4 Cooperative Augmentation Process We decompose the cooperative augmentation of an object into five steps: 1. Registration As an object enters the environment it detects the presence of a location and projection service through a service discovery mechanism. The object sends a message to the projector-camera system requesting registration for the projection service to display messages. On receipt of the registration request, the projectorcamera system requests the Object Model from the smart object. 2. Detection Following registration, the object begins streaming sensor data to the projectorcamera system, as shown in Figure 3.1 (A). This data is used in combination with the Object Model to constrain the visual detection process and generate location and orientation hypotheses (B). 
When an object is located with sufficient accuracy, a location and orientation hypothesis is returned to the smart object (C) to update the Object Model. This process is explained in more detail in section 3.4.1.
Figure 3.1 Detection Sequence Diagram: (A) the smart object sends its Object Model and sensor data, (B) the system detects the location of the smart object, (C) the location update is returned to the object
3. Projection
When an object has knowledge of its location and orientation it can request a projection onto its surfaces. For example, as the projector sends location information to the object, if an object is placed in the wrong area of the environment it could request that a warning message is projected until it is moved to the correct location. This projection request message contains both the content to project and a description of where on the object to project the content, as shown in Figure 3.2 (A). Any geometric distortion that would appear when projecting on an object non-orthogonal to the projector is automatically corrected. The projection image is additionally corrected for the surface colour of the object to make it more visible to the user (B). The projector system starts displaying the corrected content on the object's surfaces immediately on receipt of the request, if the object is in view and the projector system is idle (C). This process is explained in more detail in section 3.4.2.
Figure 3.2 Projection Sequence Diagram: (A) the smart object requests a projection, (B) the system calculates the projection with geometry and colour correction, (C) the corrected content is projected
4. Interaction with smart object
A requested projection is active as long as the object is detected, including during movement or manipulation of the object. Consequently, smart objects can give direct feedback to the user in response to the manipulation or movement of the object by changing the projection. Interactive user-interface components can also be projected, exploiting a user's experience with traditional desktop interfaces while allowing direct visual feedback on the object itself. The possibilities for interaction with smart objects are discussed in more detail in section 6.2.4.
5. Update Appearance
Additional information about the appearance of an object's surfaces can be extracted once the object has been detected and its pose calculated. As part of the cooperative process this new knowledge can be re-embedded into the Object Model for faster and more robust detection on next entry to an augmented environment. Even if an object is already detected reliably with one detection method, extracting more knowledge is beneficial as the environment can also change, for example, when distracting objects with similar appearances are introduced, the wall is painted a different colour, or the object is simply taken to another room. Similarly, the appearance updating capability allows deployment of new detection algorithms to the projector-camera system and automatic update of the object's appearance description to include the new algorithms following first detection with another method.
3.4.1 Detection
The Object Model transmitted to the projector-camera system contains an appearance description which allows visual detection of the object. The projector-camera system dynamically configures its visual object detection processing based on the type of appearance knowledge in the Object Model, and the sensors the object possesses.
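For illustration, the Object Model described in section 3.2 could be represented by a data structure along the following lines. This C++ sketch is an assumption about a possible layout, not the serialisation format used in the implementation; field names and types are invented for clarity.

#include <string>
#include <vector>

// Illustrative layout for the five Object Model components of section 3.2.
struct AppearanceKnowledge {
    std::vector<float> colourHistogram;          // e.g. Lab histogram bins
    std::vector<std::vector<float>> features;    // e.g. SIFT descriptors on the object
};

struct SensorDescription {
    std::string type;          // "movement", "light" or "other"
    bool streaming;            // streaming or event-based output
    double resolution;         // sensor resolution
    double rangeMin, rangeMax; // sensor range, for interpreting values
};

struct Pose6DoF { double x, y, z, rx, ry, rz; };

struct ObjectModel {
    std::string objectId;                    // 1. unique network identifier (e.g. IP address)
    AppearanceKnowledge appearance;          // 2. appearance knowledge
    std::string modelFile;                   // 3. 3D model (e.g. path to a mesh)
    std::vector<SensorDescription> sensors;  // 4. sensor knowledge
    Pose6DoF poseInWorld;                    // 5. filled in by the projector-camera system on detection
    bool poseKnown = false;
};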
As seen in Figure 3.3, this processing involves computation of one or more vision detection methods on images from a camera system, location and orientation (pose) computation for one or more object hypotheses, and return of the best hypothesis to the Object Model.
Figure 3.3 Knowledge flow in the detection process: camera image acquisition, appearance knowledge, background model, sensing, detection, pose computation with the 3D model, and re-embedding of the location and orientation hypothesis in the smart object
The cooperative augmentation framework serendipitously uses sensors the object possesses to constrain the detection process. One typical class of sensor that can be used is movement sensors. Common sensor hardware for movement detection includes accelerometers, ball-switches, tilt-switches and force sensors which detect pick-up and put-down events. If an object is moving, we use visual differences generated between the camera image and a background model to provide a basic figure-ground segmentation in the detection process, increasing the probability of correct detection. Maintaining a background model also allows us to take the object's context into account in the detection process. For example, if we know the object's colour is similar to the background colour we would not use a colour detection method, as the probability of detection is low.
3.4.2 Projection
Following projection requests there are cases where projection cannot begin immediately, such as where the projection system is busy, the object is occluded or the object is out of the field of view of the projector. Here the display requests are cached at the projector system and the projection commences when the object is in view and the projector is available. Projection requests are displayed sequentially, and simultaneous projection onto multiple objects is possible when all objects are within the field of view of a projector-camera system. If multiple projectors exist in the environment, the display request applies simultaneously to all projectors. This allows the object to roam freely in an environment and achieve a display whenever it is in the field of view of a projector. To prevent multiple projectors overlaying displays onto the same surface a display rights token system is introduced. Here, each projector determines the visual quality of its projection based on the distance and relative orientation of the object's surface to the projector using the metric introduced by Ehnes and Hirose [Ehnes and Hirose 2006]. Closer, more orthogonal projectors score higher, allowing a ranking to be performed and the best projector assigned display rights either for the whole object or on a per-surface basis. A rectangular image projected onto a non-perpendicular or non-planar surface exhibits geometric distortion. We compensate for this distortion by warping our projected image, as we know both the surface geometry of the object and the pose of the surface with respect to the projector. We obtain the surface shape from the geometric 3D model embedded within the Object Model and the pose of the object from the detection process. As we have seen in related work section 2.3.6, the geometric correction methods required depend on the surface geometry, with curved surfaces requiring a different approach to planar surfaces. Hence, we use the surface shape to directly configure the type of geometric correction applied in projection.
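For the planar case, the geometric correction amounts to pre-warping the content with a homography. The following OpenCV-style C++ sketch is illustrative (it is not the thesis implementation and assumes a current OpenCV API): given the four corners of the target surface in projector image coordinates, derived from the object's 3D model and estimated pose, it warps the content so that it appears undistorted on the surface.

#include <opencv2/opencv.hpp>
#include <vector>

// Pre-warp content so that it lands undistorted on a planar surface.
cv::Mat warpContentToSurface(const cv::Mat& content,
                             const std::vector<cv::Point2f>& surfaceCornersInProjector,
                             const cv::Size& projectorResolution)
{
    // Corners of the source content image (clockwise from top-left).
    std::vector<cv::Point2f> contentCorners = {
        {0.f, 0.f},
        {static_cast<float>(content.cols - 1), 0.f},
        {static_cast<float>(content.cols - 1), static_cast<float>(content.rows - 1)},
        {0.f, static_cast<float>(content.rows - 1)}
    };

    // Homography mapping content pixels onto the surface's quad in the projector image.
    cv::Mat H = cv::getPerspectiveTransform(contentCorners, surfaceCornersInProjector);

    // Render the pre-warped projector frame; pixels outside the quad stay black.
    cv::Mat projectorFrame;
    cv::warpPerspective(content, projectorFrame, H, projectorResolution);
    return projectorFrame;
}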
As one goal of the framework is to avoid physical modification of the appearance of smart objects, their surfaces can present a challenge to projection. Smooth, diffuse and light-coloured object surfaces are ideal for projection; however, few everyday objects exhibit these characteristics. Certain combinations of projected content and object surface colour can make the projection almost invisible to the human eye, for example, when projecting yellow text on a deep red background. Conversely, with a smooth, diffuse, light-coloured object, the projection illumination on the object can significantly alter its appearance, causing the visual detection process to fail. To compensate for this problem we use the photometric and colorimetric correction techniques discussed in section 2.3.7 in the projection process.
3.5 Conclusion
This chapter expanded and formalised the Cooperative Augmentation concept introduced in Chapter 1, to provide detailed information about the four key areas of the framework (the Cooperative Augmentation Environment, the Object Model, the projector-camera system model and the cooperative augmentation process). We illustrated how knowledge from the Object Model and sensing can be used with a projector-camera system to cooperatively detect an object and project onto its surfaces. An example implementation of this framework is discussed in Chapter 6 and Chapter 7. The next two chapters look in more detail at the detection process. Firstly, Chapter 4 investigates natural-appearance vision-based detection algorithms, while Chapter 5 explores how embedded sensing in the smart objects can be used in cooperation with visual detection to improve detection performance.
Chapter 4 Vision-Based Object Detection
A central problem for achieving displays on smart objects is their detection and tracking. As discussed in section 2.4, common approaches to object tracking involve embedding dedicated hardware location systems. However, many systems have limitations such as a small working volume, which precludes their use with mobile smart objects in unconstrained environments. In contrast, vision-based detection is commonly used in experimental prototypes by placing planar fiducial markers on objects. This enables detection of mobile objects anywhere in the camera's field of view. However, it requires modifying the appearance of an object with visually intrusive markers to enable detection, and for our work it suffers from several key limitations, as discussed in section 2.5. Consequently, with a view to ubiquitous augmentation of objects, it is more realistic to base detection on the natural appearance of objects. While this vision-based detection is non-intrusive, it is a significant challenge in real-world environments, as objects naturally vary in their appearance. Hence, there is an open question as to how best to use natural appearance detection and how different methods perform in different detection conditions, such as when an object appears with scaling and rotation as it moves around an unconstrained environment.
4.1 Natural Appearance Detection
In this chapter we investigate four detection methods, representing different natural appearance cues of objects (colour, texture, shape and surface features). The rationale for studying approaches that rely on different cues is that objects in the real world naturally vary in their appearance – hence we assume that multiple methods should be provided as alternatives for detection.
No single cue is both general enough and robust enough to cope with all combinations of object appearance and environment. We perform an experimental study targeted at understanding the impact of object scale and rotation on different detection methods. This is important for detection in realistic scenarios, as objects will appear at varying distances and orientations with respect to the camera. We also look in-depth at the features cue, investigating the impact of invariance to scale and rotation in different feature detection algorithms. As a result of the study, we gain insight into training requirements of different detection approaches to enable vision-based detection. This is important for embedding appearance knowledge into smart objects to achieve initial detection and for the 62 4.2 OBJECT DETECTION METHODS cooperative augmentation process to know when the appearance knowledge updating step is best performed (see section 3.4). 4.2 Object Detection Methods Four detection methods were chosen as being well-known and typical for their respective appearance cues. These four methods form the basis for the studies presented in section 4.4 and Chapter 5: For Colour we use Swain and Ballard’s colour histograms [Swain and Ballard 1991]. We choose to use CIE 1976 L*a*b* colour histograms, as this model was empirically found to detect light and dark objects better than Hue-Saturation-Lightness or RGB colour models. ‘L’ defines lightness; ‘a’ is the red/green value and ‘b’ the yellow/blue value. Unlike the RGB model, Lab colour is designed to approximate human vision, aiming for perceptual uniformity. We calculate a 3-dimensional histogram of the image, with each dimension divided into 16 bins, each 16 values wide. For Texture we use Schiele’s multidimensional histograms of receptive fields [Schiele and Crowley 2000]. We choose to use 2-dimensional Gradient Magnitude and Laplacian histograms due to their invariance to rotation in the camera plane. Each dimension has 32 bins, each 8 values wide. Scale invariance is achieved by training with images of the object at multiple scales then creating a scale space when detecting the object by Gaussian smoothing the image with increasing standard deviation (σ). Here we train with images smoothed with σ=2.0 then use 3 scales in detection, equal to 0.5σ, σ, 2σ, to allow objects at different scales to be detected. For Shape we use Belongie’s shape context descriptors [Belongie, Malik et al. 2001]. The contours on an object are matched with 100 points, using 5 radial bin and 12 angle bin histograms. The descriptors are made scale invariant by resizing the diameter of the radial bins equal to the mean distance between all point pairs, and rotation invariant by averaging the angle of all point pairs and calculating all angles relative to the mean. For Features we first perform both an in-depth study of the Local Features method, where we evaluate 9 detection algorithms: Harris, Harris-Laplace, Harris-Affine, Hessian, Hessian-Laplace, Hessian-Affine, Difference-of-Gaussians (DoG), Laplacianof-Gaussians (LoG) and Maximally Stable Extremal Regions (MSER) [Mikolajczyk, Tuytelaars et al. 2005]. All these algorithms are evaluated in combination with Lowe’s Scale Invariant Feature Transform (SIFT) descriptor [Lowe 2004] . 
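As an illustration of the colour cue's training step described at the start of this section, the following sketch computes the 3D L*a*b* histogram with 16 bins per channel using a current OpenCV-style API. It is an approximation for clarity rather than the thesis code (which used the Intel OpenCV library of the time), and the mask is assumed to select only object pixels.

#include <opencv2/opencv.hpp>

// Sketch of the 3D L*a*b* colour histogram used for the colour cue
// (16 bins per channel, each 16 values wide for 8-bit data).
cv::Mat labHistogram(const cv::Mat& bgrImage, const cv::Mat& objectMask)
{
    cv::Mat lab;
    cv::cvtColor(bgrImage, lab, cv::COLOR_BGR2Lab);    // 8-bit Lab, each channel 0..255

    int channels[]        = {0, 1, 2};
    int histSize[]        = {16, 16, 16};              // 16 bins per dimension
    float range[]         = {0.f, 256.f};
    const float* ranges[] = {range, range, range};

    cv::Mat hist;
    cv::calcHist(&lab, 1, channels, objectMask, hist, 3, histSize, ranges);
    cv::normalize(hist, hist, 1.0, 0.0, cv::NORM_L1);  // bins sum to one
    return hist;
}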
For the remainder of the studies in this work we use only the complete SIFT algorithm presented by Lowe (comprising DoG detector and SIFT descriptor) as this is one of the most popular and widely used local feature algorithms, giving a good compromise between discriminative power and robustness [Mikolajczyk and Schmid 2005]. SIFT is invariant to scale, rotation in the plane of the camera image and partially invariant (robust) to changing viewpoint and illumination. We use the standard scale space image pyramid presented by Lowe, with 3 scales per octave and σ = 1.6. In terms of complexity, the methods range from low (colour histograms) to high (SIFT local features), with texture and shape falling somewhere in between. More information on each of these methods can be found in related work section 2.6.2. 63 4.3 4.3 EVALUATION DATASET Evaluation Dataset We use ten objects as the dataset for experiments, reflecting everyday objects of varying shape, size and appearance (see Figure 4.1). They are a football, a chemical container barrel, a book, a product box, a smart cube, a chair, a cup, a notepad, a cereal box and a toaster. The largest object was the chair (90x42x41cm); the smallest was the cube (9x9x9cm). Figure 4.1 Experiment Objects (left to right, top to bottom): a football, a chemical container barrel, a book, a product box, a smart cube, a chair, a cup, a notepad, a cereal box and a toaster. 4.3.1 Object Appearance Library We created an object appearance library to train algorithms for the experiments presented in section 4.4 and Chapter 5. The library consists of images of the objects with varying scale and rotation against plain backgrounds. Colour images of each object were acquired with even illumination using a Pixelink A742 machine vision camera, with 1280x1024 pixels and a 12mm lens (with a 40.27x30.75° field of view). For the varying scale, images of the objects were captured in 5cm intervals between 1 and 6m from a fixed camera (10x100=1000 images). 6m approximated the size of a large room used in our scenario. The camera was horizontal, perpendicular to the front surface of the objects and at the vertical centre of the object. For rotation, images of the objects were captured at distances of 2m, 3m, 4m and 5m from a fixed camera. At each distance the objects were rotated in 10° intervals for a full 360° around the object’s vertical world axis using a turntable (4x36x10=1440 images total). The camera was fixed 1.5m above the height of the turntable with a 40° declination angle, providing a view of both the top surface and sides of the objects on the turntable between 2 and 5m from the camera. All images were manually annotated with an object bounding box for a ground truth object location. Figure 4.2 (left) Box Object Scale images at 1m, 3m, 6m from camera, (right) Notepad object rotation images at -40°, 0°, +40° 64 4.4 4.4 SCALE AND ROTATION EXPERIMENTS Scale and Rotation Experiments One of the challenges for visual detection algorithms is to reliably detect objects. Real-world objects are composed of different structures at different scales, which means in practice they appear different depending on the scale of observation. This effect can be seen in Figure 4.3, where different numbers of corners are detected and in different locations when the cereal box object appears at different distances to the camera. 
Figure 4.3 Different numbers of corner features (yellow dots) are detected and in different locations on the Cereal box object at 1m, 3m and 6m distance with the single-scale Harris algorithm [Harris and Stephens 1988] When detecting an object in an unknown scene, there is no way to know at which scale the object will appear, as the distance to the object is unknown. In theory, by representing the object or camera image at multiple scales in a scale-space, detection algorithms can be made scale-invariant. A scale-space is built by successively smoothing an image using a Gaussian kernel with an increasing standard deviation (σ), to remove more and more fine detail [Lindeberg 1990]. We can now try to match the object at different scales or choose the most appropriate scale. Figure 4.4 Camera and Object Detection Coordinate System Transformations Rotation of an object can be decomposed into 2D rotation in the camera plane rz, equivalent to rolling the camera and general 3D rotation (rx,ry) equivalent to changing the camera viewpoint (see Figure 4.4). When an object is rotated in the 2D plane of the camera image (rz), the method we use to detect an object may cause it to appear different. For example, if our detection algorithm relies on first gaussian derivatives (dx,dy) for gradient calculation, the results will change depending on the orientation of the object. Detection algorithms can be made invariant to this 2D rotation, however, general 3D rotation of an object presents another problem. In this case there are two separate aspects to the problem: 1. The appearance of an object surface becomes distorted when it is rotated from being perpendicular to the camera vector tz. 65 4.4 SCALE AND ROTATION EXPERIMENTS 2. As the object rotates, surfaces disappear from view and new surfaces appear. We perform experiments to answer the four following research questions: R1) What is the impact of using local feature algorithms invariant to scale, rotation or affine transformation, rather than non-invariant algorithms? R2) Is scale and 3D rotation an issue for the final 4 detection methods we choose? R3) At what distance do we need to train our 4 methods? R4) Are some of the 4 methods more robust to scale and rotation than others? . 4.4.1 Design 4.4.2.1 Research Question R1 To address question R1 we perform four series of experiments using 9 local feature detection algorithms {Harris, Harris-Laplace, Harris-Affine, Hessian, Hessian-Laplace, Hessian-Affine, Maximally-Stable-Extremal-Regions (MSER), Difference-of-Gaussians (DoG), Laplacian of Gaussians (LoG) } and the SIFT descriptor algorithm: 1. The first experiment investigates the quantity of features detected with scale. 2. The second experiment series investigates detection repeatability over the whole scale range when the algorithms are trained at different distances. For six training distances scale variant algorithms were compared against scale and affine invariant algorithms. 3. The third experiment series investigates 2D rotation of an object in the camera plane (rz) and rotation invariance in feature descriptors by comparing the descriptor matching performance of rotation variant and rotation invariant algorithms under 2D rotation. 4. The fourth set of experiments addresses the general case of local feature detection performance with 3D object rotation (rx,ry) by comparing the detector matching repeatability and descriptor matching performance of rotation variant and rotation invariant algorithms under 3D rotation. 
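Referring back to the scale-space construction described at the start of this section, the following OpenCV-style C++ sketch builds the small set of smoothed images used by the texture cue. The sigma values (0.5σ, σ, 2σ with σ = 2.0) follow section 4.2; everything else is an illustrative assumption rather than the experimental code.

#include <opencv2/opencv.hpp>
#include <vector>

// Minimal Gaussian scale-space: the image is repeatedly smoothed with
// increasing standard deviation so that fine detail is removed at coarser scales.
std::vector<cv::Mat> buildScaleSpace(const cv::Mat& grey, double baseSigma = 2.0)
{
    std::vector<double> sigmas = {0.5 * baseSigma, baseSigma, 2.0 * baseSigma};
    std::vector<cv::Mat> levels;
    for (double sigma : sigmas) {
        cv::Mat smoothed;
        // Kernel size (0,0) lets OpenCV derive an appropriate size from sigma.
        cv::GaussianBlur(grey, smoothed, cv::Size(0, 0), sigma);
        levels.push_back(smoothed);
    }
    return levels;
}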
4.4.2.2 Research Questions R2 to R4
R2 to R4 look at the performance of the final four detection algorithms: Lab colour histograms, Mag-Lap multi-dimensional histograms for texture, shape context for shape and SIFT for features. We address R2 and R4 by performing another two series of experiments:
1. The first series evaluates the average detection performance of each algorithm over all objects, when the objects are scaled.
2. The second series evaluates the average detection repeatability for SIFT local features over all objects, and the average detection performance of the other three algorithms over all objects, when the objects are rotated.
We investigate whether scale and rotation is an issue, and compare the results between algorithms to identify whether some detection methods are more robust than others. From this we also gain insights into the training knowledge required to detect objects, addressing R3.
4.4.2 Procedure
For the local feature experiments, we define a procedure for calculating detection repeatability and descriptor matching performance as explained below. To compare the relative performance of interest point detectors we use the repeatability criterion described by Mikolajczyk and Schmid [Mikolajczyk and Schmid 2004]. For this we compute the percentage ratio between the number of point or region correspondences (found between the current image from the scale set and the features found on the object in the training image) and the minimum number of points detected in both images. The correspondences are established by projecting points from each image into the other using a manually-annotated ground truth homography. We establish a correspondence when the following two criteria are met:
1. The point locations are less than 1.5 pixels apart when projected.
2. Additionally, for scale and affine-invariant points, the projected image region has an overlap intersection error εs < 0.4. This corresponds to 40% overlap error and is chosen because, according to Mikolajczyk et al., regions with 50% overlap error can still be matched successfully with a robust descriptor [Mikolajczyk, Tuytelaars et al. 2005].
The intersection error of the regions (εs) is defined in terms of the intersection and union of the regions:

\varepsilon_s = 1 - \frac{\mu_a \cap (A^T \mu_b A)}{\mu_a \cup (A^T \mu_b A)}   (4.1)

where μa and μb are the regions (represented as elliptical regions for affine-invariant features or circles for scale-invariant features, and defined by x^T μ x = 1), A is the locally linearised ground truth homography relating the images and A^T its transpose [Mikolajczyk, Tuytelaars et al. 2005]. The error can be computed numerically, or in our case by counting the number of pixels in the union and the intersection of the regions when one region is projected into the second using the homography. Relative matching performance of the descriptors is evaluated by establishing correspondences between descriptors in the test image and descriptors in the training image using nearest-neighbour Euclidean distance. Correct matches are determined by the overlap error (as for detector repeatability), but here we assume a match is correct if the overlap error is < 50% of the region (εs < 0.5). The final matching percentage score (known as recall) is calculated as the number of correct matches with respect to the total correspondences:

\text{Recall} = \frac{\#\text{correct matches}}{\#\text{correspondences}}   (4.2)

4.4.2.1 Research Question R1
4.4.2.1 Research Question R1

1. For the first series of experiments the 9 local feature algorithms were run on each image from the object appearance library scale set (see section 4.3.1) and the number of features detected inside the ground truth object location bounding box was recorded.
2. The second series is divided into 6 sub-experiments, corresponding to 6 algorithm training distances. Each local feature algorithm was trained at meter intervals between 1 and 6m (inclusive) using images from the object appearance library scale set. For training, feature detection was constrained to only detect features on the object by using the manually annotated ground-truth bounding box. Features were then detected in the remaining images in the scale set and the detection repeatability between the training and test images calculated using the method described above.
3. For each of the images of objects at 3m distance in the object appearance library scale set (see section 4.3.1), a 2D object rotation was simulated by rotating images in 10° increments between 0° and 350° using an affine transform. Simulation was used for this experiment as it was the easiest way to ensure accurate and repeatable rotations across all the objects without the need for specialised hardware to physically rotate the object. The SIFT descriptor [Lowe 2004] is computed for each interest point detected by the local feature detection algorithms in all images. The detection algorithms were trained with images of all objects at 0° rotation and matched against images of the objects at the other rotation angles to compare non-rotation-invariant descriptors against rotation-invariant descriptors.
4. The fourth experiment is divided into 4 sub-experiments, corresponding to 4 algorithm training distances at meter intervals between 3 and 6m. In each sub-experiment the 9 algorithms were trained with the 0° rotation images of each object at the respective distance from the object appearance library rotation set (see section 4.3.1). These training images were matched to the -70° to +70° rotation images, comparing both scale-invariant against non-scale-invariant detector repeatability and non-rotation-invariant against rotation-invariant descriptor matching performance.

4.4.2.2 Research Questions R2 to R4

For the final 4 detection algorithms we define a procedure for calculating detection performance as detailed below.

For colour histogram detection we use a variation of Bradski's CAMSHIFT approach [Bradski 1998]. The object histogram is first back-projected into the camera image and the bounding box of the largest blob is detected, representing the most likely object location. The CAMSHIFT algorithm is then used to refine the bounding box size and location based on the original object histogram.

For texture detection with multi-dimensional histograms we use an exhaustive search method, dividing each scale image into a grid of scale-adapted 2D windows of uniform size, with 25% partial overlap between each window. Each area's histogram is calculated and matched against the object's histogram. For object location we use a mean-shift [Comaniciu and Meer 2002] clustering approach similar to Leibe and Schiele's scale-adaptive method [Leibe and Schiele 2004] when match result maxima are greater than a pre-determined threshold. Hence, the 3D mean-shift clustering function (X,Y,Scale) acts as a Parzen window probability density estimation for the position of the object, allowing a more accurate location hypothesis. We increase the clustering window size as scale increases above 1.0, as the histogram matching results are spread over a larger area.
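A minimal sketch of the exhaustive window search described above (an illustrative stand-in: it uses a plain 1D grey-level histogram rather than the Mag-Lap multi-dimensional histograms, and is not the thesis implementation):

#include <vector>
#include <opencv2/opencv.hpp>

struct Vote { cv::Point2f centre; double score; };

// Slide a window with a 75% step (25% overlap between neighbouring windows),
// compare each window histogram against the trained object histogram and keep
// the window centres whose match score exceeds a threshold as location votes.
std::vector<Vote> windowSearch(const cv::Mat& grey, const cv::Mat& objectHist,
                               int window, double threshold)
{
    std::vector<Vote> votes;
    int step = static_cast<int>(window * 0.75);
    int histSize = 32;
    float range[] = {0.f, 256.f};
    const float* ranges[] = {range};
    int channels[] = {0};

    for (int y = 0; y + window <= grey.rows; y += step) {
        for (int x = 0; x + window <= grey.cols; x += step) {
            cv::Mat patch = grey(cv::Rect(x, y, window, window));
            cv::Mat hist;
            cv::calcHist(&patch, 1, channels, cv::Mat(), hist, 1, &histSize, ranges);
            cv::normalize(hist, hist, 1.0, 0.0, cv::NORM_L1);
            double score = cv::compareHist(hist, objectHist, cv::HISTCMP_INTERSECT);
            if (score > threshold)
                votes.push_back({cv::Point2f(x + window / 2.f, y + window / 2.f), score});
        }
    }
    return votes;   // input to the mean-shift clustering step described above
}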
For shape detection we use a similar approach to texture, but restricted to a single scale, due to the scale-invariance of our shape context algorithm.

For the colour histogram, shape context and multi-dimensional histogram algorithms, correct detection was assumed when the detection bounding box had less than 50% overlap error with the ground truth bounding box from the test image library. For SIFT local features correct detection was assumed when a minimum of 8 features were matched to the training image using nearest-neighbour Euclidean distance matching and more than 50% of the feature correspondences were correct.

1. The first experiment set investigates scale-invariance in the 4 detection algorithms and aims to quantify in what scale range we can repeatably detect an object. We performed 6 sub-experiments, each with a different training distance. All of the algorithms were trained with an image at meter intervals between 1 and 6m (inclusive), resulting in the 6 sub-experiments. The remaining images of the object appearance image library (99 images) were used for testing. The results for all four cues are averaged over all objects to give the percentage of all detected objects over the scale range (every 5cm).
2. The second set of experiments investigates 3D rotation. For each object we trained the algorithms using a single 0° image from the object appearance library. The remainder of the images between -80° (anti-clockwise rotation of the object from 0°) and +80° (clockwise rotation) were used to evaluate the percentage of objects detected with each algorithm. We both trained and tested the algorithms with images of the objects at 3m distance, as this is the centre of our working range.

4.4.3 Apparatus

A 3.4GHz dual core Pentium-4 computer running Windows XP SP2 was used for all experiments in this thesis. The detection algorithms were implemented in C++ using the Intel OpenCV API for image processing.

4.4.4 Results

We first present results for the first four experiments, investigating local feature algorithm performance with scaling and rotation, followed by results of the detection performance for the final four selected detection algorithms for colour, texture, shape and features.

4.4.4.1 Research Question R1: Local Features with Scaling and Rotation

Although we performed the experiments for all objects in the object appearance library, we only present the results for the Mediacube object below, as the general trends for all objects are the same; hence, these results are representative of the other objects. Additionally, we only present results for the Harris-based family of detectors, as the result trends are also representative of the relative Hessian-based detector performances.
Figure 4.5 (left) Number of features detected on Mediacube object by Harris-based algorithms, (right) Harris-based algorithm detector repeatability when trained at 4m

As can be seen in Figure 4.5 (left), the number of points detected by all Harris-based detectors decreases with distance, as the object appearance gets smaller in the image. The single-scale Harris detector always detects the fewest interest points as only the original image scale is used. In contrast, the Harris multi-scale detector detects all points where the Harris operator reaches a maximum in scale-space, leading to between 3 and 5 times more interest points throughout the whole object distance range. The number of points detected by the Harris-Laplace scale-invariant detector falls between the two, as it only detects points where the Laplacian also reaches a maximum. The affine-invariant Harris detector returns fewer points than the scale-invariant detector as the algorithm discards points which are not also invariant to affine transformations.

The difference in repeatability of the Harris-based detectors when trained with a single image of the object at 4m distance can easily be seen in Figure 4.5 (right). Here the single-scale Harris only has a small 2m range (3m to 5m) where points are repeatably detected, centred on the training distance. In contrast, multi-scale Harris-Laplace and scale-invariant Harris-Laplace have an almost constant performance across the scale range. Harris-Affine displays a general improving trend from 1m towards 6m, with a peak around the 4m training distance. For a 4m training image the scale factor ranges from 1:4 at 1m to 1.43:1 at 6m, with 4m being 1:1.

Figure 4.6 (left) shows that for the single-scale Harris algorithm to achieve detection repeatability approaching that of the multi-scale and scale-invariant Harris-Laplace algorithms (over 60% throughout the test range), we must train and match the algorithm at 4 separate distances. The training image distances of 1.1m, 1.85m, 3m and 5m were determined empirically to both minimise the scale overlap between detectors and maximise the repeatability across the whole scale range. This is equivalent to running the Harris detector algorithm 4 times independently on the image, with the consequent quadrupling of runtime.

The mean repeatability of the Harris-based algorithms over the whole distance range while varying the training data distance is shown in Figure 4.6 (right). Here we can see that both the multi-scale and scale-invariant Harris-Laplace algorithms have a stable mean repeatability of around 91.22% and 75.12% respectively when the training image distance is varied. However, the training image scale makes a large difference both for Harris and Harris-Affine. The Harris algorithm has the highest repeatability (41.10%) when trained at 5m, and Harris-Affine reaches a peak of 74.98% when trained with images of the object at 6m.
Figure 4.6 (left) Harris detector with multiple training images 1.1m, 1.85m, 3m, 5m, (right) Mean repeatability with varying training distance for Harris-based detector algorithms

Figure 4.7 (left) compares rotation-invariant descriptors with rotation-variant descriptors, both with and without multiple training images. As shown by the red dashed line, the rotation-variant Harris-Laplace-SIFT matching result with only a single training image is only ever above 50% within 50° of the training image 0°/360° angle. In contrast, the rotation-invariant descriptor performance was more constant, with a mean average around 60% for the whole 360° rotation. Rotation-variant matching performance was also evaluated relative to multiple training images. In this case 60° rotation increments were determined empirically to produce performance similar to rotation-invariant descriptors, while using the least training images.

For 3D objects such as the Mediacube the number of features detected did not change significantly with 3D rotation, as seen in Figure 4.7 (right), due to textured surfaces on multiple faces being visible at each rotation angle. However, for 2D objects such as the notepad, or objects where some surfaces lacked strong texture, the number of points detected depended greatly on viewpoint. This led to large changes in the detector repeatability and descriptor matching percentages with 3D object rotation.
Figure 4.7 (left) SIFT Descriptor matching percentages with 2D rotation, (right) Number of features detected with 3D rotation

Figure 4.8 (left) Detector repeatability for 3D rotation, (right) Descriptor matching percentages for 3D rotation

The repeatability of interest point detection with viewpoint transformation is shown in Figure 4.8 (left). As can be seen, for all algorithms the repeatability is generally at a maximum near the training image orientation (0°) and decreases with increasing rotation in either direction. In this graph the multi-scale Harris-Laplace algorithm performs best. Affine invariance only leads to a small improvement in mean repeatability over scale invariance (37% against 34% respectively) over the whole rotation range. On average, between 20° and 40° from the training image orientation the repeatability has decreased to 50% or under.

The matching performance of SIFT descriptors with 3D object rotation is shown in Figure 4.8 (right). As can be seen, all algorithms have almost identical performance irrespective of whether the descriptor is rotation-variant or invariant.

4.4.4.2 Research Questions R2 to R4: Detection with Scaling and Rotation

To address questions R2 to R4 with the final 4 detection algorithms we first present the detection results for scale over all objects, followed by the results for 3D rotation.

Figure 4.9 (left) shows detection performance over all objects for SIFT local features. It is clearly visible that the percentage of objects detected falls below 50% after 4.5m distance when trained at 1m. For the chosen scale range, the average percentage of objects detected is highest when trained at 2m (M=82.31, SD=14.34). However, the detection percentage varies least across the scale range when trained at 6m (M=74.44, SD=9.32). Figure 4.9 (right) shows detection performance over all objects for the Shape Context algorithm. Detection percentage is lowest when we train at 6m, never increasing above 30% across the scale range.
The performance when training at 1m and 3m shows a downward trend with similar detection performance for both; however, the highest percentage of objects was detected when trained at 1m (M=56.10, SD=19.71).

Figure 4.10 (left) and (right) show detection performance over all objects for Texture and Colour respectively. The results for both experiments show an overall downward tendency for all training distances, with little difference in performance between the training distances. For Texture, the highest percentage of objects were detected with a training distance of 2m (M=72.74, SD=26.59). Here, 50% or more of the objects are detected between 1m and 6m, with the exception of 4.5m to 5.5m where, unusually, the performance drops. In contrast, for Colour the highest percentage of objects are detected when we train at 6m (M=38.05, SD=46.81). The standard deviation for all colour training distances is high, indicating a high variability of detection performance between objects. In fact, 50% of the objects were never detected by colour.

Figure 4.9 (left) Detection performance over all objects for SIFT local features at 1m, 2m and 6m training distances, (right) Detection performance over all objects for Shape Context at 1m, 3m and 6m training distances.

Figure 4.10 (left) Detection performance over all objects for Texture (Mag-Lap) Multidimensional Histograms at 1m, 2m and 6m training distances, (right) Detection performance over all objects for LAB Colour Histograms at 1m, 3m and 6m training distances.

Figure 4.11 (left) shows detection repeatability over all objects at 3m, for SIFT local features, when we rotate in the horizontal plane. The curve is bell-shaped and shows a sharp fall-off in repeatability as we rotate objects away from 0°. Around 20° rotation the repeatability falls to around 40% on average, for all objects. This means only 40% of the original training image features are still being detected. This performance is object dependent. For example, the book object has repeatability greater than 40% between -40° and +40°, peaking at 10° (65.25%). In contrast, the barrel's repeatability is only ever above 40% at 10°, where it reaches its peak (42.11%).

Figure 4.11 (left) Detection percentage repeatability of SIFT (DoG) over all objects when varying 3D rotation angle, (right) Detection performance over all objects when varying 3D rotation angle, for Texture, Colour and Shape, at 3m training and test distances.

Figure 4.11 (right) shows the percentage of objects detected at 3m, for the remaining three algorithms. For both Texture and Shape the curves have a bell-shaped trend, similar to SIFT. The colour algorithm varies between 80% and 50% of objects detected, but does not show a decreasing tendency with rotation.

4.5 Discussion

This section discusses issues arising from the Scale and Rotation experiments in section 4.4.

4.5.1 Research Question R1

For local feature algorithms, the drawback of a single-scale algorithm becomes apparent, as features are only repeatably detected and matched around the training image scale.
The scale at which we extract our smart object's appearance makes a big difference, and to enable detection throughout the whole scale range for an object such as the Mediacube we need to match features detected from training images at approximately each meter (replicating a multi-scale algorithm). The most repeatable interest point detection occurred with the multi-scale algorithm; however, as can be seen in Figure 4.5 (left), the multi-scale Harris-Laplace algorithm detects many more features than the other Harris-based algorithms, so consequently has a higher processing cost. The additional number of descriptors to match also causes a higher probability of mismatches with large scale errors. Both the number of interest points returned by the detectors and the number of mismatches are important, as wide-baseline detection approaches based on feature correspondences require enough correct matches to enable pose calculation when using robust approaches such as RANSAC [Fischler and Bolles 1981]. As RANSAC easily fails with more than 50% mismatches, there is a trade-off between having enough interest points to ensure pose calculation and the number of mismatches.

Our results agree with those presented by Mikolajczyk et al. [Mikolajczyk, Tuytelaars et al. 2005], which showed that scale-invariant Harris-Laplace performs better than Harris-Affine with large scale changes. This is illustrated by the reduced Harris-Affine performance near 1m in Figure 4.5 (right).

Figure 4.7 (left) shows it is possible to make use of matching to multiple training images to recover object rotation. However, we need training images at intervals of at least 60° for rotation-variant SIFT descriptors to perform equal to rotation-invariant descriptors. This increases the number of training images we must match by a factor of at least 6 for each 3D rotation of the object. As can be seen when comparing Figure 4.7 (left) and Figure 4.8 (right), the performance of algorithms under affine transformations (such as 3D rotation) is very different to 2D rotation in the camera plane. The use of rotation-invariant descriptors only provides invariance to 2D rotation in the plane of the camera. For 3D rotation multiple object viewpoints must be stored and matched, with a consequent increase in runtime.

Unfortunately, when detecting objects moving around a large environment in the real world, objects are likely to move in multiple axes simultaneously. As features disappear and re-appear as an object is manipulated or moved, multiple viewpoints are required to detect the object in all poses. For example, in Figure 4.8 (right) the matching performance with the Mediacube object is only above 50% between the angles of -25° and +35°, which is equal to a 60° useful viewing angle (centred roughly on the 0° training orientation). This result agrees well with those presented by Mikolajczyk and Schmid [Mikolajczyk and Schmid 2005] and Lowe [Lowe 2004], who recommend a maximum rotation of 60° between viewpoints for 3D object detection. This would require 14 viewpoints to describe a 3D object with a viewing sphere using equally spaced 60° polar intervals (26 for 60° latitude and longitude intervals). The use of a non-scale-invariant detector and rotation-variant descriptors multiplies this requirement by another 6, for a total of 84 training images to match for just single-scale detection. In contrast, when using scale- (or affine-) invariant detectors and rotation-invariant descriptors a full viewing sphere only requires training and matching the 14 viewpoints.
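As a worked example of the 14-viewpoint figure (under the assumption of 60° polar spacing, with a single view at each pole and a full ring of views at every other polar angle), a few lines of C++ reproduce the count:

#include <cstdio>

int main()
{
    int views = 0;
    for (int polar = 0; polar <= 180; polar += 60) {
        if (polar == 0 || polar == 180)
            views += 1;            // a pole gives one view regardless of azimuth
        else
            views += 360 / 60;     // a full ring of azimuth samples: 6 views
    }
    std::printf("viewpoints: %d\n", views);   // prints 14 (1 + 6 + 6 + 1)
    return 0;
}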
4.5.2 Research Questions R2 to R4

The results from the experiments to address R2 to R4 show it is important to consider scale and rotation effects when designing a detection system with multiple cues. For example, for scale, we learned that different algorithms perform best when trained at different distances. Shape performs best when trained at 1m, Local Features and Texture are best at 2m and Colour performs best at 6m.

Similarly, the algorithms perform differently when we rotate objects. Here, colour does not exhibit a strong decreasing tendency with rotation, suggesting that one viewpoint of a uniformly colourful object may be enough to detect the object in any pose, and only a small number of viewpoints are required for 3D objects with non-uniform surfaces (Swain and Ballard recommend 6 viewpoints [Swain and Ballard 1991]). In contrast, the more bell-shaped curves for the other methods indicate that to detect an object in any pose we need to extract information from many more viewpoints.

These results support both our starting assumption that different detection methods behave differently and the argument for using multiple detection methods. The fact that detection performance results when scaling and rotating objects were often not close to 100% around the distance or angle where we train our detection system also suggests that some of the objects were either not detected by a particular method, or there is a large inter-object detection performance variability with the algorithms. This finding is investigated further in Chapter 5.

4.6 Conclusion

In this chapter we conducted an experimental study to understand the impact of object scale and rotation on different detection methods. We first studied in-depth the impact of scale and rotation on local feature algorithms, finding that scale- and rotation-invariant algorithms provide performance benefits over single-scale and rotation-variant algorithms. This enables us to use invariant algorithms to detect objects without loss of discrimination; hence, we chose to use the popular SIFT algorithm in our work to detect objects with the features cue.

From the second set of experiments we found that scale and rotation have a large impact on detection performance. An object's appearance changes as it is rotated in 3D, but the change can be much greater than with scaling, as whole surfaces can appear or disappear. This has important implications for our approach, and we found that, with the exception of the colour method, we need to train our algorithms with multiple object viewpoints to detect an object in any pose. Similarly, we learned that detection performance varies when the algorithm is trained at different distances and found the best training distances for each algorithm based on our dataset. This knowledge is significant as it enables our framework to extract additional appearance knowledge from the camera image when an object is at the best training distance.

Chapter 5 Cooperative Detection

The experiments presented in Chapter 4 demonstrated that it is possible for a camera to detect objects at a range of scales and orientations by using the natural appearance cues of colour, texture, shape and features on objects. These experiments were performed using training and test images with plain backgrounds. However, cluttered real-world environments with distractions are a large problem for traditional computer vision approaches and can negatively impact detection performance.
For example, if two identical objects appear in the camera image, there is no way for the camera system on its own to determine which is the correct object and which is the distraction. In this chapter we look at how cooperation between a smart object and the vision-based detection system allows the sensing capability of objects to be used in the detection process. To achieve this we perform a study exploring cooperative detection between the camera system and smart object. Specifically, we analyse the increase in detection performance achieved when using movement sensing in the smart object.

5.1 Video Test Library

Synchronised sensor data and video of each object moving in a cluttered lab environment were captured for the testing library. The objects and equipment used were identical to those described in section 4.3. Video of each object was captured at 10fps for 20 seconds from a fixed camera location at 2.15m height from the ground, with a 25° declination angle. All video was captured with even illumination. The objects were handheld and moved at a constant walking pace (approximately 0.75m/s) in a 15m path shaped like a 'P' from first entry through a door 5m from the camera. The object moved around the loop of the 'P' towards the camera (the tip is 2m from the camera), then returned to the door (see Figure 5.1 for example images).

Figure 5.1 Video Test library images of chemical container (top) and cereal box (bottom).

The test videos include challenging detection conditions such as scaling, rotation around the object's vertical axis and motion blur. Distractions were also present (from other objects or areas of the scene with similar appearances) and the limited camera field of view caused partial occlusion in some frames; hence the videos reflect realistic detection scenarios. Video frames with up to 50% of the object occluded were included in the 200 frames analysed. Each of these 200 frames was manually annotated with a 2D bounding box for a ground truth object location.

A Particle Smart-Its device in the object sent data from an IEE FSR152 force sensor wirelessly at 13ms intervals. Data was abstracted to simple object moving and object not moving events using thresholds on the results of operations performed on the raw sensor values, which allows us to abstract away from the performance of individual sensor types. Here we calculate the mean value over a window of 20 samples (260ms). The threshold was set empirically so the object generated continuous "non-moving" events when placed on a surface and "moving" events when mobile.
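A minimal sketch of this abstraction in C++ (an assumption about the exact operation on the windowed mean; the thesis only states that a 20-sample mean and an empirical threshold are used):

#include <cmath>
#include <deque>
#include <numeric>

enum class MovementEvent { Moving, NotMoving };

// Abstract raw force-sensor samples into movement events: keep a sliding
// window of the last 20 samples (roughly 260 ms at 13 ms intervals), compute
// its mean, and report "moving" when the mean changes by more than an
// empirically chosen threshold between updates.
class MovementDetector {
public:
    explicit MovementDetector(double threshold) : threshold_(threshold) {}

    MovementEvent update(int rawSample)
    {
        window_.push_back(rawSample);
        if (window_.size() > 20) window_.pop_front();

        double mean = std::accumulate(window_.begin(), window_.end(), 0.0)
                      / window_.size();
        MovementEvent e = std::fabs(mean - lastMean_) > threshold_
                              ? MovementEvent::Moving
                              : MovementEvent::NotMoving;
        lastMean_ = mean;
        return e;
    }

private:
    std::deque<int> window_;
    double lastMean_ = 0.0;
    double threshold_;
};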
5.2 Cooperative Detection Experiments

We hypothesise that there are several measurable benefits from using movement sensing in combination with vision detection:

H1) Sensing increases detection performance (i.e. sensing reduces the number of misdetections and outliers).
H2) Sensing increases the detection performance of non-discriminative and simple detection methods to the level of the most discriminative and most complex.
H3) Sensing increases detection algorithm speed.
H4) Sensing reduces the time to first object detection from entry to an environment.

To evaluate our approach using multiple detection methods, we also propose a research question:

R1) How do algorithm detection performances change with the individual objects?

5.2.1 Design

To address hypothesis H1 we designed four between-subjects experiments, one for each algorithm. Movement sensing was the independent variable for this experiment set, with two levels: 1) with sensing and 2) without sensing. The dependent variable is the detection performance, which is the percentage of frames where the object is correctly detected.

The second set of experiments addresses hypothesis H2. Here three algorithms, each combined with movement sensing, are compared to the SIFT local feature algorithm without sensing. The local feature algorithm is used as a benchmark.

To address hypothesis H3 we compare the runtime of all four algorithms with sensing to the runtime without sensing. Movement was again the independent variable, with two levels: 1) with sensing and 2) without sensing. The dependent variable is the algorithm runtime.

Finally we conduct a set of experiments to address the last hypothesis (H4). Here the time to first detect the object for all four algorithms with sensing is compared to the time to first detect the object without sensing. Movement sensing was again the independent variable, with two levels: 1) with sensing and 2) without sensing. The dependent variable is the time taken to first detect the object from entry into the environment.

To address the research question R1, we break down the detection results from the first set of experiments, firstly looking at the detection performances of each object, and secondly the detection performance from combinations of cues. Apparatus was the same as in the section 4.4 experiments.

5.2.2 Training

In order to perform the experiments, we first trained all four algorithms. For each object, then for each detection algorithm, an appearance description was trained using the rotation images in the object appearance library (see section 4.3). As the video test library includes rotation around the object's vertical axis we use multiple viewpoints for detection in the experiments. We assume the bottom surface of objects is not visible; consequently, the description was trained with 6 viewpoints from the upper viewing hemisphere, with the object at 3m distance from the camera (the centre of our working range), in rotation intervals of 60° around the object's vertical axis. The object ground truth bounding boxes were used to mask the training images when creating the appearance descriptions of the objects.

5.2.3 Procedure

For all experiments the 4 algorithms were run on the 200 frames of each object video in the video test library using the respective object appearance description for detection. As there were multiple viewpoints, the algorithm is run multiple times in each frame, once for each viewpoint. We assume a correct detection from any viewpoint is a detection of the object. We use the same detection procedure as discussed for the 4 algorithms in section 4.4.2.2. The detection result from the algorithms was a bounding box for colour histogram, shape context and multi-dimensional histograms, or a set of feature correspondences for the local feature algorithm. For the first three algorithms, correct detection was assumed when the detection bounding box had less than 50% overlap error with the ground truth bounding box. For local features correct detection was assumed when a minimum of 8 features were matched to the training image using nearest-neighbour Euclidean distance matching and more than 50% of the feature correspondences were correct. Correct correspondences were established based on a manually annotated ground truth homography transformation between the test image and the training image of the object.
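The two correctness tests above can be captured in a pair of small helper functions; the following C++/OpenCV sketch is illustrative only (axis-aligned boxes are an assumption for brevity, and the helper names are hypothetical):

#include <opencv2/opencv.hpp>

// Bounding-box test: correct when the overlap error with ground truth is below 50%.
bool bboxDetectionCorrect(const cv::Rect& detected, const cv::Rect& groundTruth)
{
    double inter = (detected & groundTruth).area();
    double uni   = detected.area() + groundTruth.area() - inter;
    if (uni <= 0.0) return false;
    double overlapError = 1.0 - inter / uni;
    return overlapError < 0.5;
}

// Local-feature test: at least 8 matches, and more than half of them correct.
bool siftDetectionCorrect(int totalMatches, int correctMatches)
{
    return totalMatches >= 8 && correctMatches * 2 > totalMatches;
}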
Detection algorithm processing time was measured over all 200 frames in the test video using timers with millisecond resolution, and the mean time per frame calculated.

The time taken to first detect the object was measured by counting the number of frames before the first detection occurs when the object is visible on entry into the environment. These results were converted to time based on the fixed 10fps frame rate we used for the video, by dividing the number of frames by 10. The times were mean averaged for each detection method for the two cases: with sensing and without sensing. Objects with no detections in either or both of the two cases are excluded from the analysis of the respective detection method.

For the multi-cue detection performance, a successful detection was recorded whenever any one of the combination of cues detected the object successfully. This is equivalent to a simple logical OR cue combination.

5.2.4 Results

5.2.4.1 Hypotheses H1 and H2

Figure 5.2 shows the detection performance of four algorithms: SIFT local features, Mag-Lap Texture, Lab Colour and Shape Context. The results presented are mean averaged first for each object, then for each algorithm. It is clearly visible that for all algorithms the use of movement sensing increases detection performance over no sensing. These results support hypothesis H1.

Figure 5.2 Graph of detection algorithm results without sensing (orange) and with movement sensing (blue), averaged over all objects. Error bars show the 95% Confidence Interval (CI) of the mean.

Without sensing, the Local Feature algorithm has the highest performance (orange bar on the left) of all algorithms, when averaged over all objects in the test video library. On average, the use of sensing (M=41.07, SE=10.82) gives a statistically significant improvement in detection results over no sensing (M=28.42, SE=9.13) for all objects, with the Local Feature (SIFT) detection algorithm, t(9)=-3.75, p<.05, r=.78.

The Texture algorithm shows the greatest increase in detection performance when using movement sensing. The use of sensing (M=57.63, SE=7.71) gives a statistically significant improvement in detection results over no sensing (M=2.98, SE=1.50) for all objects, with the Texture detection algorithm, t(9)=-7.24, p<.001, r=.92.

The colour algorithm has the highest detection performance of all algorithms when used with sensing. The use of sensing (M=63.22, SE=5.92) gives a statistically significant improvement in detection results over no sensing (M=21.82, SE=9.30) for all objects, with the Lab colour detection algorithm, t(9)=-6.02, p<.001, r=.89.

While the performance of the shape algorithm increases with movement sensing, this algorithm has the lowest mean performance when used with sensing. On average, the use of sensing (M=33.03, SE=5.69) gives a statistically significant improvement in detection results over no sensing (M=5.52, SE=1.52) for all objects, with the Shape Context detection algorithm, t(9)=-5.93, p<.001, r=.89.
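For reference, the effect size r reported alongside each t-test above follows the standard conversion from the t statistic and its degrees of freedom (a textbook formula, stated here for clarity rather than taken from the thesis):

r = \sqrt{\frac{t^2}{t^2 + df}},
\qquad \text{e.g. } t(9) = -3.75 \;\Rightarrow\;
r = \sqrt{\frac{3.75^2}{3.75^2 + 9}} \approx 0.78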
When using movement sensing, all algorithm performances are higher than the performance of local features without sensing. This supports hypothesis H2.

5.2.4.2 Hypothesis H3

Table 5.1 Mean algorithm runtime per frame, averaged over 200 frames

Method            No Sensing (ms)   Movement Sensing (ms)
Local Features    641               605
Texture           648               447
Colour            307               307
Shape             852               288

Table 5.1 shows the mean time per detection for each of the four algorithms, with and without the use of movement sensing. It is clearly visible that for three algorithms (Local Features, Texture and Shape) the use of sensing reduces the mean detection time per frame, supporting our hypothesis H3.

5.2.4.3 Hypothesis H4

Table 5.2 shows the mean time to object detection from first entry to the environment for each of the four algorithms, with and without the use of movement sensing. It is clearly visible that for all four algorithms the use of sensing reduces the mean detection time, supporting our hypothesis H4. For all algorithms and all objects, the mean average time before first detection when using sensing is 0.67s, compared to 3.97s without sensing. The use of sensing reduced both the inter-object and the inter-algorithm variability in detection time, as can be seen by the consistently lower standard deviation scores with sensing. Overall, the Colour algorithm showed the greatest change, with a mean average of 6.78s over all objects before first detection without sensing, but only 0.15s with the use of movement sensing. The lowest change was for the Local Features algorithm, with an average of 1.35s reduction, from 2.36s to 1.01s with sensing. The highest variation in detection times between objects was seen when detecting objects with the Shape algorithm without sensing (SD=4.39); however, shape shows the largest decrease in standard deviation with sensing (to SD=0.78).

Table 5.2 Mean time to detection from first entry to the environment

Method            Mean Time, No Sensing (s)   Std Dev, No Sensing   Mean Time, Sensing (s)   Std Dev, Sensing
Local Features    2.36                        2.33                  1.01                     1.73
Texture           2.30                        1.48                  0.93                     0.10
Colour            6.78                        1.45                  0.15                     0.10
Shape             4.44                        4.39                  0.58                     0.78
Mean (s)          3.97                                              0.67
Std Dev           2.12                                              0.39

5.2.4.4 Research Question R1

The following Figure 5.3 and Figure 5.4 show the detection results with the four detection algorithms for each object in the test video library.

Figure 5.3 (left) SIFT Local Feature Detection Performance with and without sensing for each object in test library, (right) Mag-Lap Texture Detection Performance with and without sensing for each object in test library

Figure 5.4 (left) Lab Colour Detection Performance with and without sensing for each object in test library, (right) Shape Context Detection Performance with and without sensing for each object in test library

As can be seen in the local features performance in Figure 5.3 (left), the book, notepad and cereal box objects are detected very well, with or without sensing.
However, some objects such as the chair, mug or toaster are detected in few video frames. Although the average texture detection performance shown in Figure 5.3 (right) is the lowest of all algorithms without sensing, all objects except the mug achieve performance around 50% or greater when using movement sensing. When detecting the notepad object, texture also produces the highest detection result for any object and any algorithm using movement sensing (detected in 94.44% of all video frames).

As seen in Figure 5.4 (left), the colour detection performance again varies greatly by object without sensing, similar to local features. Here, the barrel, book, chair and cereal objects are detected well without sensing. However, all objects achieve detection performance higher than 25% with sensing, and this is reflected in colour achieving the highest mean average detection result of all the algorithms. The mug in particular has its highest detection performance here, when detected with the use of movement sensing (42.42% of all video frames).

The shape detection performance shown in Figure 5.4 (right) without sensing is higher on average than texture; however, the average with sensing is the lowest of all algorithms. Only the notepad object achieves detection performance higher than 50% when using sensing.

5.2.4.5 Multiple-Cue Combination

The results from section 5.2.4.4 are used in Table 5.3 and Table 5.4 to rank each detection cue by its detection performance for each object, without and with movement sensing respectively. The Local Features algorithm proved to be the best single cue for 70% of the objects without sensing, while colour is the best single cue for 70% of the objects with movement sensing. However, more importantly, each of the cues we studied was the best algorithm for at least one of the objects, indicating that there is potential for further improvement in overall detection performance by using multiple cues.

Using the cue ranking from Table 5.3 and Table 5.4, we illustrate the detection performance improvement attained over a single cue when using multiple cues to detect objects in cluttered scenes, such as the video test library (see section 5.1). The overall detection performance achieved by multi-cue combination, together with the breakdown of cue contribution to this performance figure (shown as column colour), can be seen in Figure 5.5 and Figure 5.6. It is clearly visible that by using a combination of the highest performing cue (lowest colour of the column) together with other cues, the overall detection performance improves for all objects in the study, with or without movement sensing.

Over all objects without sensing, the mean average performance increases from 37.89% with just the highest performing single cue to 43.47% with all 4 cues; using the two best cues accounts for a mean improvement of 4.96%. The largest performance increase is for the book, which improves from 71.52% to 86.71% (a 15.19% increase in the number of frames detected). The smallest increase is for the Card object (0.62%). The most frequent combination of two cues was Local Features and Shape, ranked as the best two for 50% of all objects. For objects with movement sensing, the mean detection performance increase was larger (15.38%), with the ball object benefiting from the largest increase (of 26.77%) to 77.78% of frames detected. The smallest improvement was for the Card object (8.13%).
The most frequent combination of two cues was Colour and Texture, ranked as the best two for 40% of all objects.

Table 5.3 Ranking of Cues by Detection Results for Each Object, No Movement Sensing

Object    Best Cue         Result (%)   2nd Best Cue     Result (%)   3rd Best Cue     Result (%)   Worst Cue        Result (%)
Ball      Shape            11.11        Local Features   5.55         Texture          3.70         Colour           0.00
Barrel    Colour           28.28        Local Features   4.76         Shape            1.19         Texture          0.59
Book      Local Features   71.52        Colour           67.17        Shape            3.80         Texture          0.00
Box       Local Features   31.74        Shape            2.99         Colour           0.00         Texture          0.00
Card      Local Features   53.13        Texture          14.38        Shape            12.50        Colour           0.00
Cereal    Local Features   61.35        Colour           57.06        Shape            2.45         Texture          2.45
Chair     Colour           60.60        Shape            10.76        Texture          0.63         Local Features   0.00
Mug       Local Features   1.27         Shape            0.63         Colour           0.00         Texture          0.00
Notepad   Local Features   53.09        Shape            9.26         Texture          8.02         Colour           0.00
Toaster   Local Features   1.79         Shape            0.60         Colour           0.00         Texture          0.00

Table 5.4 Ranking of Cues by Detection Results for Each Object, with Movement Sensing

Object    Best Cue         Result (%)   2nd Best Cue     Result (%)   3rd Best Cue     Result (%)   Worst Cue        Result (%)
Ball      Colour           51.01        Shape            47.53        Texture          44.44        Local Features   12.96
Barrel    Colour           81.31        Texture          75.60        Local Features   35.71        Shape            27.38
Book      Local Features   82.91        Colour           72.22        Texture          48.43        Shape            38.61
Box       Colour           63.63        Texture          52.69        Local Features   40.12        Shape            25.15
Card      Local Features   62.50        Texture          50.63        Shape            46.25        Colour           25.75
Cereal    Colour           85.27        Local Features   81.60        Texture          78.53        Shape            37.42
Chair     Colour           77.78        Texture          65.19        Shape            33.54        Local Features   0.63
Mug       Colour           42.42        Shape            6.33         Local Features   4.43         Texture          4.43
Notepad   Texture          94.44        Local Features   82.72        Shape            64.20        Colour           61.61
Toaster   Colour           71.21        Texture          61.90        Local Features   7.14         Shape            6.55

Figure 5.5 Detection Performance improvement using multiple cues, No Sensing

Figure 5.6 Detection Performance improvement using multiple cues and Movement Sensing

Using a combination of the three best cues produces much less additional benefit. For objects without sensing a third cue only produced a mean increase of 0.62% over the use of the two best cues, with the largest increase being an additional 2.47% of detection performance for the ball. Objects with movement sensing again benefit more; however, the increase is much less than the improvement gained over a single cue when using the two best cues. The mean improvement is 2.10%, with the ball again benefiting the most with an 8.03% increase in performance. The Barrel, Book and Chair did not increase in performance by adding a third cue. The addition of a fourth cue did not improve detection performance in any object, with or without movement sensing.

5.3 Discussion

This section discusses issues arising from the Cooperative Detection experiments in section 5.2.

5.3.1 Single Cue Detection

5.3.1.1 Hypothesis H1

For hypothesis H1 we compared the means of detection performances. The results indicate that performance significantly increases in all algorithms when detecting smart objects with basic movement sensing. Consequently, in our experiments the use of unobtrusively embedded sensing makes the detection process more robust. We believe this performance increase can be generalised to other detection algorithms, as it was observed for all four cues of the object's appearance.
Consequently, we suggest that smart objects should try to include at least one type of movement sensor. A range of basic movement sensors are available, such as accelerometers, gyroscopes, ball switches, tilt switches, and force sensors.

5.3.1.2 Hypothesis H2

To address hypothesis H2, the detection performance of less complex algorithms (texture, colour and shape) with sensing was compared to that of local features without sensing. The results suggest that less complex detection algorithms used with sensing achieve better performance than local features without sensing. Consequently, smart objects with movement sensors can use a less complex method to be detected with no loss in performance. Less complex detection methods also tend to run faster, as seen in Table 5.1.

5.3.1.3 Hypothesis H3

To address hypothesis H3 we compared the runtime of the four detection algorithms with sensing against the runtime without sensing. While the absolute runtimes are a function of the algorithm implementation, CPU speed and image size being processed, we showed that three algorithms (texture, shape, local features) increased measurably in speed with the use of context information from movement sensing. The movement information acts as a mask that constrains detection, enabling areas without movement to be discarded from processing and hence increasing overall processing speed. The Lab colour histogram implementation did not increase in speed as the current implementation of the method only uses the movement sensing events to mask the result of the colour detection step into moving and non-moving areas. This mask operation is performed in under 1ms. Reimplementation of the colour detection algorithm would allow the movement mask to be taken into account. Algorithm speed could be increased further for all algorithms by using information from previous frames to predict an object's location in the current frame, or by using guided matching where location hypotheses from different detection methods are fused (both methods constrain the search space with a region of interest).

5.3.1.4 Hypothesis H4

To address hypothesis H4 we compared the time to first detect an object following entry to the environment for each of the four algorithms with sensing, against the time to detection without sensing. The results shown in Table 5.2 indicate that sensing provides additional benefit by enabling objects to be detected faster than without sensing. For example, the 0.15s time for Colour detection and the low standard deviation of 0.10 when using sensing indicate that objects are typically detected within the first few frames of becoming visible to the camera (each frame is captured at 0.10s intervals).

The reduction in detection times is likely due to two reasons. The first is that the use of movement sensing constrains the detection process, allowing a faster overall detection. However, if we compare the results against the detection performance results in Figure 5.2, we infer that a secondary reason for the reduction in detection time seen in Table 5.2 is the absolute overall increase in detection performance with sensing. Here, the objects are detected in more frames overall and this effect occurs in all the videos; hence this will contribute to an average reduction in detection time.

5.3.1.5 Research Question R1

We posed research question R1 to look at how our object dataset (shown in Figure 4.1) was detected with different algorithms.
The test videos include challenging detection conditions (scaling, rotation, distractions, motion blur and partial occlusion), hence the results reflect expected real-world performance. Location prediction between frames with Kalman or Particle filters is expected to improve all results further.

In the local features performance, the chair, mug and toaster have few detections because of their appearance. The chair has plenty of corner features; however, many look identical due to its geometrically repetitive structure. Similarly, objects with repeating patterns on their surface, such as the ball, also exhibit high numbers of mismatches. Local features use the area around each detected corner for calculating a descriptor. Consequently, the structure of the chair again causes poor matching results, as features detected at object boundaries include a lot of background. Plain objects such as the toaster and small objects such as the mug only have a small number of features, so are difficult to detect reliably.

For texture and shape, all objects had a poor performance when used without sensing due to the cluttered environment in the video test library. Here other objects and surfaces with different appearances but similar amounts of texture or similar shapes distract the algorithms. Sensing constrains the search area, causing less distraction and higher performance.

The colour algorithm performs best with uniform and brightly coloured object appearances. The low detection rate of the barrel without sensing (28.28% of frames) is due to other chemical containers in the background distracting the algorithm. Movement sensing masks the distraction and increases detection to 81.31% of the video frames.

Table 5.5 Guidance for which method to use, based on object appearance type

Object Appearance Type                              Best Method                  Worst Method
Small Objects                                       Colour                       Texture / Features
Large Objects                                       Features / Texture / Shape   -
Plain Objects or Objects with Repeating Pattern     Colour / Shape               Texture / Features
Non-Repeating Pattern or Textured Objects           Features / Texture           Colour / Shape
Saturated Colourful Objects                         Colour                       -
Muted Colour or Black and White Objects             Features / Texture / Shape   Colour
Simple Geometry                                     Depends on appearance.       -
Complex Geometry or Geometry with Holes             Colour / Texture / Shape     Features

Colour with sensing was the only algorithm to detect the white mug in more than 40% of the video frames. Movement sensing masked the static white wall that was responsible for the distraction. The small size and plain appearance of the mug made it a particularly challenging object, almost never detected with other algorithms.

These results underline the fact that a single algorithm cannot reliably detect all objects, and validate our approach of using multiple cues and method selection. In practice the results suggest when certain algorithms should be used to achieve the best detection performance. This knowledge can be incorporated into our method selection step; however, the suggestions do not take into account external factors such as the object's context (for example, its background).

5.3.2 Multi-Cue Combination

The performance benefits from multi-cue detection are shown clearly by Figure 5.5 and Figure 5.6. As the cue rankings changed with objects and performance benefits are seen when using multiple-cue detection, we infer that the four cues we use in detection are complementary.
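A minimal sketch of the logical-OR cue combination used for the multi-cue results (section 5.2.3); the function name is hypothetical:

#include <vector>

// The object counts as detected in a frame if any of the selected cues
// detected it, i.e. a simple logical OR over the per-cue results.
bool multiCueDetected(const std::vector<bool>& cueResults)
{
    for (bool detected : cueResults)
        if (detected) return true;   // the first successful cue is enough
    return false;
}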
When using all 4 cues without movement sensing the objects are on average only detected in 43.47% of all the video frames, which is a very low detection performance for a system which aims to be deployed in a real-world environment. These detection results are also significantly lower than the results reported for the algorithms in [Swain and Ballard 1991; Schiele and Crowley 2000; Belongie, Malik et al. 2002; Lowe 2004]; however, the reported results were typically for objects on plain backgrounds or with dissimilar distracting objects. The use of movement sensing with multiple cues in our approach makes the detection robust to the clutter and distraction seen in "realistic" environments. Over all objects the mean average detection performance is then increased to 88.72% of video frames, when using all 4 cues.

All objects receive an increase in detection performance by using the two best cues for detection, by a mean average of 15.38% when used with movement sensing. However, increasing the number of cues considered beyond two gives little additional benefit (2.10% for 3 cues) and the addition of a fourth cue did not improve detection performance in any object.

The multi-cue results presented are a best-case scenario, where appearance knowledge is available for all cues and the ranking of cues is known a priori. We may not initially know the ranking of cues in new objects to achieve the best detection performance; however, in practice the detection system can maintain detection metrics for each object and method combination. This valuable knowledge can be re-embedded into the smart object so it is not lost when the object leaves the environment.

Unless multiple cues can be implemented to run in parallel, multi-cue detection is always a trade-off. Each extra detection algorithm causes a corresponding increase in total run-time per camera frame. Consequently, the detection performance increase must be balanced against the processing required in the projector-camera system. Table 5.1 shows that algorithm runtime can be decreased by the use of movement sensing for 3 of the 4 algorithms, and comparing Figure 5.5 and Figure 5.6 we see a larger effect from multi-cue detection when used with sensing. This suggests that when a smart object contains a movement sensor and has appearance knowledge for a minimum of two cues, the detection system has the potential to run both algorithms and combine the results with little penalty in terms of runtime, to achieve a further performance increase over sensing alone.

Sequential combination of cues also allows location hypotheses from the first cue to be used when processing the second. For example, if the colour detection method returns two possible locations when searching for a blue chemical container, we can choose to constrain our processing of the second cue to just these image areas, giving both a reduction in runtime and an increased chance of detection with the second cue. However, detection using sequential cue combination is more liable to exhibit catastrophic failure, due to detection failures in early cue methods. Hence democratic integration (parallel combination) could be used for robustness and sequential combination for accuracy or speed.
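As an illustration of the sequential combination just described, the following C++ sketch (with a hypothetical Cue interface, not the framework's actual API) restricts the second cue to the regions hypothesised by the first:

#include <vector>
#include <opencv2/opencv.hpp>

// Hypothetical cue interface: returns candidate object locations in an image.
struct Cue {
    virtual std::vector<cv::Rect> detect(const cv::Mat& image) = 0;
    virtual ~Cue() = default;
};

std::vector<cv::Rect> sequentialDetect(const cv::Mat& frame, Cue& firstCue, Cue& secondCue)
{
    std::vector<cv::Rect> confirmed;
    for (const cv::Rect& roi : firstCue.detect(frame)) {
        // Restrict the second (typically more expensive) cue to the hypothesis region.
        std::vector<cv::Rect> local = secondCue.detect(frame(roi));
        for (cv::Rect r : local) {
            r.x += roi.x;
            r.y += roi.y;              // map back to full-frame coordinates
            confirmed.push_back(r);
        }
    }
    return confirmed;
}

A parallel (democratic) scheme would instead run every cue over the whole frame and fuse the resulting hypotheses, trading speed for robustness as discussed above.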
To detect static objects, the only way we can use movement sensing is to exclude moving areas in the image from the detection process. While this bears some similarity to the scenario of a moving object against a static background, the detection performance will lie somewhere between the performance without sensing and the performance with sensing, depending on the size of the moving area in the image. In this case, greater knowledge of the object's context from other sensors may help us detect the object, for example, using an embedded electronic compass or any other available location system. If the object has external light sensors, the structured light approaches taken by [Lee, Hudson et al. 2005; Summet and Sukthankar 2005; Raskar, Nii et al. 2007] would also be possible.

5.3.4 Use of Sensing in smart objects for Pose Calculation

Following the initial detection process, the location and orientation (the pose) of a smart object must be calculated before a projection can be established. The space that the Projector-Camera system and smart objects exist in is modelled as a 3D coordinate system with a single world origin. Local coordinate systems exist for all projectors, cameras and objects. For example, objects can be modelled with the centre of the object as their local origin and their front surface parallel to the X,Y plane; however, with known transformations we can convert between any arbitrary coordinate systems. Pose calculation aims to recover all 6 degrees of freedom of the object to camera coordinate system transformation (see Figure 4.4, section 4.4). This comprises the location of the object origin in the camera coordinate system (tx, ty, tz) and the rotation of the object local coordinate system relative to the camera X, Y, Z axes (rx, ry, rz). If a new object enters the environment it can be in an arbitrary location and orientation. For example, a cubic object carried into the environment in the palm of a user's hand could have any of its 6 surfaces facing upwards. The location (tx, ty, tz) can be estimated both from the centre of detection in the camera coordinate system and from the apparent size of the detection area combined with the known surface size from the 3D model stored in the smart object. We model the orientation transformation as P, where rx, ry, rz ∈ [0, 2π). In order to determine P we have to search in ℝ³, a search space of [0, 2π) × [0, 2π) × [0, 2π) possibilities. However, if the object contains a 3D accelerometer sensor, we can make use of the sensing ability to help us recover the object pose. 3D accelerometers can be easily calibrated to measure their orientation relative to gravity using methods such as that proposed by Krohn et al. [Krohn, Beigl et al. 2005]. The smart object can also use machine learning methods and be trained to detect its orientation using the method proposed by Van Laerhoeven et al. for smart cubes, with little cost (20 extra multiplication and 5 addition operations per 3D accelerometer reading) [VanLaerhoven, Villar et al. 2003]. We assume the camera orientation relative to gravity is known, either from a pan and tilt unit with known mounting orientation, or by using an inclinometer sensor, as proposed for handheld projectors by Raskar et al. [Raskar, Beardsley et al. 2006]. These methods therefore allow 2 of the 3 rotation parameters of the vector P to be calculated directly from the accelerometers. The third rotation (rotation around the gravity vector) cannot be detected with accelerometers.
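As an illustration of how the two gravity-referenced rotation parameters could be obtained, the sketch below computes roll and pitch from a calibrated 3D accelerometer reading using the standard gravity decomposition. The structure and function names are illustrative assumptions rather than part of the thesis implementation, and the reading is assumed to be taken while the object is static or moving slowly, so that the measured acceleration is dominated by gravity.

#include <cmath>

// Orientation of a (near-)static object relative to gravity, estimated from a
// calibrated 3D accelerometer reading (ax, ay, az) in units of g.
// Roll and pitch are recoverable; rotation about the gravity vector is not.
struct GravityOrientation {
    double roll;   // rotation about the object's local X axis (radians)
    double pitch;  // rotation about the object's local Y axis (radians)
};

GravityOrientation orientationFromGravity(double ax, double ay, double az)
{
    GravityOrientation o;
    o.roll  = std::atan2(ay, az);                          // tilt left/right
    o.pitch = std::atan2(-ax, std::sqrt(ay*ay + az*az));   // tilt forward/back
    return o;
}

The remaining rotation about the gravity vector must still be recovered visually.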
Using this knowledge we reduce our search space from ℝ³ to ℝ, as we only need to recover 1 unknown rotation in P, a search space of just [0, 2π). If an object contains electronic compasses or gyroscopes, attempts could also be made to detect the third rotation axis. However, a minimum of 3 orthogonal compasses would be required, as orientation errors increase significantly with any tilt.

5.4 Conclusion

In this chapter we conducted an experimental study to explore cooperative detection between the visual detection system and smart object. We found that detection performance increases when using basic movement sensing with all four detection algorithms. Movement sensing constrains the search space, reducing distractions from the cluttered real-world environment. The improvement was observed for all algorithms, suggesting this can be generalised to other detection algorithms. This is a significant result for our framework, as it means any objects with embedded movement sensors can achieve more reliable visual detection by cooperating with the projector-camera systems and sharing their sensor information. In addition, we found that with the use of movement sensing, simple algorithms achieve detection performance similar to or better than complex detection algorithms, and that the runtime of 3 of the 4 detection algorithms in the study was reduced. This has important implications for the projector-camera systems in our framework, as it means overall detection performance can be maintained or improved while either reducing the amount of processing power required for detection or increasing detection speed. This allows more flexibility in object detection. For example, when objects share their sensing information we can increase the chances of detecting and tracking a moving object by using the fastest algorithms. Similarly, we can either increase robustness in detection or detect more than one object simultaneously by running multiple detection methods in parallel, with no loss in detection speed. None of the algorithms had a high detection performance for all objects in our dataset. This is another significant result for our framework, as it underlines the fact that a single cue cannot reliably detect all objects. Consequently, this indicates it is worthwhile supporting multiple detection methods and validates the approach we propose in the Cooperative Augmentation framework. In this study the most challenging objects to detect were small, plain objects, which were best detected with colour and movement sensing. The objects detected best were highly textured (with a non-repeating pattern or unique features) or had uniform saturated colour. Multi-cue detection was demonstrated to improve detection performance for all objects when using the best 2 cues; however, little improvement was seen when using more than 2. This effect is again due to the variable natural appearance of objects. Improvements in detection performance were most pronounced when using movement sensing. The time saved in detection when using movement sensing can be traded off against processing of additional cues, enabling a projector-camera system in our framework to improve detection performance while maintaining the original single cue detection speed. In conclusion, the results presented in this chapter indicate that the combination of different sensing modalities is important in detection.
Smart objects that cooperate with projector-camera systems and share their embedded sensing information can benefit directly from faster or more robust detection and pose calculation. These findings are used to inform the design of a system architecture for Cooperative Augmentation in Chapter 6.

Chapter 6 System Architecture

This chapter describes a system architecture which serves to validate the feasibility of the Cooperative Augmentation approach proposed in Chapter 3. The design of the architecture integrates all components of the Cooperative Augmentation framework and is directly informed by the results from the visual detection experiments in Chapter 4 and Chapter 5. This enables the architecture to serve as a platform for the demonstration applications presented in Chapter 7. We first discuss the design itself, in terms of the components which comprise the architecture, followed by the implementation details, and then discuss the issues which arise in practice when implementing the design with a set of smart objects and steerable projector-camera systems. Finally, three series of experiments are presented to characterise the performance of the architecture in terms of the detection method memory requirements in the smart object, the accuracy of pose calculation, the magnitude of jitter in pose calculation and a combined system evaluation of the overall projection accuracy on the smart objects.

6.1 Architecture Design Overview

The architecture is designed to enable detection, tracking and interactive projection on a smart object's surfaces. As discussed in section 3.3, the framework aims to enable serendipitous use of any projector and camera hardware distributed in the environment in the augmentation process. Hence, the architecture is designed to support multiple projector and camera systems by using a distributed client-server design. To achieve this distributed design we assume our system has four aspects of knowledge about the physical hardware:
1. Location and orientation of all projectors and cameras, as discussed in section 3.3.
2. Calibration information about the intrinsic optical characteristics of the cameras and projectors, as discussed in sections 6.2.1 and 6.2.2.
3. Calibration of any pan and tilt unit used in a steerable projector-camera system, as discussed in section 6.2.3.
4. Knowledge of the pairing of distributed projector and camera hardware in the environment to form projector-camera systems.
The Cooperative Augmentation conceptual framework maps to a physical structure of mobile smart objects and projector-camera systems in a ubiquitous computing environment, as seen in Figure 6.1. A discovery mechanism allows smart objects to initially discover services offered by projector-camera systems, and register for the services. We assume the objects use an existing service discovery mechanism, so this component will not be discussed further in this thesis.

Figure 6.1 Distributed System Architecture Overview

A detection and tracking service exists for each camera in the environment. The service is composed of the two core components, Detection and Knowledge Updating.
If a camera is attached to pan and tilt hardware, Pan and Tilt Tracking forms the third optional component of the service. A projection service exists for each projector in the environment. The service is composed of the core projection component and a second optional Pan and Tilt Tracking component if the projector is attached to pan and tilt hardware. The detection and projection services may share a Pan and Tilt Tracking component if they are part of the same physical steerable projector-camera system. Physical devices such as smart objects are represented by proxy components in the system architecture. We assume proxies, services and applications in our architecture send and receive messages in a common message format, using a common protocol on the network. This design ensures that the system architecture can be distributed effectively while abstracting from the hardware and allowing proxy components to be easily exchanged. The Database Server component is a single world model maintained on the network, supporting services and applications on top of its model. The world model contains knowledge of all projectors, cameras and smart objects in the “environment”. In the real-world this “environment” maps to an area of physical space visible to the projectorcamera systems contained within it, which supports projected displays on smart objects. The Object Proxy component is responsible for maintaining coordination of state between a physical smart object and the cached knowledge in the Database Server component. 93 6.1 ARCHITECTURE DESIGN OVERVIEW 6.1.1 Architecture Novelty The architecture components themselves are based on several well-established concepts discussed in related work Chapter 2, such as the natural appearance detection and pose calculation algorithms used in the detection component, or the geometric, photometric and colorimetric correction algorithms in the projection component. Additionally, our architecture incorporates the six novel aspects described below. The first is the embedding of the knowledge required to achieve a display in to the object. This removes the need to store information about all possible objects which can enter the environment, or manually update the system whenever a new object appears. This avoids creation of large databases of objects which must be searched in detection, reducing the possibility of misdetection. Instead, the initial smart object registration step makes the object detection process simpler, as the system knows exactly which objects are present in the environment and hence, what to search for. The second is that abstraction enables flexibility in our architecture. By abstracting the detection and projection process to services in the environment we enable use by any type of smart object and any projector or camera hardware. Similarly, by abstracting object movement sensors to generic moving or non-moving events we enable use of any type of sensor able to detect movement. For example, accelerometers, gyros, ball switches, tilt switches or force sensors which detect pick-up and put-down events. Here we only care about detecting basic motion, allowing us to abstract away from the specific performance or calibration requirements of individual sensors. The third is that our system is flexible and adaptable to the knowledge contained in the object due to a dynamic tailoring process. 
This process occurs in two situations: firstly, in dynamic multi-cue detection algorithm selection based on knowledge embodied in the smart object, its sensing capabilities, the object context and its current detection status; secondly, in dynamic projector geometry correction algorithm selection based on knowledge of object geometry embodied in the smart object. The fourth is that objects monitor their own appearance and form via embedded sensing and update the projector-camera system when these change. For example, an articulated object such as a book changes both appearance and form when opened, but with embedded sensing it can detect the opening and automatically update the projector-camera system with a new appearance description and 3D model geometry; hence, tracking continues uninterrupted. The fifth is a dynamic projector and camera pairing method to support multiple projectors and cameras distributed in the environment. This process allows serendipitous cooperation between projector and camera services in our architecture to achieve the best display possible on smart objects in the environment. The final aspect is that over time camera systems automatically extract more appearance knowledge about objects and re-embed this into the smart object. Even if the object is already detected reliably with one detection method, extracting more knowledge is beneficial as the environment can also change. For example, distracting objects with similar appearances are introduced, the wall is painted a different colour, or the object is simply taken to another room. Similarly, new detection algorithms can be easily deployed in the projector-camera system, as the object appearance knowledge will be automatically updated with the new appearance information.

6.2 Architecture Components

The components of the architecture are described in more detail in the following sections.

6.2.1 Detection and Tracking

The design of the detection system architecture is informed directly by the results presented in Chapter 4 and Chapter 5. We design a multi-cue detection system with different detection methods (shown in Table 6.1), corresponding to the different appearance cues of the object (colour, texture, shape and local features).

Table 6.1 Appearance knowledge levels and detection methods with associated processing cost

Appearance Knowledge | Detection Method                                           | Discriminative Power | Cost in Time
Colour               | Colour histogram comparison                                | Low                  | Low
Texture              | Multidimensional Receptive Field Histograms                | Medium               | Medium
Shape                | Contour detection and Shape Context                        | Medium               | Medium
Local Features       | Interest point detection and feature descriptor comparison | High                 | High

These methods form a flexible layered detection process that allows an object to enter the environment with different levels of appearance knowledge. As we descend the table, the power of the detection methods to discriminate between objects with similar appearances increases, however at the cost of increased processing time (due to increasing algorithm complexity). We consider more discriminative methods to hold more knowledge about the object.

6.2.1.1 Multi-Cue Detection Method Selection

The detection method selection step forms a novel part of the visual detection pipeline shown in Figure 6.2. Here, following each camera frame acquisition, the method selection step is performed based directly on the appearance knowledge embedded in the object, its embedded sensing capability and its visual context obtained from the background model.
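To make the selection step concrete, the sketch below shows one possible way the per-cue metadata implied by Table 6.1 and the detection metrics described below could drive a simple selection rule: fast cues while the object reports movement, high detection rates while it is static. The type and field names are illustrative assumptions, not the actual implementation.

#include <vector>
#include <algorithm>

// Illustrative per-cue detection metrics of the kind the selection step could
// maintain for each smart object (names are hypothetical).
enum class Cue { Colour, Texture, Shape, LocalFeatures };

struct CueMetrics {
    Cue    cue;
    bool   knowledgeAvailable;  // does the Object Model hold this appearance cue?
    double meanRuntimeMs;       // mean average algorithm runtime per frame
    double detectionRate;       // fraction of frames detected while visible
};

// Prefer fast cues while the object reports movement, and cues with the
// highest detection rate while it is static.
Cue selectCue(std::vector<CueMetrics> metrics, bool objectMoving)
{
    metrics.erase(std::remove_if(metrics.begin(), metrics.end(),
                      [](const CueMetrics& m) { return !m.knowledgeAvailable; }),
                  metrics.end());

    if (objectMoving)
        std::sort(metrics.begin(), metrics.end(),
                  [](const CueMetrics& a, const CueMetrics& b)
                  { return a.meanRuntimeMs < b.meanRuntimeMs; });
    else
        std::sort(metrics.begin(), metrics.end(),
                  [](const CueMetrics& a, const CueMetrics& b)
                  { return a.detectionRate > b.detectionRate; });

    return metrics.front().cue;  // assumes at least one cue is available
}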
If an object holds only knowledge of a single cue, the respective natural appearance algorithm is automatically executed. Additional appearance knowledge can subsequently be extracted by the knowledge updating component once the object is detected and its pose calculated, as discussed in section 6.2.7. Where an object holds appearance knowledge of more than one cue, the design allows either a single cue to be used, or multiple cues to be fused for improved detection performance (as demonstrated in section 5.2). However, detection method selection is always a trade-off, as both processing multiple methods sequentially and the more discriminative individual methods (such as local features) incur increased processing requirements.

Figure 6.2 Detection method selection based on smart object knowledge

To inform the architecture of the cues most likely to detect the object we maintain detection metrics, which are updated each time a detection process is executed. We store total and mean average algorithm runtime and detection performance histories. The detection performance metrics are only used following first detection of an object and only updated when the object is potentially visible (i.e. its last location was inside the camera FOV). These cue performance metrics are also re-embedded into each object to allow use of the accumulated knowledge in other environments. When using only a single cue we can choose to prioritise one of three aspects of the detection process:
1. Detection speed, by using the fastest method when objects are moving.
2. Accuracy, by using the method with the highest detection rate when objects are static.
3. Robustness, by using the most discriminative (least abstract) information to increase the probability of detection when distractions are present.
When multiple cues are available we need to decide which cues should be combined. To perform this method selection we rank the detection methods based on suitability for detecting a particular object. The ranking is based on three aspects:
1. When an object is moving we rank algorithms with shorter average runtimes higher, whereas algorithms with higher detection performance are ranked higher for static objects.
2. We take the object's context into account, by looking at the background model and the object's movement sensing capabilities. For example, we can directly compare the object's colour histogram with the colour histogram of the area of the background model currently visible to the camera. If the histograms are similar we would not select the colour method unless the object had movement sensors and was mobile, as the probability of detection is low due to distraction (see the Cooperative Detection experiment in Chapter 5).
3. Objects can store optional knowledge of their detection performance with scale and rotation (as explained in section 4.4). This allows us to select a method which performs best at the current distance and orientation, following first detection of the object.

6.2.1.2 Pose Calculation

If the object is successfully detected a 2D location result is generated.
This can take the form of correspondences between extracted image features and features in the Object Model, or a 2D image region in which the object has been detected. The pose calculation step is subsequently performed to calculate the 3D location and orientation of the object with respect to the camera. The object pose can then be converted into a world coordinate system pose using the known camera location and orientation. This, in turn, allows the projection service to later convert the object world coordinate system pose to the pose of the object with respect to the projector using the known projector location and orientation, as shown in Figure 6.3.

Figure 6.3 The 4 coordinate systems: Camera, Projector and smart object Local Coordinate Systems, and the arbitrary World Coordinate System

The object pose is calculated either directly from matched local feature correspondences, or by fitting the 3D model to edges detected in the 2D image region from the detection step. This step requires both the geometrical 3D model of an object and the camera intrinsic parameter matrix obtained from the camera calibration. If the camera has a computer-controllable powered zoom lens then the system has the ability either to detect many objects in a large area with the wide angle setting, or to zoom in to increase the accuracy of detection for a single object or closely spaced group. The pose calculation algorithm requires the camera calibration to calculate the pose; however, this changes with the focal length when the lens is zoomed. Consequently, we can pre-calibrate the camera at multiple focal lengths, load these calibrations in the architecture, automatically fit a linear function and use this for interpolating the calibration matrix between the calibrated positions, based on the current zoom setting.

6.2.2 Projection

The design of the projection component enables undistorted images to be projected by correcting geometric distortion in the image resulting from projecting on a non-perpendicular or non-planar surface, as discussed in section 2.3.6. We compensate for this distortion by warping our projected image, as we know both the surface geometry of the object and the 3D pose of the surface with respect to the projector. We obtain the surface shape from the geometric 3D model embedded within the Object Model and the pose of the object from the detection process. The surface shape directly configures the geometric correction method used in projection [Bimber and Raskar 2005], as shown in Table 6.2.

Table 6.2 Geometric correction methods for projection based on object geometry [Bimber and Raskar 2005]

Object Geometries      | Correction Method
Planar, Rectilinear    | Planar Homography
Cylindrical, Spherical | Quadric Image Transfer
Irregular              | Discretised Warping

The projection service manages projection requests from the smart object. Content is either downloaded from a location supplied by the object, or is loaded directly from the object for projection onto the object's surfaces. Images of the object's surfaces from the detection service are used to correct the colour of the projection image to make it more visible on coloured and patterned object surfaces, as discussed in section 2.3.7. The projection component requires the optical calibration of the projector to display projections accurately. This is obtained from the methods described in section 6.4. The projector calibration provides the system with three aspects of knowledge: 1. Horizontal and vertical Field Of View (FOV). 2.
The optical Centre of Projection (COP). 3. Native projector resolution in pixels. As discussed in section 2.3.8, in traditional (i.e. non-LASER) projectors an image appears acceptably in focus at the actual focal plane and for a limited distance in front and behind this distance. Outside this distance the projection appears blurred and difficult to read. Consequently, for environments with mobile objects (or objects with large depths) we must dynamically focus on objects to ensure a readable projection. If the projector hardware contains a computer controllable powered focus lens, the focus setting must be calibrated manually at multiple distances, by moving an orthogonal planar surface in steps from the minimum to maximum focus-distance and at each step focussing and recording the focus setting. The architecture can then load this calibration data and fit a non-linear function automatically, to be used for interpolating a dynamic focus based on the distance to a single object, or average of a group of objects. Similarly, we would like maximum resolution projection (from the projector at maximum zoom) on each object, however this restricts both the size of the object and the number of objects that can be projected on, as the field of view of the projector is 98 6.2 ARCHITECTURE COMPONENTS limited. If the projector hardware has a computer controllable powered zoom lens then the system has the ability to trade-off the zoom level (hence projection resolution) with field of view dynamically, so as to encompass the most objects with the highest resolution. To use the zoom, the projector’s optical parameters must be calibrated at multiple zoom settings and these calibrations loaded by the architecture, a linear function automatically fit and used for calculating the zoom setting based on the position of objects in the field of view of the projector requesting a projection. To enable multiple projector-camera systems, we design our projection component using a similar display rights method to Ehnes et al. [Ehnes, Hirota et al. 2005] (see related work, section 2.3.8), which relies on the world model having an overview of all camera, projector and object locations and viewing volumes. Each projector determines the visual quality of its projection based on the distance and relative orientation of the object’s surface to the projector using the metric introduced by Ehnes and Hirose [Ehnes and Hirose 2006]. Closer, more orthogonal projectors score higher, allowing a ranking to be performed and the best projector assigned display rights. This ranking is performed for each surface of an object (as defined in the 3D model) with an active projection, allowing simultaneous projection from multiple projectors onto multiple surfaces of the same 3D object. This method has the benefit of maximising the area of the object which can be projected on, while eliminating the possibility of misregistration from image overlap causing blurred or unreadable projections. 6.2.3 Pan & Tilt Tracking Component The pan and tilt component is solely responsible for controlling steerable hardware, allowing minimal code modification when changing or modifying hardware. We design the component to enable searching for objects using two search methods, and choose between them depending on our knowledge of the object. 6.2.3.1 Search and Tracking Methods The first method is used to detect objects when the system has no prior knowledge of their location. 
In this case we use a creeping line search to rotate the pan and tilt hardware through its whole mechanical field of view. Figure 6.4 (left) shows this search method. The dotted line represents the pan and tilt mechanical Field Of View (FOV). The length and spacing of the movement track lines is based on both the pan and tilt FOV, and the angular FOV of the camera hardware being rotated, represented by the blue box. The track is designed so that some overlap of camera FOV occurs in the longest axis (here the horizontal pan axis), increasing the likelihood of detecting objects located midway between the track legs. However, the amount of overlap is a trade-off, as increasing overlap requires more tilt legs, hence, more time per search. The second search method is used when the system has prior knowledge of an object’s location in the world coordinate system. In this case the hardware is rotated to point at the object’s last known location and an expanding-box search is used from this point, as shown in Figure 6.4 (right). In this case the movement track is solely based on the angular field of view of the hardware being rotated, represented by the blue box. The track is again designed so that some overlap in field of view occurs and for this pattern we overlap in both pan and tilt axes. 99 6.2 ARCHITECTURE COMPONENTS Start Previous Location Tilt Tilt FOV Pan FOV Pan Figure 6.4 Search Methods: (left) Creeping Line Search over the whole Pan and Tilt Field Of View, (right) ExpandingBox search from previous object location Once detected by the detection system, an object can be dynamically tracked. We assume knowledge of the location and orientation of the pan and tilt hardware in the world coordinate system and knowledge of the object location from the detection system. Consequently, object tracking is accomplished by calculating the relative angle of an object to the pan and tilt hardware origin in the pan and tilt axes, then rotating the hardware to minimise these angles. 6.2.3.2 A Concept of System Focus If multiple smart objects exist in an environment it is possible to simultaneously detect and project on all objects inside a projector-camera system field of view. However, if only one projector-camera system exists and one or more objects are mobile this becomes a resource allocation problem. In this case the system must decide which objects to track. To help resolve this problem we introduce a concept of system “focus”, which is similar to the concept of human attention. The system focus dynamically determines which objects to track and can be set either explicitly by the user, by an application aware of multiple objects, or automatically according to a set of rules. An automatic system focus can be designed in many ways, depending on the amount of knowledge available about the user’s context. For example, we theorise that detection of object movement indicates deliberate interaction by a user. Consequently, we can easily design a system which always focuses on moving objects. However, such a naive function would encounter many difficult situations. For example, if a projection is already established on a group of static objects and another unrelated user moves through the camera field of view carrying another smart object, the projection would follow the unrelated user. Another solution to the problem is to make use of multiple projector-camera systems in the environment, as discussed in section 6.2.2. 
The system architecture is designed to allow multiple projectors and cameras to automatically detect and project on any objects within their field of view. This allows easy load-balancing across multiple projectors, so that an object inside the field of view of a projector can still receive a projection (even if it is not the best quality), and helps to overcome the problem of projection starvation, where an object never receives a projection because another mobile object is continually in focus.

6.2.3.3 Pan and Tilt Hardware Control

The pan and tilt tracking component abstracts from direct hardware control. This abstraction layer allows the pan and tilt hardware to be controlled in terms of angles in the world coordinate system, rather than in a hardware-dependent way. As we assume the pan and tilt unit has knowledge of its location and orientation in the world coordinate system, we can calculate the orientation of the pan and tilt hardware using a calibration that maps the hardware-dependent measurement units used to control the pan and tilt orientation to angles in the physical world. We assume the pan and tilt hardware has an open-loop control system, so that there is no feedback of the actual pan and tilt orientation. This requires a simulation of the motion of the hardware to determine the instantaneous location and orientation of the projector or camera attached to it. We simulate the motion of the hardware using a system of equations of motion, as proposed by Ehnes et al. [Ehnes, Hirota et al. 2004]. Two systems of control are possible, based on different hardware implementations:
1. Constant Speed Control, where the projector rotation speed is set (either manually, or by the system architecture) and constant for the whole of a movement, with the exception of very short acceleration and deceleration periods.
2. Constant Time Control, where a projector always takes a constant time to perform a movement by automatically varying its rotation speed. This is subject to its maximum rotation speed, which limits the angle it can rotate in a constant time. Angles greater than this require increasing amounts of time.
For both control systems the maximum rotation speed will never be reached on very short movements, as the projector acceleration and deceleration phases overlap. However, in practice, the maximum pan or tilt movement speed we can actually use while searching or tracking with a camera is primarily determined by the camera exposure length. Fast movements with long exposures will cause blurring in the camera image unless the object is moving in the same direction at exactly the same angular rate. This blurring can cause loss of tracking and reduced detection performance. Both control systems use the same equations of motion; however, we change the way the velocity variable is calculated for the different systems. To determine the current pan or tilt angle S(t) we simulate movement individually for each axis using the following equation:

S(t) = S_start + V (t − t_start)    (6.1)

where S(t) is the angle at time t, S_start is the initial angle and, in the constant speed control system, V is the signed rotation velocity (measured in degrees per second). In the constant time control system the velocity is proportional to the distance remaining to the destination angle, limited by the maximum velocity.
Consequently, V is calculated with the following equation:

V = (S_start − S_end) / T_const    (6.2)

where V is the signed rotation velocity, S_start is the initial angle, S_end is the destination angle, and T_const is the empirically measured constant time for the movement. Ehnes et al. [Ehnes, Hirota et al. 2004] extended this simulation to include initial acceleration by adding an acceleration constant A, which must be empirically measured. This allows us to incorporate the time required to accelerate from the previous velocity to a new velocity:

S(t) = S_start + V_start (t − t_start) + (A/2) (t − t_start)²    (6.3)

where V_start is the initial velocity, set for the constant speed control system or calculated for the constant time control system. While using an acceleration term may provide greater angle accuracy, it in turn relies on an accurate empirical measurement of the acceleration. This can be challenging without using actual sensors, as many pan and tilt units accelerate very rapidly. In the constant speed control system we can calculate T_motion, the time required for the motion (and hence, by adding the current time, the time at which the projector will arrive at the destination), using the simple equation:

T_motion = (S_start − S_end) / V    (6.4)

For the constant time simulation we simply assume the projector has arrived at its destination when t is greater than the constant time of motion.

6.2.4 Smart Objects

Smart objects describe themselves and their capabilities through knowledge embedded in the Object Model, contained in the object itself (as discussed in section 3.2). We assume that the smart objects automatically detect availability of a detection and projection service (simulating the availability of a discovery service such as UPnP [UPnP(TM)Forum 2003]), allowing them to cooperate with projector-camera systems and communicate this knowledge. Many everyday smart objects possess on-board sensing such as light, temperature and movement sensors associated with other purposes. The smart object cooperates with projector-camera systems to communicate this information, enabling serendipitous constraining of the detection process. Finally, the component assumes smart objects do not have access to external location systems precise enough to allow projector-camera systems to register a projected image with the object's surfaces. In the Cooperative Augmentation Framework it is the smart object which controls the interaction with projector-camera systems. The smart object issues projection requests to control how an output (the projection) changes. Projection directly onto the object is the visual feedback to interaction. The smart object is modelled as a state machine which responds to user interaction based on variations in input values, such as sensor inputs. This modelling is analogous to programming the smart object. To "program" the smart object state machine we define a set of states which are constantly evaluated against the sensors. Each state defines one or more sensors together with operations on the raw sensor values (such as calculating the variance, or the mean), minimum and maximum boundary values required to enter the state, and the method of combining results from individual sensor operations, such as boolean AND or OR.
Only when the operations performed on the raw sensor values evaluate to within the boundary values set and the required number of sensors (all for AND, or any one sensor for OR) are in range does the state change. A hierarchical tree of sensors, with each branch having its own method of sensor combination can be used to describe complex states. Each Object Model contains three separate classes of states, for three associated object characteristics: 1. States to determine when object is moving or static (and so generate moving events) 2. States to determine when object changes appearance or geometrical shape (for example, for articulated objects to update the architecture with a new appearance and 3D model) 3. States to determine when projections are requested. One state must be defined per projection. 6.2.4.1 Input Modalities for State Change Seven separate input modalities have been identified for use in modelling interaction with smart objects to enable the three separate classes of states to be programmed. We cluster the modalities into direct interaction (i.e. the user is physically touching the object) and indirect interaction: Direct interaction that can be sensed by the object: 1) Manipulation of object location and orientation, as sensed by the camera. 2) Manipulation of object geometry (for example, opening a book). 3) Manipulation of physical interaction components on and sensed by object (for example, direct interaction with buttons or dials on its surfaces). 4) Other manipulation of object, sensed by object (for example, shaking detected by an embedded accelerometer sensor). 5) Interactive Projected User Interfaces (sensed via a camera). Indirect interaction that can be sensed or used by the object: 6) Manipulation of physical environment remote to object (for example, switching the light on in the room). 7) Interaction with other smart objects in the environment. All these input modalities are treated as sensors in the system architecture. Consequently, the sensor knowledge contained in the Object Model varies depending not just on the sensors physically embedded within the smart object, but also on what modalities we want to use for input in our smart object program. Physical and projected user interface components, such as buttons, dials and sliders, are also modelled as sensors. Buttons are modelled with a range of 0-1, with 2 possible values of 0 (not pressed) and 1 (pressed). Dials and sliders are modelled as a continuous 103 6.2 ARCHITECTURE COMPONENTS value range between 0-100. Multiple axis controls are modelled as separate single axis controls. For example, a 2D joystick would be modelled as 2 sliders with separate physical axes. Additionally, the current projection can be used as a sensor. As smart objects are modelled with a state machine, all possible projections must be known a-priori. Consequently, each individual projection can be numbered and used as an input sensor value when programming an object’s states. For example, if projection image number 2 contains a button, we only want to detect whether the button is pressed when this projection is being displayed, as the other projections may not include buttons. While this requires us to model what is projecting in any individual state, it does not restrict what can be projected. For example, a video can be modelled as one projection in a single state, rather than a collection of individual numbered projection frames, each with its own state. 
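A minimal sketch of how such a state definition (sensors, operations, boundary values and a combination method, as described above) might be represented and evaluated is shown below. The structure and names are assumptions made for illustration, not the actual Object Model schema, and the operation results are assumed to have been computed from the raw sensor values beforehand.

#include <cstddef>
#include <vector>

// Illustrative sketch only; the layout and names are assumptions, not the
// actual Object Model schema.
enum class Operation   { Raw, Mean, Variance };
enum class Combination { And, Or };

struct SensorCondition {
    int       sensorId;   // which embedded (or virtual) sensor to evaluate
    Operation op;         // operation applied to the raw sensor values
    double    minValue;   // lower boundary required to satisfy the condition
    double    maxValue;   // upper boundary required to satisfy the condition
};

struct StateDefinition {
    int                          stateId;
    Combination                  combine;    // AND: all conditions, OR: any one
    std::vector<SensorCondition> conditions;
};

// Decide whether the state is entered, given one pre-computed operation result
// per condition (computed elsewhere from the raw sensor data).
bool stateSatisfied(const StateDefinition& s,
                    const std::vector<double>& operationResults)
{
    std::size_t inRange = 0;
    for (std::size_t i = 0; i < s.conditions.size(); ++i) {
        const SensorCondition& c = s.conditions[i];
        if (operationResults[i] > c.minValue && operationResults[i] < c.maxValue)
            ++inRange;
    }
    return s.combine == Combination::And ? inRange == s.conditions.size()
                                         : inRange > 0;
}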
6.2.4.2 State Machine Processing

A generic pseudocode example of the processing required to determine when a state change occurs, and hence when an event message is generated, is shown below:

foreach defined state s
    foreach sensor n in state s
        Run Operation defined for sensor n on raw sensor data
        if operation result > min limit and result < max limit
            sensor n result = 1
        else
            sensor n result = 0
        endif
    endfor
    if result combination method == AND
        if all n results == 1
            Change to state s
            End
        endif
    else if result combination method == OR
        if any of n results == 1
            Change to state s
            End
        endif
    endif
endfor

The state processing example above does not take into account hierarchical trees of sensors; however, this is supported simply by recursion of the processing algorithm. In this case the nested processing returns the result value of the processing rather than changing states directly, and the higher levels in turn use these results as sensor values in their own processing. In addition to the demonstration applications presented in Chapter 7, three simple examples of programming a smart object using the state-machine approach can be found in Appendix B.

6.2.5 Object Proxy

The object proxy translates messages between the native format of the smart object hardware and the architecture message format. Using a standard message format in the architecture that is different to the smart object format allows us to substitute different object hardware and only have to re-write the proxy. Unknown messages received from the smart object by the Object Proxy are converted to the architecture message format and broadcast on the network. Similarly, unknown messages addressed to the smart object received from the architecture by the Proxy are converted to smart object format and forwarded to the smart object. This automatic conversion process allows the object to implement functionality outside of our defined system architecture, with the only requirement being that other communicating applications use our architecture message format for communication. Similarly, it allows the smart object to decide where the abstraction of raw data values to events is performed, as events generated directly by the smart object are treated as unknown messages. A smart object is modelled as a state machine which emits events and projection requests in response to changes in sensor data, location, orientation or stored knowledge. The object proxy converts any sensor data streamed from the smart object directly to event messages if the smart object is unable to perform this operation due to lack of processing power. The object proxy application itself is generic, with conversion processing configured dynamically for each smart object from the Object Model. As discussed in section 6.2.4, the Object Model stores the description of sensors in the object and the method of converting their data to events. This conversion method is described in terms of operations (such as calculating the variance, or the mean), minimum and maximum threshold values and the method of combining results from the operations, such as boolean AND or OR. This conversion of data to events can be applied to any type of sensor data. The processing required to determine when a state change occurs, and hence when an event message is generated, is identical to that described in section 6.2.4.

6.2.6 Database Server

The smart objects initially register with the Database Server and provide a Unique ID (UID).
This creates the virtual object cache in the server which mirrors state changes of the physical object. Changes to the database state are either broadcast on the network to all applications or sent to the smart object, depending on the source and contents of the message. Other applications can now query the object state directly from the server. Maintaining an Object Model database has four additional benefits: 1. It synchronises access to smart object knowledge by different services and applications 105 6.2 ARCHITECTURE COMPONENTS 2. It coordinates distributed projector and camera hardware, allowing them to each offer separate projection and detection services, while maintaining a unified world model of the environment 3. It allows services to be stopped and subsequently re-initialised with current world state by querying the database. 4. It minimises network traffic to and from the physical smart object Object proxies are responsible for keeping the Object Model updated when the 3D model or appearance of the object changes. The database server additionally caches any display request messages or requests for interactive projected interfaces sent by the Object. This allows architecture applications to re-initialise and regain full knowledge of the object state by querying the database server. 6.2.6.1 Projector and Camera Database Similarly, projectors and cameras register with the Database Server, so it contains knowledge of all hardware in the “environment”. The concept of an environment is introduced here as many projector-camera systems may be deployed throughout a building. The concept itself is similar to the concept of sub-nets in IP address based networking, where small areas of a larger network are effectively segregated. Similarly, our concept has a single database server responsible for a small physical real-world area, or “environment” and a small group of projectors and cameras. This enables our approach to easily scale, allowing simultaneous use of many objects in a whole building. Similarly, if mobile, handheld or wearable projector-camera systems enter the environment this approach also maintains the benefits of hardware flexibility and performance. The server maintains a database of all projectors and cameras present in the environment. When a projection is requested by an object, this database is used to identify the physical detection and projection hardware currently detecting and projecting onto each object surface. We form loosely-coupled projector-camera system pairs from all the available hardware in the environment for each projection requested. We aim to dynamically modify the pairings so that the best projector and camera is always used to increase robustness. Detection and projection can continue as long as one camera and one projector can see the object, even if other hardware is occluded. For example, we can imagine an environment with two projectors and two cameras which form two separate physical projector-camera systems, as each camera is attached to the top of its respective projector. Both projector-camera systems are oriented the same way, but with a small horizontal separation between them. Consequently, there is a large horizontal overlap of the viewing frustums between the projector-camera systems. Inside the overlapped region we can use any (or all) of the cameras and projectors for the detection and projection tasks. 
If a person walks in front of the object, the actual hardware used will vary depending on which projector and camera is occluded as the person walks past. 6.2.6.2 Dynamic Pairing All cameras detecting a smart object return a location and orientation update to the database server. If more than one camera detects the object the location and orientation hypotheses are integrated to a single predicted location via a Particle filter [Pupilli and Calway 2006]. The predicted location is finally returned to the smart object in a location and orientation update message. 106 6.2 ARCHITECTURE COMPONENTS The Database Server stores an array of the UID of cameras currently tracking the object in the respective Object Model. We assume that the object is not being tracked if its location and orientation has not been updated by a detection service for a user defined time (e.g. 1 second). When an object requests a projection the request is sent to all projectors. Each projector returns a visibility score based on the relative location and orientation of the object’s projection location with respect to the projector. The database server manages the projectors to ensure only one projector is active for each requested projection. The returned values are ranked, and the highest scoring projector has the object set infocus (the concept of system focus is discussed in section 6.2.3). Lower scoring projectors have the object set to out-of-focus. The database server re-evaluates all visibility scores at a user defined interval (e.g. 1 second) and changes the projector’s focus as required. Two strategies are possible to prevent rapid flicking between different projectors with identical scores: 1. A user-configurable delay before change. 2. A re-focus threshold, so the projector chosen remains active until its score drops below a pre-defined minimum, irrespective of the ranking. 6.2.7 Knowledge Updating The design of the Cooperative Augmentation architecture provides a mechanism for re-embedding knowledge into smart objects. The knowledge is extracted by the knowledge updating component and re-embedded into a smart object via the Database Server and Object Proxy. This design has three main benefits. Firstly, it allows the objects to enter the environment with flexible amounts of appearance knowledge. Secondly, it allows projector-camera systems to use multiple-cues in detection to increase detection performance (as shown in section 5.2). Thirdly, new detection methods can be implemented in the projector-camera system for improved detection performance, without explicit modification to the Smart Object. Instead, the knowledge updating component automates appearance extraction and re-embedding in the object. To achieve this automation, the component uses images of the object’s surfaces extracted from the camera image by using the known 3D model and pose of the object when detected. The new detection method can be used in the knowledge update process to extract new appearance knowledge from each surface whenever the surface is at the optimum distance and orientation relative to the camera for training the detection methods (see section 4.4.2). This knowledge is re-embedded in the Object Model for more flexible and more robust detection. Newly extracted knowledge is merged with existing knowledge in the Object Model. If the knowledge is not contained in the object we simply add the knowledge to the Object Model. 
However, if the object already contains knowledge extracted with the same detection method we have three options: 1. Keep the existing knowledge 2. Replace the existing Knowledge with the new knowledge 3. Merge the existing Knowledge with the new knowledge 107 6.3 IMPLEMENTATION To determine which option to choose we can run the detection method twice per camera frame (once for the old knowledge, once for the new knowledge). At the end of a pre-determined period we decide which knowledge performs best and either keep the old knowledge or replace it. Knowledge can also be merged, however this has dangers as the resultant knowledge may not accurately reflect the object’s appearance if there was a misdetection, or the old and new knowledge were extracted under different conditions. For example, a blue chemical container would appear to have one hue under incandescent illumination (3200K temperature), and a different hue in daylight (6500K temperature) unless the camera is accurately colour calibrated in each case. Extracted appearance knowledge is stored in the Object Model on a per-surface basis, relating to surfaces defined in the 3D model. 6.3 Implementation This section describes a demonstration implementation of the Cooperative Augmentation conceptual framework using the system architecture designed above. As the system will be deployed in a lab environment, we aim for a level of performance to allow the implementation to function as an interactive demonstrator: 1. Based dimensions of the lab environment shown in Figure 6.5, the implementation should aim to detect, track and project onto objects up to a minimum of 6m from the projector-camera system. 2. The implementation should aim to achieve a maximum 20mm median projection location error and a maximum 1° median orientation error when projecting on a planar object at 3m distance from the camera. 2.6m 5.475m 5.765m Figure 6.5 Two views of a Lab Environment: (left) South-West Elevation, (right) North-East Elevation We implement smart objects using Smart-Its particle hardware [Decker, Krohn et al. 2005], with attached “ssimp” sensor board part (see section 2.2.1). A steerable projector-camera system (shown in Figure 6.6) similar to the ones described by Ehnes et al. [Ehnes, Hirota et al. 2004]and Butz et al. [Butz, Schneider et al. 2004] is implemented from a moving-head display light pan and tilt unit, a lightweight DLP projector and a firewire camera. As our scenarios incorporate mobile 108 6.3 IMPLEMENTATION objects, there is great benefit in using a moving head steerable projector system, as it allows a single system to be mounted in a location with the best view of the whole environment (the centre of the ceiling) and maximise the volume in which objects can be tracked. Figure 6.6 The moving-head steerable projector-camera system The pan and tilt unit is controlled by the computer using the DMX512 serial protocol and can move through an addressable 540° pan angle and 265° in tilt at up to 180° per second. The unit can rotate the projector-camera system hardware in two real-world dimensions – horizontal (pan) and vertical (tilt), about the approximate centre of projection (COP) of the projector. The projector is a 2800 Lumen Casio DLP projector with large 2x zoom capability, allowing us to maintain relatively high resolution projection on objects at large distances. 
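Returning to the DMX512 control of the pan and tilt unit mentioned above, such control implies a mapping from world angles to channel values. The sketch below assumes, hypothetically, a 16-bit coarse/fine channel pair per axis addressed linearly over the quoted 540° pan and 265° tilt ranges; this is a common moving-head convention used here for illustration, not the documented behaviour of this particular unit.

#include <cstdint>
#include <algorithm>

// Hypothetical sketch: map a desired pan or tilt angle (degrees) to a 16-bit
// DMX512 value split over a coarse and a fine 8-bit channel, assuming the
// axis range is addressed linearly (e.g. 540 degrees pan, 265 degrees tilt).
struct DmxValue { std::uint8_t coarse; std::uint8_t fine; };

DmxValue angleToDmx(double angleDeg, double axisRangeDeg)
{
    double clamped = std::min(std::max(angleDeg, 0.0), axisRangeDeg);
    auto value = static_cast<std::uint16_t>((clamped / axisRangeDeg) * 65535.0);
    return { static_cast<std::uint8_t>(value >> 8),
             static_cast<std::uint8_t>(value & 0xFF) };
}

// e.g. angleToDmx(270.0, 540.0) points the pan axis at its mid position.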
The firewire camera is a Pixelink 1280x1024 resolution colour CMOS camera capable of 27 frames per second (fps) at full resolution and up to 104fps with a 640x480 region of interest enabled. The camera is used with a 12mm C-Mount lens with 40x30° field of view (FOV). More information on the steerable projector-camera system is found in Appendix A. The implementation of the architecture components are described in detail in sections 6.3.4 to 6.3.3. Each component was implemented in C++ as a stand-alone application on the network and use a common messaging format for communication, as described in section 6.3.6 below. 6.3.1 Detection and Tracking The detection and tracking service is started with an XML configuration file providing four further aspects of information: 1. 2. 3. 4. Name or unique identity (UID) of camera hardware (to address it on the network). Location in world coordinate system (WCS) if static. Orientation in WCS if static. Optical parameters or camera calibration file. If the camera system is a steerable system, the Pan and Tilt tracking component will automatically provide the system with the current hardware location and orientation. 109 6.3 IMPLEMENTATION The detection system attempts to detect all smart objects in the environment. As this would have a significant processing cost with large numbers of objects, the detection system first checks to ensure an object is visible. If the object has not been detected yet, or the Database Server indicates the object is not currently tracked, then the system automatically attempts to detect it. If the Database Server stores a location and orientation for the object and indicates it is currently tracked, the detection system calculates its viewing frustum in the world coordinate system based on its current location and orientation. The location of the object is checked against the viewing frustum to see if the object is inside. Only when the object is potentially visible does the system attempt to detect it. One detection algorithm per cue (colour, texture, shape, features) is implemented in the detection system, as described in section 4.2. Algorithms are implemented using OpenCV [IntelOpenCV 2007] to benefit from its built-in algorithm optimisations for Intel x86 CPUs. Additionally, as many image-processing tasks are inherently parallelisable, we also implement components of the detection system on the graphics card (GPU). The user can switch between CPU-only and combined CPU-GPU implementations by re-compiling the detection system application with different preprocessor parameters. Common tasks such as gaussian blur, absolute difference, background subtraction and histogram creation are all implemented on the GPU using NVIDIA’s “Compute Unified Device Architecture” (CUDA) language. This allows algorithms to be written in a C based language, yet benefit from a typical 10-15% speedup when executed on the GPU. For example, a CPU optimised version of the SIFT local feature algorithm takes approximately 333ms to detect a single object in a 640x480 pixel image [Lowe 2004], whereas a GPU version only takes 100ms [Sinha, Frahm et al. 2006]. As CPU-GPU data transfers are typically a bottleneck, each camera image is uploaded to the GPU following capture and as much processing as possible is performed on the GPU before returning any result data to the CPU. 
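The upload-once, process-on-GPU pattern described above can be sketched as follows. Since the original implementation used hand-written CUDA kernels, the sketch instead uses OpenCV's later cv::cuda module purely as a stand-in; it assumes a greyscale 8-bit camera frame, and the threshold of 30 intensity levels is an arbitrary illustrative value.

#include <opencv2/core.hpp>
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudaarithm.hpp>
#include <opencv2/imgproc.hpp>

// Sketch of the upload-once, process-on-GPU pattern: the greyscale camera
// frame is uploaded, differenced against the previous frame and thresholded
// on the GPU, and only the resulting motion mask is downloaded to the CPU.
cv::Mat gpuMotionMask(const cv::Mat& currentFrame, cv::cuda::GpuMat& previous)
{
    cv::cuda::GpuMat current, diff, mask;
    current.upload(currentFrame);                    // host -> device (the bottleneck)

    if (!previous.empty()) {
        cv::cuda::absdiff(current, previous, diff);  // per-pixel absolute difference
        cv::cuda::threshold(diff, mask, 30.0, 255.0, cv::THRESH_BINARY);
    }
    previous = current;                              // keep this frame on the GPU

    cv::Mat result;
    if (!mask.empty()) mask.download(result);        // device -> host (mask only)
    return result;
}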
6.3.1.1 Figure-Ground Segmentation for Object Motion Detection The detection method selection is performed for each object based on what appearance knowledge the object holds, the object context (which we can obtain from a background model) and whether the object contains a movement sensor. To generate a motion mask and basic figure-ground segmentation of the target object the current system architecture implementation uses a simple absolute difference operation between the previous camera frame and the current frame whenever the object’s sensors detect motion. This method is fast to compute and works well when the camera is static or only moving slowly, increasing the probability of correct detection (as demonstrated in section 5.2). Another method for mobile or fast moving cameras that are constrained to move in a known path (e.g. a steerable projector) is to maintain a background model. To create the background model for steerable projectors we implemented a separate application to rotate the steerable projector to cover their whole Field Of View (FOV) while capturing camera images. For moving head steerable projectors the FOV is typically a full hemisphere. The application also requires knowledge of the camera Field Of View (FOV) to calculate how many photos are required to achieve full coverage. Each photo overlaps with its neighbours by 1/3, allowing automatic stitching and blending of the images into a single rectangular texture. An example of a 360x90º FOV background 110 6.3 IMPLEMENTATION model stitched from 16x4 overlapping images then re-projected into a hemisphere can be seen in Figure 6.7. As the capture and stitching process typically takes several minutes, we assume it is performed off-line before object detection (for example, the environment could be scanned automatically overnight). Figure 6.7 View down into a background model re-projected into a hemisphere, captured by rotating a moving head steerable projector through a 360x90º FOV. During on-line detection the known pan and tilt unit orientation and camera FOV allows us to calculate the area of the background model currently viewed by the steerable projector for visual differencing, for use as context information or for update. To reduce both memory requirement and the chance of any small inaccuracies in the pan and tilt unit orientation affecting the visual differencing we can reduce the resolution of the stored background model to 50% or 25% of the original size. To increase robustness we could also use visual differences generated between the camera image and a Gaussian-mixture model of the background [Stauffer and Grimson 1999] to provide a basic figure-ground segmentation to the detection algorithms. The Gaussian-mixture mode model is calculated independently for each pixel in the stitched rectangular texture background model and selectively updated based on the area of the background model currently viewed by the steerable projector. For mobile, handheld or wearable cameras a background model can be built using structure-from-motion SLAM techniques [Davison and Murray 2002; Davison 2003; Chekhlov, Gee et al. 2007; Klein and Murray 2007]. 6.3.1.2 Detection Method Selection If the object only has knowledge of a single appearance cue we automatically choose the respective detection method. 
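Returning briefly to the figure-ground segmentation of section 6.3.1.1 above, the following is a minimal sketch of the frame-differencing step, assuming the OpenCV C++ interface; the threshold value and morphological kernel size are illustrative choices, not values taken from the thesis.

#include <opencv2/opencv.hpp>

// Absolute difference between consecutive frames, computed only while the
// object's own sensors report movement, as a basic figure-ground segmentation.
cv::Mat motionMask(const cv::Mat& previousFrame, const cv::Mat& currentFrame,
                   bool objectReportsMotion)
{
    if (!objectReportsMotion)
        return cv::Mat();                                   // no sensed movement: skip
    cv::Mat prevGray, currGray, diff, mask;
    cv::cvtColor(previousFrame, prevGray, cv::COLOR_BGR2GRAY);
    cv::cvtColor(currentFrame,  currGray, cv::COLOR_BGR2GRAY);
    cv::absdiff(prevGray, currGray, diff);                  // frame differencing
    cv::threshold(diff, mask, 25, 255, cv::THRESH_BINARY);
    cv::morphologyEx(mask, mask, cv::MORPH_CLOSE,           // close small holes
                     cv::getStructuringElement(cv::MORPH_RECT, cv::Size(5, 5)));
    return mask;
}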
However, as explained in section 6.2.1, when an object has appearance knowledge for multiple cues we can choose whether to perform a single detection method which prioritises an aspect of detection (speed, accuracy or robustness), or perform multiple methods and combine the results. The multi-cue detection results in Chapter 5 indicated only the best two cues contributed significantly to successful object detection, hence when detection metrics are available for multiple methods we can choose the top two. We select multiple algorithms in two cases: 111 6.3 IMPLEMENTATION 1. For first detection of an object, to maximise the chance of detection. Here we process the top 2 algorithms sequentially, in order of ranking, and use the detection results of the first algorithm as a figure-ground segmentation for the second algorithm. 2. When the average runtimes of the top two ranked algorithms are less than 10% different. In this case we execute the algorithms in parallel, either in multiple threads on the CPU (taking advantage of multi-core CPUs) or with one on the CPU and one on the GPU. This approach retains the benefit of a multi-cue system with minimal impact on performance. The cue results are combined at the end of the processing step using a binary OR operation on the detection areas in the camera image for colour, texture and shape, or by masking the detected local features before the pose calculation step. The detection process aims to maintain between 5 and 60 frames per second for interactive system performance. As we process every camera frame, the exact frames per second achieved will vary, depending on the user-configurable camera frame rate, the execution time of the detection algorithms and the number of objects to be detected. To reduce processing time for multiple objects we separate detection algorithms into operations which are common for multiple objects (such as the initial difference-ofGaussians pyramid creation for the SIFT local feature algorithm) and object specific operations (such as matching the object appearance to the processed camera frame). While requiring sufficient memory to store all intermediate results, this allows the majority of image processing to be performed only once per camera frame, irrespective of the number of objects using the detection method. 6.3.1.3 Pose Calculation We perform a two-step pose calculation. Firstly, the object pose is calculated either directly from matched local feature correspondences or by fitting the 3D model to edges detected in the 2D image region from the detection step. The local feature pose calculation algorithm calculates a homography matrix for a planar surface and for nonplanar objects the Direct Linear Transform (DLT) algorithm is used. The calculated matrix is decomposed to extract the location and orientation of the object with respect to the camera. The 3D model fitting initially detects edges in the camera image around the location of detection using the Canny algorithm [Canny 1986]. A RAPiD-like algorithm [Harris 1993; Armstrong and Zisserman 1995; Gee and Mayol-Cuevas 2006] is then used to project lines from the model into the camera image at multiple orientations and scales, centred around the detection location and test to see which pose has the best fit to the detected edges. If the smart object contains 3D accelerometer sensors these can additionally be used in the pose computation step when performing the 3D model fitting. 
In this case the sensed gravity vector is directly used to constrain the orientation, and hence the number of 3D model poses that must be tested to match the edges detected in the image, as explained in section 5.3.4. These detection and pose calculation steps requires the camera calibration and use the RANSAC algorithm [Fischler and Bolles 1981] for both robust model parameter estimation and for eliminating incorrectly matched correspondences. This initial calculation is followed by the second step where we perform an iterative pose refinement to increase accuracy, using the Gauss-Newton algorithm. An example of the 112 6.3 IMPLEMENTATION registration of an object with its 3D model using the result of the pose calculation is seen for the book object in Figure 6.8 (left). 6.3.1.4 Tracking There are three main sources of latency in the implementation – camera frame acquisition, image processing for object detection and projection. For a camera running at 30Hz the frame acquisition takes up to 33.3ms, while for a 60Hz projector a frame is projected every 16.7ms. Hence, maximum latency before image processing is 50ms (assuming unsynchronised cameras and projectors). It was reported by Brooks [Brooks 1999] that users of projector based interactive systems routinely accept total system latencies of 150ms, so we should aim to perform the detection step below 100ms. In contrast, the CPU runtime of the natural appearance algorithms we use in the architecture is between 288ms and 852ms per 1280x1024 pixel frame (as shown in section 5.2.4). For the demonstration applications presented in Chapter 7, these natural appearance detection algorithms do not perform in real-time, hence, our implementation uses a tracking system to speed up the detection process after first detection. A recursive tracking approach constrains the area of each camera frame used to detect the object based on its previous location, increasing the frame rate to interactive levels (>5fps). An example demonstrating tracking of multiple detected objects is seen in Figure 6.8. To increase tracking robustness we change the tracking method used to use one of two methods, depending on the appearance description in the object. The initial detection is performed by the natural appearance cues, then either the 3D model is used for RAPID-like tracking [Harris 1993; Armstrong and Zisserman 1995; Gee and MayolCuevas 2006] and pose calculation, or corner features are extracted on the object’s surfaces using FASTcorners [Rosten and Drummond 2005; Rosten and Drummond 2006] and a CONDENSATION particle filter [Isard and Blake 1998; Pupilli and Calway 2006] used for recursive tracking, with Normalised Cross Correlation (NCC) of image patches around detected corners on the object. NCC is not scale or rotationinvariant, hence this tracker fails and must be re-initialised whenever the object is scaled or rotated (both in the camera plane and in 3D) significantly from the original pose where features were extracted. Figure 6.8 (left) Image of detected of Book object with overlaid 3D model and object coordinate system axes, (right) Four detected objects tracked simultaneously - the Chemical Container, the Book , the Card and the Cup (green lines denote the maximum extent of the 3D model). 113 6.3 IMPLEMENTATION Objects with surface texture (texture or local features appearance descriptions) use NCC tracking, while the RAPID-like tracker is used for objects with well defined edges (for the colour or shape appearance descriptions). 
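The pose calculation of section 6.3.1.3 can be approximated with standard OpenCV calls; the sketch below is not the thesis implementation (which uses a DLT/homography decomposition followed by Gauss-Newton refinement) but illustrates the same idea of a RANSAC-robust pose estimate from matched local features on a calibrated camera.

#include <opencv2/opencv.hpp>
#include <vector>

// objectPoints: 3D feature locations on the object model (planar or not);
// imagePoints: the matched feature locations in the current camera frame.
bool poseFromMatches(const std::vector<cv::Point3f>& objectPoints,
                     const std::vector<cv::Point2f>& imagePoints,
                     const cv::Mat& cameraMatrix, const cv::Mat& distCoeffs,
                     cv::Mat& rvec, cv::Mat& tvec)
{
    if (objectPoints.size() < 4 || objectPoints.size() != imagePoints.size())
        return false;
    // RANSAC rejects the incorrect correspondences that survive feature matching.
    cv::solvePnPRansac(objectPoints, imagePoints, cameraMatrix, distCoeffs, rvec, tvec);
    return !rvec.empty();                 // rvec/tvec: object pose relative to the camera
}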
If appearance descriptions allowing use of both the NCC tracker and RAPID tracker are present, NCC takes precedence. A complete detection step is performed again whenever tracking is lost. Loss of tracking is determined by a tracking quality figure, dependant either on the number of line-segments matched for the RAPID-like tracker, or the number of NCC image patch features matched (hence, on the similarity of the extracted features to features seen in the current image) in the particle filter. 6.3.1.5 Vision-Based Interaction Detection Messages sent from the smart object requesting interaction components (such as buttons and sliders) contain the location for virtual interactive areas on the object’s 3D model and values to return when activated. For each camera frame, the system calculates whether any of the active areas on the set of visible objects are themselves visible. If an interactive area is visible it is extracted from the camera image by first projecting the 3D corner locations into the image using the known object location, orientation and camera calibration. The quadrilateral area in the camera image is unwarped to a rectangular image of user-specified size, typically 100x100 pixels. A finger tracking process is run on the rectangular image which performs a normalised cross-correlation of a semi-circular fingertip template image with the image area. The maxima locations are found in the result image and if greater than a userdefined threshold (typically 0.6 or 0.7) the template is considered a match. This interactive area is monitored whenever the object is visible and a motion history silhouette image created using accumulation of background subtracted images of recent frames. This silhouette image enables calculation of the gradient orientation, allowing us to check for the “lightening strike” touch of button activation whenever a fingertip template is matched inside the interactive area, as explained in section 2.3.5. For an interactive button, the first fingertip match inside the area with the correct motion profile is considered as button activation and the process returns the value requested. We assume the finger is still inside the area while the cross-correlation results return a match, so prevent repeat activation of the button if the user’s finger is hovering. An interactive slider returns a value depending on the location of the maxima along the largest axis of the slider area. This axis is interpolated linearly to return a value between the minimum and maximum values specified by the smart object. 6.3.2 Projection In a similar way to the detection service, the projection service is started with an XML configuration file providing four further aspects of information: 1. Name or unique identity (UID) of projector hardware (to address it on the network). 2. Location in world coordinate system (WCS) if static. 3. Orientation in WCS if static. 4. Optical parameters of the projector. 114 6.3 IMPLEMENTATION If the projector system is a steerable system, the Pan and Tilt tracking component will automatically provide the system with the current hardware location and orientation. For the implementation of the projection service we use OpenSG API scenegraph [OpenSG 2007], providing an object-oriented approach to building a world model. The FreeGLUT and OpenGL APIs are used for windowing and graphics rendering respectively. 
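Before moving on to the projection internals, a minimal sketch of the interaction detection described in section 6.3.1.5 above, using the OpenCV C++ interface: the button quadrilateral is unwarped to a 100x100 patch and searched for the fingertip template with normalised cross-correlation. The 0.6 threshold follows the text; the grayscale conversion and helper signature are my simplifications, and the motion-history check is omitted.

#include <opencv2/opencv.hpp>
#include <vector>

bool fingertipInButtonArea(const cv::Mat& cameraImage,
                           const std::vector<cv::Point2f>& buttonCornersInImage, // 4 projected corners
                           const cv::Mat& fingertipTemplateGray,                 // semi-circular tip template
                           double threshold = 0.6)
{
    const cv::Size patchSize(100, 100);
    const std::vector<cv::Point2f> rect = {
        cv::Point2f(0.f, 0.f), cv::Point2f(99.f, 0.f),
        cv::Point2f(99.f, 99.f), cv::Point2f(0.f, 99.f) };

    // Unwarp the quadrilateral interactive area to a rectangular patch.
    const cv::Mat H = cv::getPerspectiveTransform(buttonCornersInImage, rect);
    cv::Mat unwarped, unwarpedGray;
    cv::warpPerspective(cameraImage, unwarped, H, patchSize);
    cv::cvtColor(unwarped, unwarpedGray, cv::COLOR_BGR2GRAY);

    // Normalised cross-correlation of the fingertip template over the patch.
    cv::Mat result;
    cv::matchTemplate(unwarpedGray, fingertipTemplateGray, result, cv::TM_CCORR_NORMED);
    double maxVal = 0.0;
    cv::minMaxLoc(result, nullptr, &maxVal);
    return maxVal > threshold;
}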
The scenegraph models the real world, with a virtual camera in the scenegraph placed at the location of the real-world projector and modelled with the same optical characteristics of the real-world projector from the configuration file. Any object registering for a projection has their 3D model added directly to the scenegraph hierarchy. The object’s surfaces are set to black (invisible) and the model placed at the location of the real-world object in the 3D scene when its pose is calculated. When the smart object requests a projection its message includes both the content to project (which can be images, text or video or a URL where content can be found) and the location to project it. We use the OpenCV [IntelOpenCV 2007] API to load the content to project, supporting the following image file formats: BMP, DIB, JPEG, JPG, JPE, PNG, PBM, PGM, PPM, SR, RAS, TIFF, TIF and AVI video files. The system can project onto any object area visible to the projector. The location description refers to the projection location abstractly or specifically. Abstract locations refer to faces of the object’s 3D model by their name. For example, a projection can be requested on the top or front face if these are declared in the 3D model. A smaller or more specific area can also be specified as coordinates in the 3D model coordinate system, allowing exact placement and sizing of the projection on an object. In this case, for the duration of the projection a new planar polygon is created with the specified coordinates and added to the 3D model. To create a projection we texture map the content of the projection either directly onto one of the faces of the object in the 3D scene, or onto the newly created polygon. This texture image is automatically updated in the case of video content to achieve the correct frames-per-second playback speed. 6.3.2.1 Geometric, Photometric and Colorimetric Correction Geometric correction is automatic for planar surfaces due to the use of projective texture mapping, where the virtual scene arrangement mirrors the real-world object and projector relationships. Similarly, we use the calibrated real-world projector intrinsic parameters for the optical characteristics of the virtual camera. With curved geometries we can use vertex and pixel shaders in NVIDIA’s Cg language for Quadric Image Transfer, as described by Bimber and Raskar in [Bimber and Raskar 2005]. We use a colour correction algorithm to change the texture image, correcting for nonuniform and non-white surface colours. An image of the object without projection can also be calculated as part of this process and used for object detection. The real-time algorithm by Fujii et al. [Fujii, Grossberg et al. 2005] is used for this step, however, the correction has a one camera frame delay cost, to allow the latest camera image to be used in the algorithm. Additionally, the algorithm requires an initial one-time projection of four colour calibration image frames (red, green, blue and grey) to recover the reflectivity response of the surface. The adaptation algorithm allows colour correction for each subsequent frame to be projected without projecting the calibration frames again. However, the algorithm cannot completely correct very saturated surfaces, as the dynamic range of typical projectors is not sufficient to invert 115 6.3 IMPLEMENTATION the natural surface colour. This effect can be seen in the projection on the red top surface of the box object in Figure 6.9 (left). 
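The compensation step can be summarised with a simple per-pixel radiometric model (notation mine; a sketch of the general idea behind the Fujii et al. adaptation rather than their exact formulation). The camera measurement c of a projected input p is modelled with a colour mixing matrix V, which folds together surface reflectance and projector-camera colour coupling, plus an environment term f recovered from the calibration frames:

\[ c = V\,p + f \]

so, to make the surface appear with a desired colour d, the projector input is chosen as

\[ p = V^{-1}(d - f), \]

clipped to the projector's dynamic range, which is why strongly saturated surfaces such as the red box top in Figure 6.9 (left) cannot be fully compensated.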
Figure 6.9 (left) 3 surface projection on box object, (right) Sensed temperature projected on the non-planar smart mug surface - the blue wire at the right is the antenna of the Smart-Its device 6.3.2.2 Projector Powered Focus Control When a projector-camera system includes a projector with a powered focus the architecture can control this to dynamically focus the projector on the objects currently “in focus” in the architecture (“focus” as described in 6.2.3). The architecture requires calibrated focus data (calibrated as described in section 6.2.2) to be able to calculate the correct focus setting for an object distance. The focus setting-distance calibration data is loaded from a configuration file and a non-linear function fit to the data. Whenever the object “in focus” changes, the architecture outputs focus-near or focusfar commands, depending on whether the object is closer or more distant than the current focus distance. If the projector includes a serial port for computer control, the Focus Control can output projector-specific control strings. These control codes are loaded from a configuration file and consist of a series of bytes for the focus-near and focus-far functions. These codes are output on COM2 serial port when required. If the projector does not have a serial port, LIRC [Bartelmus 2007] can be used to control the projector functions by generating Infrared remote control codes to a connected IR transmitter. Here again, the codes are loaded from configuration file and consist of a series of bytes for the focus-near and focus-far control functions. The codes are sent to a network port where the LIRC API listens and translates the codes to a series of pulses to be sent to an attached IR transmitter which converts them to IR light pulses. IR transmitters can either be purchased or easily be constructed. As there is no focus setting feedback from the IR remote control method the architecture outputs a continuous stream of focus-near commands for several seconds when the software is first started, to guarantee the focus has moved to the closest focus setting. The calibration data can then be used to change focus from this known initial focus setting. In this case, focus commands are continuously sent with a pause of 0.5s in-between to give the hardware time to react, until the correct focus distance is reached. 116 6.3 IMPLEMENTATION 6.3.2.3 Projector Powered Zoom Control In a similar way to the dynamic projector focus control, if a projector-camera system includes a projector with a powered zoom we can dynamically zoom-in the lens to achieve a high resolution projection on distant objects or zoom-out to project on large close objects. With dynamic zoom we can trade-off projection resolution for Field Of View (FOV). However, in addition to changing the projector intrinsic parameters, changing the zoom changes the projector-camera transformation (extrinsic) matrix. Hence, projectorcamera calibration must be performed for every zoom step (as described in 6.4) and all the intrinsic and extrinsic matrices loaded to be used at the respective zoom setting. We additionally require calibrated zoom data (as described in section 6.2.2) which describes the projector image FOV and corresponding zoom settings. An identical control method to that used for Projector Powered Focus Control is used to change the zoom – either by output projector-specific control strings to the serial port, or by sending Infrared remote control codes to the LIRC network port. 
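As an illustration of the focus and zoom stepping just described, the sketch below drives a lens towards the calibrated setting for a new object distance with 0.5s pauses between commands. The FocusController class, its callbacks and the one-unit step size are hypothetical; the real component loads its command bytes and calibration curve from configuration files.

#include <chrono>
#include <cmath>
#include <functional>
#include <thread>

class FocusController {
public:
    using SendCommand = std::function<void(bool focusNear)>;   // serial or LIRC back-end

    FocusController(SendCommand send, std::function<double(double)> distanceToSetting)
        : send_(std::move(send)), distanceToSetting_(std::move(distanceToSetting)) {}

    // Step towards the setting for the new "in focus" object distance, pausing
    // 0.5s between commands to give the lens hardware time to react.
    void focusAtDistance(double metres)
    {
        const double target = distanceToSetting_(metres);      // from the fitted calibration curve
        while (std::abs(target - currentSetting_) >= 1.0) {
            const bool nearStep = target < currentSetting_;
            send_(nearStep);                                    // focus-near or focus-far command
            currentSetting_ += nearStep ? -1.0 : 1.0;
            std::this_thread::sleep_for(std::chrono::milliseconds(500));
        }
    }

private:
    SendCommand send_;
    std::function<double(double)> distanceToSetting_;
    double currentSetting_ = 0.0;   // assumes an initial drive to the closest setting
};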
6.3.3 Pan and Tilt Control The pan and tilt component is used whenever either (or both) projector or camera hardware is attached to a pan and tilt unit. In a similar way to the detection and projection services, the pan and tilt component is started with an XML configuration file providing four further aspects of information: 1. Location of the Centre of Rotation (COR) of the Pan and Tilt platform in the world coordinate system. 2. Unique ID (UID) of Projection hardware and Offset transformation of the Projector from COR (if applicable). 3. Unique ID (UID) of Camera hardware and Offset transformation of the Camera from COR (if applicable). 4. Calibration of the Pan and Tilt unit, to allow control in angles rather than hardware units. The pan and tilt component automatically searches for and tracks mobile smart objects in the environment. When only a single object exists in the environment and its location is unknown, the pan and tilt component will automatically search for it using the creeping line algorithm. When multiple unknown objects exist the component will keep searching until it either finds all the objects, or an object requesting a projection (in which case it will track this object). If an object being tracked stops requesting a projection, the component will return to searching for objects if there are objects with unknown locations in the environment and no other objects requesting projection. If another object with previously known location requests a projection it will return to this location, perform an expanding-box search to locate the object and track this object. If multiple objects request projection the tracking is performed using a 3D clustering with the K-means algorithm on location information of the objects. This is followed by selection of the target objects to be tracked based on recent object interaction history (we assume interaction takes the form of change in location, orientation or manipulation of the object). We track whichever object in the cluster was last interacted with. This aims to track the largest group of objects where interaction has recently occurred. 117 6.3 IMPLEMENTATION 6.3.3.1 Pan and Tilt Hardware Control Pan and Tilt hardware control is achieved by outputting value strings via the serial port to a serial-to-DMX512 converter. The strings output channel number (relating to control channels in the pan and tilt hardware), followed by 8-bit channel value (0-255). The Pan and Tilt unit calibration loaded from the configuration file specifies how these channels and values map to pan and tilt angles in the real world, allowing the architecture to control the unit without knowledge of the hardware. The pan and tilt component automatically updates the detection and projection services with the location of the camera or projector respectively, based on knowing which hardware is attached to the unit and the current location and orientation of the pan and tilt unit derived from the motion simulation discussed in section 6.2.3. If the projector or camera hardware does not have a static pan and tilt unit (such as mobile or handheld projector-camera system) another application is required to calculate and update the detection and projection services with the hardware location and orientation. 6.3.4 Object Proxy Messages from the Smart-Its device are broadcast on an RF channel to Smart-Its bridge hardware, which re-transmits the packets on the local IP subnet as UDP messages containing the data in the Smart-Its AwareCon protocol [Decker, Krohn et al. 
2005]. To replicate the existence of a discovery service a listener application was implemented to automatically start and stop smart object Proxy applications. The listener application listens for network traffic generated by Smart-Its devices. If a new Unique ID (UID) of a Smart-Its device appears on the network an Object Proxy is spawned to automatically handle communications with the object. The listener application maintains an internal database of devices active on the network. If a device has not been active for a pre-determined period of time it sends a query message to the device. If no response is received following a short time-out the listener sends a stop message to the respective Object Proxy. The Object Proxy is implemented as an application which converts messages between the AwareCon protocol and the EiToolkit [Holleis 2005] protocol used for inter-process communication in our architecture. There is no direct routing of messages from the proxy to services due to the absence of a discovery service; instead all messages are simply broadcast as UDP messages on the network. 6.3.4.1 State Machine and Sensor Data Abstraction to Events The proxy implementation is generic. Each proxy spawned by the listener application is configured by an XML Object Model file either loaded directly from the Smart-Its device, or from an accessible network resource. The XML file contains specifications of sensor configurations for each state. A sensor configuration is composed of one or more sensors and associated sensor ranges which the sensor must be between to enter the state. Three separate classes of states are defined in the XML file, for the three associated object characteristics of movement, geometry or appearance change, and for projected content modification, as discussed in 6.2.2. 118 6.3 IMPLEMENTATION Abstraction of sensor data to events is performed either in the smart object or in the proxy itself, depending on the performance of the smart object hardware. Conversion of streamed sensor data in the proxy is implemented using operations performed by the Common Sense Toolkit (CSTK) [VanLaerhoven 2006]. CSTK provides type independent operations with the support for a variety of mathematical methods, such as min, max, mean, median, variance, running variance, standard deviation. The XML Object model file specifies the buffer length of raw data samples on which the operation should be performed, together the name of the sensor, operation and its associated numerical minimum and maximum in ascii text. Conversion of sensor data and state evaluation is performed by the proxy at between 30 and 100 iterations per second (this is user configurable). When the proxy is first started, the initial states will be broadcast following filling of the CSTK buffers with data to avoid incorrect results from initial operations. The proxy then broadcasts an event message for any subsequent change in state. On receipt of messages addressed to the smart object, the object proxy caches the messages if raw sensor data is being streamed to the Object Proxy. As all important operations occur in the proxy, periodic updating of the smart object is sufficient to allow state synchronisation while maintaining bandwidth for sensor data streaming. 6.3.5 Database Server When smart objects register a cache of their Object Model is stored in a vector by the database server. A registering smart object must provide the following minimum information from the Object Model: 1. A name or unique identity (UID). 
This is usually the IP or MAC address.
2. A 3D model of the current object shape in VRML format.
3. An appearance model containing a minimum of one appearance description.
The information is either included in the message as XML, or optionally the message contains the URL of an XML file containing this information on an accessible network resource. Similarly, when a detection or projection service registers, this information is again stored in a vector by the database server for use in dynamic projector and camera pairing. The following configuration information must be provided:
1. A name or unique identity (UID) of the hardware (to address it on the network).
2. The hardware type – either projector or camera.
6.3.6 Networking Protocol Implementation
The EiToolkit [Holleis 2005] (developed in conjunction with HCILab, Ludwig-Maximilians-University Munich) was used to provide a common networking component for all distributed applications. EiToolkit provides support for event-based programming, with mechanisms for receiving messages, message decoding and function callback execution. Interface querying is also supported, allowing a proxy or application to interrogate the capabilities of others. EiToolkit also supports automatic translation between the Smart-Its [Decker, Krohn et al. 2005] sensor node AwareCon protocol and the EiToolkit native format via a dedicated proxy. The message format of EiToolkit is plain text, in the form: <senderID>:<msgType>:<message>:<destinationID>. The senderID and destinationID are unique identifiers, and can be the IP of the hardware running the proxy or application, or a text-based user-defined unique identifier (e.g. "Projector1"). Message broadcast is possible by specifying the destinationID as '*'. The msgType defines the command or interface callback to execute on the receiving system. The <message> is the payload or data to be used in the executed callback.
Figure 6.10 Architecture Message Protocol and Routing (diagram of the messages exchanged between the Detection, Projection, Database Server, Object Proxy, Pan & Tilt and Network Listener components).
We designed a communication protocol based on the conceptual framework in Chapter 3. Figure 6.10 illustrates the architecture message protocol and typical routing.
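A small sketch of composing and splitting the plain-text message format quoted above; the helper functions are illustrative and are not part of the EiToolkit API, and payloads containing ':' would need escaping that is omitted here.

#include <sstream>
#include <string>
#include <vector>

// <senderID>:<msgType>:<message>:<destinationID>, e.g. "Proxy42:ObjectMoving:42:*"
std::string composeMessage(const std::string& sender, const std::string& msgType,
                           const std::string& payload, const std::string& destination)
{
    return sender + ":" + msgType + ":" + payload + ":" + destination;
}

std::vector<std::string> splitMessage(const std::string& raw)
{
    std::vector<std::string> fields;
    std::stringstream ss(raw);
    std::string field;
    while (std::getline(ss, field, ':'))
        fields.push_back(field);            // sender, msgType, payload, destination
    return fields;
}

// A destinationID of "*" addresses all listeners (broadcast), as described above.
bool isBroadcast(const std::vector<std::string>& fields)
{
    return fields.size() == 4 && fields[3] == "*";
}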
Red messages are from other components to the Database Server, Blue messages are from the Database Server to other components, green messages are Pan and Tilt position updates and pink messages are for colour correction. These messages used in the system architecture are broadly separated into four groups: 1. Messages between the smart object Proxy and Database Server. 2. Messages from Database Server, common to multiple applications. 3. Messages for between Database Server, Detection Application and Projection Application for colour correction. 120 6.3 IMPLEMENTATION 4. Messages from Pan and Tilt to either or both Detection and Projection Applications. These four groups are explained in more detail below. 1. Messages between the smart object Proxy and Database Server are listed below, together with the message contents. RegisterObject UnRegisterObject UpdateObjectModel AddVirtualDisplay AddVirtualButton AddVirtualSlider ObjectMoving ObjectStatic Registration request when proxy starts. Message contains object ID. UnRegister request sent when proxy stops. Message contains object ID Message contains object ID and updates to either, or both, model geometry and object appearance. Projection request message. Message contains object ID, content to project (or URL) and location to project on object. Interactive user interface request message. Message contains object ID and location of button on object and sensor values to return when clicked. Interactive user interface request message. Message contains object ID, location of slider on object and sensor values to return when used. Event message when smart object is moving. Message contains object ID. Event message when smart object is static. Message contains object ID. 2. Architecture Messages Common to Multiple Applications RegisterDetectionService UnRegisterDetectionService RegisterProjectionService UnRegisterProjectionService ObjectRegistered RemoveObject UpdateObjectLocationOrientation SetButtonState SetSliderState SetObjectLocationOrientation Registration request when detection service starts. Message contains uniqueID of hardware. UnRegister request sent when service stops. Message contains uniqueID of hardware. Registration request when projection service starts. Message contains uniqueID of hardware. UnRegister request sent when service stops. Message contains uniqueID of hardware. Sent by Server when object registers. Message contains object ID. Sent by Database Server when object unregisters. Message contains object ID. Location and Orientation Hypothesis sent by detection application when object is detected. Message contains object ID, location as 3 numerical coordinates in world coordinate system and orientation as 4 numerical values of a quaternion. Button state (pressed or not pressed) sent by detection application. Message contains object ID and sensor values. Slider state (in use or not in use) sent by detection application. Message contains object ID and sensor values. Location and Orientation update for smart object sent by Database Server. Message contains object ID, location as 3 numerical coordinates in world coordinate system and orientation as 4 numerical values of a quaternion. 121 6.4 PROJECTOR-CAMERA SYSTEM CALIBRATION GetObjectLocationOrientation FocusOnObject FocusOnObjects Object Tracked Object Not Tracked Reset Request for location and orientation of single smart object to detection application. Message contains object ID. Sent by Database Server when it detects interaction with object. Message contains object ID. 
Sent by Database Server when it detects interaction with multiple objects. Message contains multiple object IDs. Event message when smart object is detected visually. Message contains object ID. Event message when smart object is not detected visually. Message contains object ID. Sent by Database Server to reset all applications to their initial state. 3. Messages between Database Server, Detection Application and Projection Application for colour correction RequestDetectionUID DetectionUID GetObjectSurfacesImage SetObjectSurfacesImage Sent by the Projection Service to the Database Server to determine which detection service to contact when requesting images for colour correction. Message contains object ID. Reply by Database Server. Message contains unique ID(UID) of detection service detecting object. Request from Projection Service to Detection Service for smart object surface image. Message contains object ID and surface name in 3Dmodel or 3D surface location. Reply message from Detection Service. Message contains object ID, surface name or surface location and encoded image. 4. Messages from Pan and Tilt Application to either, or both Detection and Projection Applications (depending on whether the hardware is co-located or separate) SetCameraLocationOrientation SetProjectorLocationOrientation SetPanTiltLocationOrientation 6.4 Sent by PanTilt application to detection application when orientation of steerable camera changes. Sent by PanTilt application to projection application when orientation of steerable projector changes. Sent by PanTilt application to Database Server when orientation of a steerable platform changes. Projector-Camera System Calibration The architecture incorporates two methods to calibrate steerable projector-camera systems. Calibration can be performed at any time by running a calibration application, which calculates both the projector intrinsic parameter matrix, and the projector-camera transformation matrix (extrinsic parameter matrix). These calculations are identical to those used to calibrate the camera in section 2.6.4; however, here we assume the camera calibration is known a-priori. Both methods use correspondences between known points in the 2D projector image and 3D locations detected in the camera coordinate system (i.e. real-world) for calibration. One important characteristic of these methods is that 122 6.4 PROJECTOR-CAMERA SYSTEM CALIBRATION they implicitly calibrate any lens-shift on the projector by calculating an off-axis projection frustum in the projector intrinsic parameter matrix. The first method uses ARToolkit markers as a 3D location system to provide 3D point locations during calibration [Kato and Billinghurst 1999]. The method displays a series of 10 crosses sequentially on the projector. These crosses are arranged as two identical sets of 5 crosses, as shown in Figure 6.11 (left). To perform the calibration we align the centre of the ARToolkit 80mm calibration marker with the 5 projected crosses at a far distance to the projector (e.g. 3m), then repeat the alignment again at a near distance to the projector (e.g. 1m). For each cross location a correspondence between the 2D cross location and the detected 3D marker location in the camera coordinate system is established. The 10 correspondences and camera intrinsic parameter matrix are used to solve the linear system of equations for the 2D-3D projection using SVD. 
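As a sketch of that linear system (standard DLT notation, mine rather than the thesis's), each correspondence between a projected cross x_i = (u_i, v_i) and its measured homogeneous 3D marker position X_i contributes two rows to a system A p = 0:

\[
\begin{bmatrix}
\mathbf{X}_i^{T} & \mathbf{0}^{T} & -u_i\,\mathbf{X}_i^{T} \\
\mathbf{0}^{T} & \mathbf{X}_i^{T} & -v_i\,\mathbf{X}_i^{T}
\end{bmatrix}
\begin{pmatrix} \mathbf{p}_1 \\ \mathbf{p}_2 \\ \mathbf{p}_3 \end{pmatrix}
= \mathbf{0}
\]

where p_1, p_2, p_3 are the rows of the 3x4 projection matrix. The 10 correspondences give a stacked 20x12 system, and p is taken as the right singular vector of A with the smallest singular value.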
The result matrix can then be decomposed into the projector intrinsic parameter matrix and the projector-camera transformation matrix. This calibration method takes about 10 minutes to perform and requires the user to manually align the marker with the projected crosses.
The second method is the automatic calibration method proposed by Park et al. using projected structured light [Park, Lee et al. 2006]. This is a modified version of the method presented by Zhang for camera calibration [Zhang 2000] using co-planar 3D points and their corresponding 2D points. In Zhang's method the 3D points detected by the camera are corners on a printed paper chess-board pattern of known 2D dimensions, whereas Park et al. project the chess-board pattern using the projector. A white planar surface must be placed where the projector and camera viewing frustums overlap for this calibration, and a camera-to-surface homography established. To calculate this homography, point correspondences between the camera image and surface can be established either by using vision-based detection of the corners of the planar surface, by detecting fiducial markers on the surface, or simply by identifying the corner pixels manually in the camera image. The homography is calculated using the methods described in section 2.3.6. As seen in Figure 6.11 (right), in the projector pattern image the corners m(x,y) are the 2D points and the projected points M(X,Y,0) correspond to the 3D points. The corners in the projected pattern image are detected using the standard computer vision algorithms used to detect the equivalent paper pattern for camera calibration [IntelOpenCV 2007]. The actual 3D locations of the projected points are then calculated with the following formula using the camera-to-surface homography H_{c-o} and the homogeneous corner coordinates c of the camera image:

\[ (X \; Y \; 1)^{T} = H_{c-o}\, c \]   (6.5)

The relationship between the 3D points (M) and 2D points (m) is then represented in homogeneous coordinates as:

\[ m = P M = H_{o-p} (X \; Y \; 1)^{T} \]   (6.6)

where H_{o-p} is the surface-to-projector homography. Zhang's method is then used with the calculated 2D and 3D correspondences as normal, to recover the projector-camera transformation matrix (P).
Figure 6.11 (left) Composite image of ARToolkit marker aligned with 5 far calibration locations and handheld for one of the near calibration locations, (right) Projector calibration using a projected pattern on a planar surface [Park, Lee et al. 2006]
For steerable projectors, if the projector Centre of Projection (COP) is at the pan and tilt unit Centre Of Rotation (COR), then following calibration we can invert the extrinsic parameter matrix to calculate the camera to pan-tilt coordinate system transformation and hence (with known mounting location and rotation values) calculate the world coordinate system location of objects in the camera field of view. Following calibration, the projector-camera transformation will only remain valid in systems where the projector-camera relationship is fixed (for example, a static projector and camera, or when the camera is attached to the projector in a moving-head steerable projector system). For moving-mirror steerable systems we can use an alternate calibration method proposed by Ashdown and Sato [Ashdown and Sato 2005].
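A sketch of the projected-pattern step described above using OpenCV calls (the helper name, signature and corner counts are mine): a camera-to-surface homography maps the detected chessboard corners into surface coordinates, giving the co-planar 3D points M(X,Y,0) of equation 6.5 for use in Zhang's method.

#include <opencv2/opencv.hpp>
#include <vector>

std::vector<cv::Point3f> projectedCornersTo3D(
    const std::vector<cv::Point2f>& surfaceCornersImage,   // 4 surface corners in the camera image
    const std::vector<cv::Point2f>& surfaceCornersSurface, // the same corners in surface coordinates (mm)
    const std::vector<cv::Point2f>& patternCornersImage)   // detected projected-pattern corners
{
    // Camera-to-surface homography H_{c-o} (established as in section 2.3.6).
    const cv::Mat Hco = cv::findHomography(surfaceCornersImage, surfaceCornersSurface);

    // (X Y 1)^T = H_{c-o} c for every detected corner c (equation 6.5).
    std::vector<cv::Point2f> onSurface;
    cv::perspectiveTransform(patternCornersImage, onSurface, Hco);

    std::vector<cv::Point3f> points3d;
    for (const cv::Point2f& p : onSurface)
        points3d.push_back(cv::Point3f(p.x, p.y, 0.0f));    // co-planar points: Z = 0
    return points3d;
}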
Similarly, for projector-camera systems where either the camera or projector (or both) is mobile, but where both intrinsic parameter matrices are known, we could use the imperceptible structured light techniques discussed in section 2.3.6 to continuously project a calibration pattern into a scene and hence attempt to recover the projectorcamera transformation. 6.5 Discussion In terms of the efficiency of the implementation, our approach currently does not detect objects in real time. Our architecture aims for a tracking-by-detection approach, but for the implementation we chose to integrate a particle filter [Pupilli and Calway 2006] to increase tracking frame rates following the first detection by the natural appearance algorithms. The benefit of using a particle filter over a Kalman filter in this case is that it allows us to model multiple alternative hypotheses, so it can automatically integrate detection results from multiple distributed cameras. It also better suits the nonlinear movement typically seen in handheld objects [VanRhijn and Mulder 2005]. Use of a tracking algorithm has the added benefit that we can de-couple the projection from the detection step by predicting future object motion from the pose history we maintain in the particle filter. The ability to predict object pose can be exploited by the projector, with its faster frame rate being used to “fill-in” projection on 124 6.6 DETECTION METHOD MEMORY EXPERIMENTS moving objects between camera frames. This is an issue as the lag of projection is very obvious for users during fast movement. For example, with a combined camera acquisition (33.3ms) and processing step (100ms) of 133.3ms and a typical human walking speed of 5kph the projection location would lag the handheld object by maximum of 18cm. However, reliance on prediction too far in the future can lead to other artefacts, such as lag in fast movement or swimming of the projected image. In the implementation we identified two limitations due to the Smart-Its sensor nodes used in smart objects. The first is that the wireless network bandwidth only allows a maximum of 2 smart objects to stream sensor data simultaneously while remaining synchronised with a 30Hz (33.3ms) camera refresh rate. This limitation is due to the fixed 13ms timeslots used for each node with the Smart-Its AwareCon networking protocol [Decker, Krohn et al. 2005]. However, this limitation only occurs if the smart objects are not powerful enough to process the state model and abstract sensor data to events so must stream raw sensor data to the Object Proxy. The second limitation is the maximum available 512KB of flash memory in the Smart-Its device. This limits the size of the Object Model; hence the amount of knowledge and content to project that can be stored in the object itself. We saw in the detection method memory experiments in section 6.6, that the amount of storage required to embed appearance knowledge already varies between 14.32 and 3223.50KB for just a single detection method. Consequently, our solution for larger appearance models or objects with large amounts of visual content to project is to assume availability of a network. We then either just embed a URL link to the whole Object Model in the smart object, or embed just the state definitions part of the Object Model in smart object together with URL links pointing to appearance knowledge and content to project on the network. 
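As a worked check of the lag figure quoted in the discussion above (my own arithmetic):

\[ d_{lag} = v_{walk} \times t_{lag} = \frac{5000\,\mathrm{m}}{3600\,\mathrm{s}} \times (0.0333 + 0.100)\,\mathrm{s} \approx 1.39\,\mathrm{m/s} \times 0.133\,\mathrm{s} \approx 0.185\,\mathrm{m} \approx 18\,\mathrm{cm}. \]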
6.6 Detection Method Memory Experiments The Object Model stores trained appearance data for the detection methods, descriptive knowledge about the object such as its 3D Model and its sensors. Additionally, it contains collections of sensor ranges which represent object states. There are 3 sets of states evaluated by the Object Proxy application – at least one appearance configuration (which changes the 3D model and appearance knowledge for the detection system), an optional movement configuration (which represents the moving and non-moving states when an object has movement sensors) and the content configuration (effectively the program logic) which is a set of states together with a link to URL of content to project and the location to project, for each state (see section 6.2.4, Chapter 7 and Appendix B for examples). In the framework architecture implementation the body of the Object Model is encapsulated in XML. However, the Object Model may also consist of additional files, such as the 3D model and appearance representations for the detection system in standard file formats (such as Mikolajczyk’s affine-invariant feature file format from [Mikolajczyk and Schmid 2002]). Additionally, the Object Model XML file can be split into multiple sub-files to allow sharing of commonly used knowledge and configuration between objects, such as the sensor information (which is identical for all Smart-Its devices), appearance knowledge for identical objects or display model information for two objects that project identical content (see the discussion in section 7.1.8). 125 6.6 DETECTION METHOD MEMORY EXPERIMENTS When a smart object enters the environment, the type and amount of appearance knowledge stored in the Object Model varies depending on the detection method used to train the appearance representation. The minimum appearance representation can be created from a single view of the object. A more comprehensive representation requires incorporation of multiple-views to allow detection of the object in more poses when an object is rotated (as shown in section 4.4 experiments). A full appearance model will incorporate views of every surface of the object, for example, from the whole viewing hemisphere. The best angle between views is dependant on the detection method. Each view adds more information to the appearance model, so the angle between views is a trade-off between object coverage and memory requirements. We aim to obtain the best coverage with the least memory requirement. For example, Lowe recommends using 60° increments between training images for the SIFT local feature algorithm [Lowe 2004]. Using this increment an object requires a total of 10 equally spaced viewpoints for just the upper viewing hemisphere, or 14 for a whole sphere. In contrast, as shown in the section 4.4 experiments, the repeatability of the colour detection algorithm does not reduce significantly with rotation, so for 3D objects, one viewpoint per object surface would suffice. This experiment aims to evaluate the average memory requirements to enable successful detection of an object with the cooperative framework architecture. 6.6.1 Design The detection method memory requirements were evaluated by averaging the size of the whole Object Models (including 3D model and any additional files) used in the detection experiments, demonstration applications and over all objects in the training library, for each method. 
The appearance knowledge results were split into minimum requirements, where we only hold training data for a single view of the object (such as the front surface), and maximum requirements, which we define as either a full viewing sphere (with 15 viewpoints at 60° increments) or, for the colour detection method, a representation with 6 viewpoints (imagine the object inside a cube, where the cube's 6 surfaces form the viewpoints). 6 views were chosen, as this would give a single view of each surface for 6 of the objects (the cubical objects) in the object appearance library (see section 4.3). However, in practice, uniformly coloured objects will require fewer viewpoints, as each histogram will be similar, allowing viewpoint combination.
6.6.2 Results
The mean size of the 26 VRML format 3D Models is 3.06KB. The mean size of the 26 Object Model files (including the 3D Models and all display configurations, but excluding appearance knowledge) is 22.65KB. The mean size of the 101 JPEG format images (with 95% quality) for projection at native projector resolution (1024x768) is 215KB per image. As can be seen from Table 6.3, the Local Feature appearance description required the most storage, followed by the Shape appearance. The Texture appearance required the least storage. This ranking is identical for both a single viewpoint and a full viewing sphere.
Table 6.3 Memory requirements for Appearance Knowledge and the total Object Model with single viewpoint (minimum) and full viewing sphere (maximum). Values are Appearance Knowledge minimum / maximum, then Total Object Model minimum / maximum, in KB.
Local Features (SIFT): 214.90 / 3223.50; 237.55 / 3246.15
Texture (MagLap 32x32 Histogram): 14.32 / 210.00; 36.97 / 214.80
Colour (LAB 16x16x16 Histogram): 41.00 / 246.00 (6 viewpoints); 63.65 / 268.65
Shape (100 points): 59.98 / 899.66; 82.63 / 922.31
6.6.3 Discussion
The results indicate that the largest use of memory for all objects is generally the appearance knowledge (between 14.32 and 3223.50KB, compared to an average of just 22.65KB for the rest of the Object Model). However, the total memory requirement for the smart object will also vary depending on the number of projections (hence the number of content configuration states in the Object Model) and whether the content to project is stored in the object with the Object Model. For example, the scenario described in Appendix B.1, in which a Chemical Container detects rough handling, contains only 4 content configuration states and two images for projection requests. The Object Model with a single local feature appearance viewpoint of the front of the container is 202KB. With two images (430KB) the total size is 632KB. This can easily be reduced below 512KB to fit in a Smart-Its device, either by storing the images on the network and embedding only a URL link to them in the Object Model (as discussed in section 6.5), by storing the images with higher compression ratios, or by reducing image resolution. In contrast, the smart photograph album described in section 7.2 contains many content configuration states for different projections, which depend both on the state of the light sensor and the projected buttons. The total Object Model size with two local feature appearance viewpoints and 3D models (front cover and inside) is already 898.3KB. The 18 photographs used in the demonstration add 2325KB, for a total size of 3223.3KB (3.14MB).
This size of Object Model cannot be stored in the 512KB memory of a Smart-Its device. Consequently, we store only a URL to the Object Model and its associated files in the smart object and assume the files are available on the network. While this limitation may be overcome in the future with increased memory capacity in smart objects, the actual transmission of large Object Models (on the order of Megabytes) from the Smart Object is still a big issue. The communication ability of the current generation of sensor nodes is typically limited, due to power, processing and network synchronisation constraints; hence, the transmission of large files would not be reliable, or fast. For example, a particle Smart-Its device using the Awarecon protocol transmits 64 bytes every 13.11ms at full throughput (equivalent to 4.88KB/s). Hence a 3MB Object Model would take a minimum of 10.74 minutes to send to the projectorcamera system, assuming no other devices are also transmitting. Similarly, an 802.15.4 127 6.7 POSE CALCULATION, POSE JITTER AND PROJECTION ACCURACY EXPERIMENTS node such as the imote2 with TinyOS and default radio parameters has an average transmission rate around 250 packets per second, each with 28 bytes (equivalent to 7.00KB/s). Hence, a 3MB Object Model would take a minimum of 7.49 minutes to send, assuming no other nodes are also transmitting. 6.7 Pose Calculation, Pose Jitter and Projection Accuracy Experiments These experiments investigate the pose and projection accuracy an operational system can expect when a target object has been detected in the image. These factors are especially of interest in Augmented Reality and Mixed Reality systems as the virtual image is overlaid either on a view of the physical world, or on the actual physical world with projector-based AR. Mis-registration of the image, or movement of the augmenting image relative to a static object in the real-world (jitter) is immediately detectable by an observer and can distract from the augmentation content itself. We perform experiments to answer the following research questions: R1) How accurately can an object’s location and orientation (pose) be calculated? R2) How accurately can a projection be located on a planar object’s surface when it is orthogonal to the projector? R3) What jitter in the calculated pose and projection can be expected? 6.7.1 Design An accuracy and jitter evaluation dataset was created as part of the experimental procedure to answer questions R1 to R3. The dataset contains the 5 objects from the object appearance library (see Chapter 4) with uniformly planar surfaces. These objects were the Book, Product box, Card, Notepad and Cereal box. The objects, were placed at 6 locations arranged as a grid in X,Y,Z camera coordinate space, with the object front surface parallel to the camera sensor as shown in Figure 6.12. The projector-camera system was placed in a static location at 3m distance from the grid, orthogonal to the object XY plane. The object was moved ± 0.5m from this location in the X and Y axes in turn. The grid and movement distance were chosen to maximise the movement size, while ensuring even the largest object was completely within the field of view of the projector at 3m. At each location 100 images of each object were captured and manually annotated with the object bounding box (6*5*100=3,000 images). The camera, lens and image capture resolution were identical to those in the object appearance library in Chapter 4. 
We calculate that with a 12mm fixed camera lens, 2/3” camera sensor and 1280x1024 resolution the 40.27° camera Horizontal Field Of View (HFOV) gives each camera pixel has an effective horizontal resolution of 0.03146° (1 arc-minute, 53 arcseconds). For a mid-zoom setting (20) of the 1024x768 resolution projector, the HFOV is 26.42°, and Vertical Field Of View (VFOV) is 19.32°. Each pixel has a measured size of approximately 1mm2 at 3m. 128 6.7 POSE CALCULATION, POSE JITTER AND PROJECTION ACCURACY EXPERIMENTS Figure 6.12 The location accuracy and jitter experiment object grid locations, orthoganal to the camera. Object test locations are the green circles. 6.7.2 Procedure 6.7.2.1 Research Question R1 The investigation of question R1 is split into two sub-experiments. The first investigates orientation accuracy using the rotation images of the 5 objects from the object appearance library. This sub-experiment evaluates both 2D rotation in the camera plane (rz) between 0° and 350° and general 3D rotation (rx,ry) – in this case around the object’s Y axis between -70° to +70°. Both exclude the 0° orientation as this pose closely matches the one used for training. The detection system was trained using images of the corresponding object’s front surfaces at 3m, from the scale image set. For both the 2D rotation and 3D rotation images a local feature detection step was performed on each test image of the objects, constrained to detect features only inside the object bounding box (simulating a near perfect detection system). The detected features were matched to the training image using Sum-of-Squared Distances (SSD) nearest neighbour matching, establishing feature correspondences. The pose calculation algorithm was provided with the corresponding image and object features and the camera calibration. The pose results and error from the manually measured pose was recorded. Failed detections were excluded from the error calculation (2D rotation had no failures, while for 3D rotation the rate was: Book 33.3%, Box 40%, Card 40%, Notepad 40%, Cereal Box 60% of the 70° to +70° range, with all failures evenly split between the two rotation extremes). The second sub-experiment investigating location accuracy in the pose calculation algorithm is part of the following procedure to also answer R2 and R3. 6.7.2.2 Research Questions R2 and R3 Initially, the projector intrinsic parameters (optical parameters) and the relative extrinsic pose between projector and camera were calibrated using the method discussed in section 6.4, with the projector lens zoom set to 20 (mid-zoom) and the camera and projector image focused at 3m. Then, for each of the 5 objects and at each of the 6 object locations in the accuracy and jitter evaluation dataset we perform the method described below. 129 6.7 POSE CALCULATION, POSE JITTER AND PROJECTION ACCURACY EXPERIMENTS The detection system was trained using an image of the corresponding object at 3m (the centre of our working range), from the scale images in the object appearance library. A local feature detection step was performed on each of the 100 captured test images of the object, constrained to detect features only inside the object bounding box (simulating a near perfect detection system). The detected features were matched to the training image using SSD nearest neighbour matching, establishing feature correspondences. The pose calculation algorithm was provided with the corresponding image and object features and the camera calibration and the pose results recorded. 
To answer question R1, the median pose over the 100 images was used when calculating the pose error for each location, to remove location jitter. The pose error was calculated as the difference between this median pose and the position of the object measured manually with a tape measure in the X,Y,Z camera coordinate system. We assume the manual measurements are accurate to ±5mm. Failed detections were excluded from the error calculation (only the Notepad had failures, at a rate of 6/600 = 1%).

To answer question R2, the median pose calculated over the 100 images at each object location was also used to project onto the front surface of the object as the accuracy evaluation dataset was being created. The projection image was a bounding box drawn with a 1-pixel-wide line, exactly the size of the object. The 2D (X,Y) offset of the projected image corners relative to the physical object front corners was measured with a ruler. We assume this measurement is accurate to ±1mm.

To answer question R3, we use the individual pose results from the 100 images of the location accuracy experiment. Pose location jitter is calculated as the difference between the individual poses in the 100 images and the median pose for the location. These results are averaged over each location, for every object. For projection jitter we project the bounding box image for each of the 100 calculated poses in sequence, while measuring and recording the 2D (X,Y) offset of the projected image corners relative to the physical object with a ruler. We calculate the median error of the offset results. We assume this measurement is accurate to ±1mm.

6.7.3 Apparatus

A 3.4GHz dual-core Pentium 4 computer running Windows XP SP2 was used for all experiments. The pose calculation and projection geometry correction algorithms were implemented in C++, using the Intel OpenCV library for image processing and the OpenSG scene graph to create the image for projection (see section 6.3.2 for more implementation detail).

6.7.4 Results

The 95th percentile measurements in the tables below give the error value at which the corresponding cumulative distribution reaches 0.95, i.e. 95% of all measurements have an error equal to or smaller than that value.

6.7.4.1 Research Question R1

As can be seen in Table 6.4, the median orientation error and 95th percentile results over all objects are similar for both 2D in-plane rotation (Mdn=0.93°, P95=1.26°) and general 3D rotation (Mdn=1.02°, P95=1.60°). The most accurately detected 2D in-plane rotation was for the Book object (Mdn=0.56°), with the least accurate being the Box (Mdn=1.33°). The most accurately detected 3D rotation was for the Card object (Mdn=0.51°), with the least accurate being the Notepad (Mdn=1.60°).

Table 6.4 Median rotation error results for each object, over -70° to +70° out-of-plane and 10° to 350° in-plane.

Object           2D Rotation Error (°)   3D Rotation Error (°)
Book             0.56                    1.59
Box              1.33                    1.03
Card             0.84                    0.51
Notepad          0.93                    1.60
Cereal Box       1.00                    0.72
Median Error     0.93                    1.02
95th Percentile  1.26                    1.60

As can be seen in Table 6.5, over all objects the Z-axis (distance to object) errors (Mdn=12.13mm) were generally higher than the X and Y axis errors (Mdn=4.70mm and 7.15mm respectively). Similarly, the variation in the Z-axis errors between objects was greater than in both the X and Y axes, as demonstrated by the larger 95th percentile value (P95=23.63mm).
In terms of combined 3D error from the true location, the most accurately detected object was the Cereal Box (Mdn=8.66mm), while the least accurately detected was the Book (Mdn=26.82mm). The median combined 3D location error of the detection and pose calculation system over all objects was 14.25mm (P95=25.69mm).

Table 6.5 Median location calculation error in the X,Y,Z camera coordinate system for each object, averaged over all grid locations.

Object           X Error (mm)   Y Error (mm)   Z Error (mm)   Combined 3D Error (mm)
Book             7.22           5.14           25.31          26.82
Box              4.97           11.76          16.87          21.16
Card             2.19           7.15           12.13          14.25
Notepad          2.60           8.11           5.46           10.14
Cereal Box       4.70           6.12           3.94           8.66
Median Error     4.70           7.15           12.13          14.25
95th Percentile  6.77           11.03          23.63          25.69

6.7.4.2 Research Question R2

Table 6.6 Median projection location error on the object X,Y front surface plane for each object, averaged over all grid locations.

Object           X Error (mm)   Y Error (mm)   Combined 2D Error (mm)
Book             7.75           9.50           12.26
Box              18.00          11.00          21.10
Card             6.25           20.25          21.19
Notepad          7.50           15.00          16.77
Cereal Box       3.50           10.25          10.83
Median Error     7.50           11.00          16.77
95th Percentile  15.95          19.20          21.17

As can be seen in Table 6.6, for projection the error in the X-axis is generally lower than in the Y-axis, except for the Box object (Mdn=18.00mm). However, the variation in the error between objects is lower in the Y-axis than in the X-axis. The object with the most accurate projection was the Cereal Box (combined 2D error Mdn=10.83mm), while the Card was the least accurate (Mdn=21.19mm). The median 2D location error of the projection over all objects was 16.77mm (P95=21.17mm).

6.7.4.3 Research Question R3

As can be seen in Table 6.7, over all objects the 1.64mm median jitter in the Z-axis (distance to object) is much greater than in both the X and Y axes (Mdn=0.20 and 0.26mm respectively). Similarly, the variation in the Z-axis jitter between objects was greater than in both the X and Y axes. The Cereal Box object has the highest jitter around the median location (4.15mm median 3D jitter), while the Card has the lowest (1.20mm median 3D jitter). Over all objects, the median combined 3D jitter around the median object locations was 1.65mm (P95=4.04mm).

With the mid-zoom projector setting, the size of 1 projector pixel was 1mm² at 3m distance from the projector. The projection jitter for the Card, Notepad and Cereal Box objects was observed to be less than 1mm, too small to measure accurately with a ruler. Consequently, it was estimated as 0.5mm in both the X and Y axes for these three objects.

Table 6.7 Median 3D object location jitter from the median location for each object, averaged over all grid locations.

Object           X Jitter (mm)   Y Jitter (mm)   Z Jitter (mm)   Combined 3D Jitter (mm)
Book             0.20            0.26            1.60            1.63
Box              0.98            0.86            3.65            3.87
Card             0.17            0.07            1.18            1.20
Notepad          0.13            0.08            1.64            1.65
Cereal Box       0.98            0.63            3.98            4.15
Median Jitter    0.20            0.26            1.64            1.65
95th Percentile  0.98            0.81            3.92            4.04

Table 6.8 Median projection location jitter from the median location on the object front surface plane for each object, averaged over all grid locations.

Object           X Jitter (mm)   Y Jitter (mm)   Combined 2D Jitter (mm)
Book             1.00            1.00            1.41
Box              11.00           1.00            11.05
Card             0.50            0.50            0.71
Notepad          0.50            0.50            0.71
Cereal Box       0.50            0.50            0.71
Median Jitter    0.50            0.50            0.71
95th Percentile  9.00            1.00            9.12

As can be seen in Table 6.8, the median projection jitter from the median pose is generally equal in the X and Y axes, except for the Box object.
Additionally, the variation in the X-axis jitter between objects is much higher than in the Y-axis, as can be seen from the 95th percentile results (P95=9.00mm for X, 1.00mm for Y). The objects with the least jitter were the Card, Notepad and Cereal Box (all estimated at 0.71mm of 2D jitter), while the Box had the highest (Mdn=11.05mm). The median combined 2D jitter of the projection from the median object locations, over all objects, was 0.71mm (P95=9.12mm). This is approximately equivalent to 1 projector pixel of jitter at 3m.

6.7.5 Discussion

6.7.5.1 Research Question R1

To address R1 we performed two sub-experiments measuring location and orientation calculation accuracy following object detection. The orientation experiment was split into 2D and 3D object rotation; however, the results (Table 6.4) from both parts of the experiment were almost identical. This suggests that the magnitude of the orientation calculation error is consistent, irrespective of the type of rotation the object undergoes. The orientation calculation was on average accurate to 1.02° or better. This does not quite meet the rotation aim we set in section 6.3 (a maximum of 1° median error).

As can be seen in the location experiment results (Table 6.5), the distance-to-object (Z) error is significantly higher than the error in the camera X,Y plane. This is due to the difficulty monocular camera systems have in calculating depth. Monocular systems rely on the perspective effect captured by the pinhole camera model: the distance between features on the object increases the closer the object is to the camera, and decreases when the object moves further away. Due to the finite resolution of the camera, the exact pose therefore becomes more uncertain at larger distances to the object; even a single-pixel change in the location of features relative to each other can make a large difference to the estimated depth. Stereoscopic camera systems alleviate this problem to some extent, as they can use a known baseline separation between the cameras to triangulate the features. However, although increasing the stereo baseline increases the Z accuracy, it trades off against the usable tracking volume, as both cameras require the same features in their field of view.

6.7.5.2 Research Question R2

The experiment to address R2 aims to understand how accurately the combined system (detection, pose calculation and projection) can project onto objects in the real world. Hence, the projection error shown in Table 6.6 is a combination of the accuracy of the camera intrinsic parameter calibration, the location error from the pose calculation, the accuracy of the projector intrinsic parameter calibration, and the accuracy of the calibrated transformation between the camera and projector (extrinsic parameters). The higher error but lower variation in the Y-axis relative to the X-axis suggests a possible calibration error producing a fixed offset in the Y-axis. Any error in the Z-axis (distance to object) from the pose calculation will appear as a scaling error in the projected image, increasing the measured X,Y errors. Consequently, the error in location and projection is best compared using the respective combined 3D and combined 2D error figures.
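To make the relationship between depth error and lateral projection error concrete, the following minimal sketch computes the lateral offset produced by a given Z error for a point a few degrees off the projector axis, assuming simplified pinhole geometry. The function name and geometry are illustrative simplifications; section 6.7.5.3 below quotes 0.43mm for the 5mm, 5° case under the thesis geometry.

```cpp
#include <cmath>
#include <cstdio>

// Lateral (X or Y) displacement of a projected point caused by an error in
// the estimated depth Z, for a point lying at angle thetaDeg off the
// projector's optical axis (simplified pinhole geometry).
double lateralErrorMm(double zErrorMm, double thetaDeg) {
    const double kPi = 3.14159265358979323846;
    return zErrorMm * std::tan(thetaDeg * kPi / 180.0);
}

int main() {
    // 5mm depth error for a point 5 degrees off-axis:
    // ~0.44mm with this simplified geometry (the thesis quotes 0.43mm).
    std::printf("%.2f mm\n", lateralErrorMm(5.0, 5.0));
    return 0;
}
```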
In this case, the results show that, following an accurate calibration of both projector and camera intrinsic and extrinsic parameters, the median combined 2D projection error (16.77mm) is similar to the median combined 3D error in the pose calculation step (14.25mm). This suggests that the major source of error in the projection is the pose calculation step. The 16.77mm median projection location error we achieve with the combined system meets the original implementation aim in section 6.3 (a maximum of 20mm median error).

6.7.5.3 Research Question R3

To address R3 we measured the jitter of individual poses from the median pose for both the location pose calculation (Table 6.7) and the projection (Table 6.8). As can be seen when comparing Table 6.5 and Table 6.7, objects with higher combined 3D location errors generally have higher jitter (with the exception of the Card object). Additionally, comparing Table 6.7 and Table 6.8 shows that the combined 3D location jitter (Mdn=1.65mm) is much higher than the combined 2D jitter in the projection (Mdn=0.71mm). Looking closer at the location jitter in Table 6.7, we see that (similar to the location error in Table 6.5) the largest component of the 3D jitter is error in the Z-axis (Mdn=1.64mm).

Figure 6.13 Error in the Z-axis location of an object at a large distance from the camera causes only a small X,Y error in the projection.

The reason this large Z-axis jitter does not translate into large jitter in the projection X and Y axes is that, when objects are at a large distance from the projector, error in the Z-axis contributes much less to the projected position than error in the X and Y axes. For example, with an object at 3m and 5° off the camera axis, an error of 5mm in the Z-axis leads to only 0.43mm error in both the X and Y axes of the projection, as illustrated in Figure 6.13.

The results from these experiments are representative of the average accuracy obtainable with planar objects and the current camera and projector optical characteristics. Varying the camera lens or projector optical parameters will vary the results. For example, increasing the focal length of the camera lens (zooming in) allows higher accuracy in the pose calculation due to the higher spatial resolution; however, it trades off field of view, so large objects may no longer fit in the camera image.

6.8 Conclusion

In this chapter we presented an architecture design and implementation validating the concept of Cooperative Augmentation. The implementation is a working system which enables us to investigate different aspects of the architecture and the Cooperative Augmentation concept. It enables augmentation of smart objects with a display capability, without changing their natural appearance, using projector-camera systems. Our approach can locate and track mobile objects in the environment, align the projection with the object's surfaces and correct for surface colour so the display appears undistorted and visible to a user. The main challenges in our approach are visual detection of smart objects, keeping the projection synchronised when the object is moved or manipulated, and correcting the projection for non-ideal surface colours and textures.

Our system implementation is currently geared more towards accuracy in detection and pose calculation; for example, we use a high-resolution machine vision camera with 1280x1024 pixel resolution.
However, there is a trade-off between accuracy and detection runtime; hence, our system does not currently achieve real-time detection or tracking. We will never achieve good accuracy from a poor camera image (e.g. from a 320x240 pixel web-camera); however, real-time detection with high-resolution cameras can eventually be achieved through hardware or implementation improvements. One major improvement could readily be made by converting more of the detection processing to use the GPU.

From the pose calculation and projection experiment we determined that our system calculates the pose of planar objects with a median 3D error of 14.25mm from the correct position. The combined system can also project onto a planar object with a median 2D error in the location of the projection of around 16.77mm. While this accuracy is sufficient for large objects such as the chemical container (see section 4.3), it may be more problematic for small objects. A large error on a small object such as the cup can offset the projection enough that part of the display is not visible; hence, the importance of projection accuracy depends to some extent on object size. To help in detecting small objects, it may be useful to further characterise the detection algorithms and, since object size can be determined from the 3D model, to prioritise algorithms with higher detection performance on small objects during method selection.

The jitter experiments showed that although the median 3D pose jitter was 1.65mm, in the combined system projection only a median of 0.71mm jitter was observed, because error in the Z-axis (distance to object) contributes significantly less to the projection jitter than error in the X and Y axes. While small, this jitter is visible in active projections when observed at close range. Jitter is also visible when objects are static, for the reasons discussed in section 2.6.5. However, if smart objects possess movement sensing capability, it is possible to smooth the pose to remove jitter only when we know the object is static. This avoids the lag on instantaneous movement typically associated with continuous smoothing.

In section 6.5 we found that smart objects are typically limited in memory; hence, we studied the memory requirements for storing the Object Model. We determined that the largest component was either the object appearance description or the content for projection (e.g. images and video), but that this was application-specific. These findings are important, as the network transmission time is typically prohibitive for large Object Models. Hence, full appearance models with large or complex amounts of content to project must be stored outside the smart object on the network. This is not a problem as long as the object maintains a link to the location of the content on the external network resource and the external resource remains valid.

Chapter 7 Demonstration Applications

Three applications were developed to demonstrate the capability of the system architecture. They cover three distinctly different areas of the architecture and illustrate some of the issues involved in developing applications with it:

1. A smart chemical container demonstrator, illustrating the dynamics of the whole cooperative process from initial object registration to knowledge updating.
This uses pose results from the vision-based detection system to infer whether the smart container object is stored in the correct location and with the correct chemicals. A warning projection is requested if the container is detected in an incorrect location.

2. A smart photograph album demonstrator, illustrating interaction methods in the architecture. Here we concentrate on three interaction methods, allowing manipulation of the object location, manipulation of the object geometry and interaction with projected buttons to browse photographs.

3. A smart cooking scenario, illustrating multiple projector-camera systems augmenting multiple cooperating smart objects in order to provide the user with context-dependent recipe instructions to accomplish a cooking task.

7.1 Smart Cooperative Chemical Containers

This demonstrator aims to illustrate the dynamics of the whole Cooperative Augmentation process by following a smart chemical container object from registration on initial entry into the environment, through first detection, use of embedded sensing, projection, interaction and knowledge updating, to final exit from the environment. To motivate this demonstrator we use an industrial goods warehouse scenario in which two smart chemical containers enter a warehouse, and we would like the containers to warn an employee visually when a container is stored outside the correct storage area. We use an approach identical to that presented by Strohbach, Gellersen et al., where safety-critical storage areas are monitored by the smart chemical containers themselves [Strohbach, Gellersen et al. 2004]. Here, rules for the safe handling of objects and materials are embedded in the objects, and objects cooperate to detect hazardous situations (such as being stored next to reactive chemicals) using embedded sensing.

7.1.1 Object Model

The Object Model appearance knowledge is initially trained with images of the Chemical Container at 3m, from the scale images in the object appearance library (see Chapter 4). The LAB colour detection algorithm is run on the annotated image area containing the object to extract a 16x16x16 bin LAB histogram. For this demonstrator we designate a 3D volume in the environment with a 1m² footprint and 0.5m height as an approved storage area (X=1 to 2m, Y=0 to 0.5m, Z=1 to 2m). States are defined for when the object is in the correct storage area and when it is in an incorrect area, as shown in Figure 7.3. An identical Object Model is embedded in all Chemical Container objects.

7.1.2 Registration

As shown in Figure 7.1 (left), an employee enters the environment with two chemical container smart objects. The objects enter the proximity of the projector-camera system, detect the presence of a display service and register. This process transfers the Object Model from the object to the projector-camera system. The projector-camera system registers the objects and returns a confirmation message to the containers. On receipt of this message the containers begin sending sensor events to the projector-camera system. In this case they are being carried by the employee, so the embedded accelerometers generate movement events.

7.1.3 Detection

The registering objects trigger the detection process in the projector-camera system. Here the challenge is to simultaneously detect mobile or static objects and distinguish between objects with similar appearances.
Figure 7.1 (left) New objects arrive in the environment, (centre) an employee walks with the containers, (right) the employee places one object on the floor.

The steerable projector now rotates from its current position to search the environment. As the objects have just entered, the system does not know their location. Consequently, the projector system uses a creeping line search pattern with a horizontal major axis to thoroughly search the whole environment. The projector uses the appearance knowledge embedded in the Object Model and the sensor events to configure its detection process. In this case the containers store knowledge of a colour histogram, and sense that they are moving. This knowledge triggers the method selection step to choose the colour and movement detection processes. The movement process generates a motion mask, which is used by the colour detection process to constrain its search for the object by masking the back-projection result of the object's colour histogram.

As the two chemical containers look identical, two possible objects are identified in the image. It is not currently possible for the camera to distinguish between them. Consequently, the steerable projector tracks the moving areas in the camera image by centring on their centre of gravity. Both objects generate movement event messages while they are being carried by the employee. However, when the employee places one of the containers on the floor, as shown in Figure 7.1 (right), that container's movement sensors stop sending movement events. The projector-camera system now detects only one moving area and can differentiate between the objects directly based on sensing. A 3D location and orientation is now calculated (for an example see Figure 7.2) and sent wirelessly to the containers, completing the Object Model.

7.1.4 Projection

Once an object's 3D location and orientation has been calculated by the projector-camera system, the object can request projection of content on its surfaces. Here the challenge is to correct the projection for the orientation of the object and for variations in its surface colour, to ensure the most undistorted and visible projection. In this case, the container detects from the location it was sent that it was put down in the wrong storage area and requests that a warning message be projected (see Figure 7.2). The projector-camera system projects the warning message on the front surface of the container with geometric correction, so that it appears undistorted.

Figure 7.2 (left) Detected container with green wireframe 3D model superimposed on the camera view using the calculated pose, (right) a warning message projected on two chemical containers.

7.1.5 Manipulating the Object

When projecting onto objects, the object can respond to sensed manipulation or network events by dynamically modifying the projected content. The challenge here is to keep the projection aligned with the object as it is manipulated or moved. The employee sees the projected message and picks up the object. The detection process continues to track it and generate 3D location and pose information. Consequently, the message appears to remain fixed to the object's surface as long as the surface is visible to the projector system. When the object is in the correct area it requests that the projection stop. The employee puts down the container and sees the message disappear. The projector-camera system continues tracking the objects.
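A minimal OpenCV-based sketch of the movement-masked colour detection described above: the object's 16x16x16-bin LAB histogram is back-projected into the current frame, and the result is masked by a simple frame-difference motion mask. The function names, the motion threshold and the use of the modern cv:: API are illustrative assumptions; the thesis implementation used the earlier Intel OpenCV API.

```cpp
#include <opencv2/opencv.hpp>

// Build a 16x16x16-bin histogram of the object in Lab colour space from a
// training image region (the annotated bounding box).
cv::Mat labHistogram(const cv::Mat& trainingBgr, const cv::Rect& objectBox) {
    cv::Mat lab;
    cv::cvtColor(trainingBgr(objectBox), lab, cv::COLOR_BGR2Lab);
    const int   histSize[] = {16, 16, 16};
    const float range[]    = {0, 256};
    const float* ranges[]  = {range, range, range};
    const int   channels[] = {0, 1, 2};
    cv::Mat hist;
    cv::calcHist(&lab, 1, channels, cv::Mat(), hist, 3, histSize, ranges);
    cv::normalize(hist, hist, 0, 255, cv::NORM_MINMAX);
    return hist;
}

// Back-project the histogram into the current frame and keep only pixels
// that have also changed since the previous frame (simple frame differencing).
cv::Mat movementMaskedBackProjection(const cv::Mat& frameBgr,
                                     const cv::Mat& previousBgr,
                                     const cv::Mat& hist) {
    cv::Mat lab;
    cv::cvtColor(frameBgr, lab, cv::COLOR_BGR2Lab);
    const float range[]   = {0, 256};
    const float* ranges[] = {range, range, range};
    const int channels[]  = {0, 1, 2};
    cv::Mat backProj;
    cv::calcBackProject(&lab, 1, channels, hist, backProj, ranges);

    cv::Mat diff, motionMask;
    cv::absdiff(frameBgr, previousBgr, diff);
    cv::cvtColor(diff, diff, cv::COLOR_BGR2GRAY);
    cv::threshold(diff, motionMask, 25, 255, cv::THRESH_BINARY);  // illustrative threshold

    cv::Mat masked;
    backProj.copyTo(masked, motionMask);   // zero out regions that did not move
    return masked;
}
```

The masked back-projection would then be thresholded and the centres of gravity of its connected regions used to steer the projector, as described in the detection narrative above.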
Figure 7.3 Partial Object Model representation of the Chemical Container demonstrator (sensor list, 3D model, movement and content configurations, and the sensor-state rules for the correct and incorrect storage locations).

7.1.6 Knowledge Updating

If objects enter the environment with only partial knowledge of their appearance, their knowledge can be increased over time by performing extra detection processes and re-embedding the result into the Object Model. The challenge here is how to make the knowledge extraction accurate, given that the initial knowledge was incomplete. The two containers entered the environment with knowledge only of their colour, so the projector-camera system extracts more appearance knowledge over time. In this case, the SIFT algorithm [Lowe 2004] is used to detect scale- and rotation-invariant features on the object just put down, as shown in Figure 7.2. The SIFT descriptors are calculated on small image patches around the detected interest points. The resulting 128-value feature vectors are mapped from the image to locations on the object's 3D model using the known 3D location and orientation of the container. If the object is manipulated so that it is rotated from its original pose, new features will be detected as they come into view. The projector-camera system manages the Object Model's local feature database, adding new features or updating the database if the object's appearance changes. The new local feature appearance knowledge is sent to the smart containers to be embedded in the Object Model and used for faster, more accurate detection in future.

7.1.7 Objects Departing the Environment

When objects depart the proximity of the projector-camera system, their virtual object representation is removed by the projector system and the projector is free to track other objects. Here, the employee moves to the exit with the container that was never put down. This container continues to generate motion events. As there are no other moving objects or active projections, the projector system tracks the carried object, as shown in Figure 7.4.
As the employee exits through the door with the object, the system loses sight of it and it no longer responds to messages from the projector-camera system. After a short time-out the system assumes the object has departed the environment. The projector-camera system then returns to the last-known position of the other container objects. If no objects are detected, the projector system begins an expanding square search pattern centred on their former locations.

Figure 7.4 (left) Scale- and rotation-invariant local features detected on chemical containers, (right) a container leaves the environment with the employee.

7.1.8 Discussion

The chemical container demonstrator provides important validation for our architecture and approach. Firstly, it successfully illustrates visual display on an object both while moving and when static; in both cases the projection appears fixed to the display location on the object. Secondly, it illustrates the interactivity of our approach, as the projected display changes automatically, based on the states defined in the Object Model, when the user moves the object to the correct location. Thirdly, it illustrates the benefit of embedded sensing in the detection process: here, movement sensing makes the detection more robust when there are two identical objects, by constraining the detection and masking out the distracting object. Fourthly, we demonstrate the knowledge extraction and updating process by detecting local features and re-embedding this appearance knowledge into the object after initial detection with the colour cue. Fifthly, we show that the combined detection, pose calculation and projection output is accurate enough for the display to be visible on the barrel, both when it is mobile and when static.

7.1.8.1 Projection Resolution

On a chemical container with an area of 200x130mm suitable for projection, the achievable projector resolution with a fixed mid-zoom setting (20) and the object at 3m is around 1 pixel per millimetre. This allows a resolution of 200x130 pixels on the barrel when it is orthogonal to the projector. A typical font can be created from a minimum of 8x8 pixels, allowing a maximum of around 25x16 text characters to be displayed. However, any geometric correction required due to the relative pose of the container surface will reduce this resolution further. Consequently, the projection is more suitable for large-text warning messages, images and video than for large blocks of text. Changing the focal length of the projector (by zooming) to achieve a higher-resolution projection is possible, but has the trade-off of reducing the field of view.

7.1.8.2 Object Cooperation to Overcome Occlusion

A safety-critical system requires certain attributes, such as guaranteed availability, reliability, integrity and dependability. The scenario presented by Strohbach, Gellersen et al. [Strohbach, Gellersen et al. 2004] relies on the containers themselves to sense when hazardous situations occur, using only ultrasound sensors on the objects and cooperation with other objects via a wireless network. Interactive projection in a safety-critical storage scenario faces the problem of guaranteed availability, as projector-camera systems rely on a direct line of sight to the object. Occlusion of the object prevents both detection and augmentation with displays. However, in this case, it would be possible for objects to cooperate and achieve a display.
For example, imagine a scenario where two identical objects exist in the same environment: the first object senses it has been stored outside an approved area, but is occluded from the projector-camera system by the other object, which is stored in front of it. In this case the framework is aware of the first object, but cannot detect it or project a warning message on it; it is only able to detect and project on the second object. To enable the first object to use the second as a display surface, we introduce the abstraction of an Object Class to our framework. This allows smart objects to share common attributes in their Object Model (similar to the Object-Oriented Programming (OOP) paradigm). Shared attributes could be shape and appearance (such as the smart chemical containers), common functionality (such as two smart cups with different appearances), or membership of a common task (such as individual smart ingredients in a recipe task).

7.1.8.3 Object Knowledge Sharing

We allow objects in the same class to share knowledge using the same mechanism with which objects share sensor data. This would allow a brand-new object to enter the environment and query the network to see if any other objects of the same class exist. If an object of the same appearance is present, the new object can request parts of the Object Model (or its URL), such as recently trained appearance knowledge, to help overcome the hurdle of initial detection. The smart object is thus composed of public and private data. Public data comprises the Object Model components shared with the projector-camera system. Private data consists of the raw sensor values and the program logic encoded in the Object Model to change the content model, the movement model and the geometry model. The private data remains with the physical object or its Object Proxy.

The occluded chemical container object in our scenario now queries the network to see if any objects of the same class exist. The visible smart object responds, and the occluded object can query its location. If an object knows its current location, it is visible to the projector-camera system and can support projection. We allow objects in the same class to modify each other's projections only if they have the same geometry and no other projection is currently active. In this case the two objects are identical, so the occluded chemical container requests a warning projection on the visible container. The occluded container monitors its location, and as soon as it is visible, or in the correct storage area, it removes the projection from the visible object.

Allowing objects to query other objects' locations and sensors gives flexibility in application design. For example, it allows modelling of spatial relationships between objects (such as the distance or angle between two objects, or whether one object is to the left of, to the right of, in front of or behind another), as described by Kortuem et al. [Kortuem, Kray et al. 2005]. These relationships form context in the environment containing the objects. This context allows us to dynamically adapt our projection; for example, it enables applications such as a game which is only projected on a set of objects when they are all brought together in the same location.

7.2 Smart Photograph Album

The Photograph Album is a manufactured object designed specifically to demonstrate three interaction methods in the Cooperative Augmentation framework:

1. Manipulation of the object location
2. Interaction by manipulation of the object geometry
3. Interaction with projected user interfaces

The object itself is a physical book, but is given a new appearance using a dust cover on the outside and a new appearance inside with a pasted-in page. A Smart-Its device attached to the album detects when it is moving, using accelerometers, and detects when it is opened, using a light sensor. This scenario involves two users, one of whom has just returned from a recent holiday and would like to show the second user their photographs. Similar to the Magic Book demonstrator developed by Billinghurst et al. with fiducial-marker-based AR [Billinghurst, Kato et al. 2001], when the book object is detected by the projector-camera system it is automatically augmented with the photograph display.

7.2.1 Object Model

The Object Model appearance knowledge is initially trained with images of the Photograph Album captured with the object placed orthogonal to the camera at 3m distance, and manually annotated with an object bounding box. Two images are captured – the front cover and the inside when opened – representing the two possible object geometries (closed and open). See Figure 7.5 for an example of the front and inside. Local features are extracted from the annotated bounding box regions using the SIFT detection algorithm. 3D Model and appearance states are defined for when the object is open and closed, as shown in Figure 7.6.

The object uses interactive projected buttons to browse left and right through the images in the album. We constrain the interface so users cannot press left and right simultaneously, which allows us to model both buttons with a single button state. The projected left and right buttons then simply return the number of the next image to change to. The buttons are modelled as interactive areas, in which the user's finger is detected and matched to a fingertip template by the detection application. The locations of the interactive areas are specified as surfaces defined in the object's 3D Model.

The "Content Model" is a hierarchy of albums, images and buttons which defines the image to project, based on the button state and whether the object is open or closed. We consider each projection a separate state with its own "Display Model" stored in the Content Model. Each Display Model stores the button locations and button state value, together with the state value the buttons return when pressed, as shown in Figure 7.7. There are 3 albums, each with an image for the front cover with the name of the album superimposed at the top. Each album has 5 images for projection inside when the album is opened, representing individual button states (see the numbers in Figure 7.7). For each projection, the left buttons are defined in the Display Model to return the value of the image to the left of the current image; similarly, the right buttons return the value of the image to the right. The values wrap around, so pressing right from image 5 returns to image 1 and pressing left from image 1 goes to image 5.

Figure 7.5 Photograph Album smart object with projection: (left) front cover, button state 10, (centre) being opened, (right) inside, button state 5.

7.2.2 Registration and Detection

Two users enter the environment, the first carrying the smart Photograph Album. The object registers and transmits its Object Model to the projector-camera system. The smart object detects that it is moving with its accelerometers, and detects that it is closed with the light sensor.
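A minimal sketch of the kind of sensor-to-state evaluation performed by the album's Object Model, assuming accelerometer-variance and light-level thresholds of the kind shown in Figure 7.6; the struct layout, names and threshold values are illustrative assumptions, not the thesis implementation.

```cpp
#include <string>

// Illustrative thresholds of the kind stored in the Object Model's
// movement and 3D Model configurations (values are assumptions).
struct Thresholds {
    double accelVarianceMoving = 3000.0;  // variance above this => moving
    int    lightLevelOpen      = 2;       // light level above this => album open
};

struct SensorReadings {
    double accelVarianceX, accelVarianceY, accelVarianceZ;
    int    lightLevel;
};

// Evaluate the movement configuration: moving if ANY axis variance
// exceeds the threshold (an OR combination, as in the Object Model figures).
bool isMoving(const SensorReadings& s, const Thresholds& t) {
    return s.accelVarianceX > t.accelVarianceMoving ||
           s.accelVarianceY > t.accelVarianceMoving ||
           s.accelVarianceZ > t.accelVarianceMoving;
}

// Evaluate the 3D Model configuration: which geometry/appearance state
// the object should broadcast to the projector-camera system.
std::string geometryState(const SensorReadings& s, const Thresholds& t) {
    return (s.lightLevel > t.lightLevelOpen) ? "2 - Book Open" : "1 - Book Closed";
}
```

Broadcasting the resulting state is what prompts the projector-camera system to swap the 3D Model and appearance description, as described in the registration and manipulation steps that follow.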
The object evaluates its 3D Model configuration states and broadcasts that it is in state 1 – Book Closed, sending the projector-camera system a 3D Model of the closed object and the appearance description of the album cover. The projector-camera system uses the appearance description to search for the object in the camera image. The first user places the object down on a table and it detects that it has stopped moving. The camera detects the closed object on the table and calculates its location and orientation.

7.2.3 Projection and Manipulation

The object requests projection of the first album image (image 0 in Figure 7.7) on the album cover. The first user touches the right button. The user's finger is detected in the interactive area by the camera system, activating the button and changing the button state to 10. The object receives the new button state and requests projection of the second album image (image 10 in Figure 7.7) on the album cover. The first user opens the cover of the album, which appears to the object as a change in light level. The object evaluates its 3D Model configuration states and changes to state 2 – Book Open, updating the projector-camera system with a new 3D Model of the open object and a new appearance model of the inside of the album. Updating the 3D Model geometry automatically removes the projection on the cover. The object requests projection of image 11 on the inside of the album.

The first user passes the album to the second user, seated around the corner of the table. The projection follows the object movement, so it appears attached to the object. The second user browses through the images using the left and right buttons and then closes the object. The close action is again detected through the change in light level, and the 3D Model configuration state returns to 1 – Book Closed. The projection returns to image 10. The first user picks up the album and exits the environment.

7.2.4 Discussion

The photograph album application demonstrates projection on the surface of the smart object. Users can interact directly with the projected buttons on the object to change the projected content. The object senses physical changes such as movement, as well as manipulation of its articulated geometry, such as opening the book. Both factors have an impact on the detection process. In a vision-only approach any change in the appearance or form of the object can cause detection failure. In our approach the object updates the projector-camera system with new appearance knowledge when it detects it has been opened, enabling tracking to continue uninterrupted. Hence, sensor information supports detection robustness.

In the demonstration implementation there are 3 albums and each album has 5 images for projection inside when the album is opened. However, as each image requires its own button state in the Photograph Album model, the number of albums and images is limited only by the storage requirements (of both the Object Model and the images) and the 16-bit integer used for representing individual button states (a total of 65,535 images).

This application demonstrated that a user can interact with an object while it is being detected, tracked and projected on. However, several practical issues emerged from the finger detection method. It was found that both the accuracy of the pose calculation and the jitter have a large effect on finger detection.
For button activation, our fingertip detection method requires a motion gradient towards and then away from the button location, combined with a match to a semi-circular fingertip template within the button area. However, when extracting images of the active button detection area, any jitter or incorrect pose calculation causes an offset area to be extracted. When used in background subtraction, this image easily causes a motion gradient to be established. If the extracted area also contains projection, or the object's surface contains texture which appears rounded like a finger, then a false positive occurs. Similarly, if the extracted image is offset too far from the correct detection area, a real finger interaction can be missed (a false negative). In the photo album demonstrator we initially projected a rounded arrow into each of the button areas, but this was found to give too many false positives and the projection had to be changed to a triangular arrow. This has important implications for this interaction method in the architecture: it suggests that either the method is not suitable for use in conditions with high jitter or a high chance of incorrect pose calculation (such as with moving objects), or that a new, more robust method for finger detection is required (one possibility would be the pattern-oriented detectors proposed by Borkowski et al. [Borkowski, Crowley et al. 2006]).

Figure 7.6 Partial Object Model representation of the Photo Album demonstrator (sensors, movement configuration, and the 3D model, appearance and content configurations for the closed and open states).

Figure 7.7 Photograph Album content to project (three albums, each with a front-cover image and five inside images, indexed by button state).

7.3 Smart Cooking

This demonstrator is designed to illustrate multiple objects augmented with displays from multiple projector-camera systems. Here we use a task-based cooking scenario where smart objects cooperate in order to provide the user with context-dependent recipe instructions to boil an egg. We motivate this demonstrator with a scenario similar to the MIT smart kitchen project CounterIntelligence [Bonanni, Lee et al. 2005], which distributed the recipe instructions into the environment, with an interactive projection of a recipe book onto the work-surface.
The user followed the recipe, and instructions were projected at the correct time onto the work-surface, using a smart-environment approach with cameras and sensors distributed in the kitchen to detect interaction. In our approach we focus on projection directly on the objects, rather than projection at a central location in the environment. This is one of the main characteristics of the Cooperative Augmentation approach, as it allows hidden knowledge inside the object (such as its temperature) to be revealed, and direct feedback to be given to user interaction with the object, where it belongs spatially in the real world. Hence, instead of taking a smart-environment approach we take a smart object approach, making use of sensors embedded in the individual objects themselves (specifically a smart recipe book, smart egg-box, smart pan, smart salt and smart stove). These sensors are used to infer the current context and project the next instruction step onto both the recipe book and the objects themselves.

The smart recipe book contains a series of recipes, each with printed photos of how the meal looks. The user can move around the environment (with the book being tracked by the steerable projector) and select a recipe by turning the pages of the book (similar to the scenario described in section B.2). Each recipe states only the ingredients required and the average time it takes to cook. To select a recipe, the user touches a projected interactive "cook" button on the smart recipe book.

Instead of the smart-environment approach, with a single object (the recipe book, or the environment) controlling the interaction, we break the cooking task down into sequential sub-tasks and distribute these to the smart objects in the environment. When a recipe is selected in the book, each object involved in the recipe is dynamically programmed with its sub-tasks by the recipe book. Task distribution and object programming can be performed automatically from the task description using the RuleCaster approach by Bischoff and Kortuem [Bischoff and Kortuem 2006].

7.3.1 Hard-Boiled Egg Recipe

The task we chose to illustrate the kitchen scenario was to hard-boil an egg. The recipe can be split into 3 parts: ingredients, equipment required and the method. As it is challenging to augment an individual egg due to its size, we instead augment the egg-box with a sensor node.

Ingredients
1. Eggs (inside egg-box)
2. Salt
3. Water

Equipment
1. Pan
2. Stove
3. Sink

Method
1. Add egg from box to saucepan
2. Add cold water from sink to pan, covering the egg
3. Place pan on stove
4. Add a pinch of salt to water
5. Turn on stove, to highest temperature setting
6. Wait until water is boiling
7. Turn down stove to simmer for 7 minutes
8. Turn off stove
9. Empty hot water into sink

7.3.2 Object Model

In this demonstrator implementation we simulate the RuleCaster dynamic programming by developing Object Models for the chosen task and embedding them in each object by hand. The sequential sub-tasks of the recipe are considered individual states of the overall recipe. These states are numbered and used to synchronise the recipe across the smart objects. Initially, the recipe book broadcasts a recipe state value of 0. Objects assigned sub-tasks sense the recipe state on the network and begin their sub-task when the state variable has the correct value. Each object completing a sub-task increments and then re-broadcasts the state variable on the network.
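A minimal sketch of this shared-state synchronisation as it might run on each smart object, assuming a broadcast primitive is available from the node's network stack; the class name, callback style and broadcast function are illustrative assumptions rather than the thesis implementation.

```cpp
#include <functional>
#include <utility>

// Hypothetical broadcast primitive provided by the sensor-node network stack;
// a real implementation would send the value over the wireless network.
inline void broadcastRecipeState(int /*state*/) {}

// Each smart object waits for the recipe state it has been assigned, performs
// its sub-task, then increments and re-broadcasts the shared state variable.
class RecipeStateSync {
public:
    RecipeStateSync(int myState, std::function<bool()> subTaskComplete)
        : myState_(myState), subTaskComplete_(std::move(subTaskComplete)) {}

    // Called whenever a recipe-state broadcast is received from the network.
    void onStateReceived(int state) { currentState_ = state; }

    // Called periodically from the object's main sensing loop.
    void update() {
        if (currentState_ == myState_ && subTaskComplete_()) {
            broadcastRecipeState(myState_ + 1);   // hand over to the next sub-task
            currentState_ = myState_ + 1;
        }
    }

private:
    int myState_;
    int currentState_ = -1;
    std::function<bool()> subTaskComplete_;  // e.g. "force sensor > 30g" for the pan
};
```

In the demonstrator this logic is encoded declaratively in each object's Content Model rather than in code, but the control flow is the same: wait for the assigned state, complete the sub-task, then hand over by re-broadcasting the incremented state.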
We develop a set of rules which form the application logic for each task. These are represented in the Object Models as Content Models (Content Models are described in section 7.2.1). Note: the complete content models of the egg-box, pan, salt and stove objects can be seen in Figure B.4 to Figure B.8, grouped by recipe state variable and arranged sequentially in recipe state order 0-12. In the task procedure description below we present only the excerpts from these figures needed to understand the demonstrator.

7.3.3 Task Procedure

We assume the egg-box, pan, salt and stove are all present in the environment with the recipe book. The egg-box uses a light sensor to detect when it is open (high light level) and closed (low light level). The pan uses a force sensor to detect the weight of its contents, when it is placed on a surface (high force value) and when it is picked up (zero or very low force value). Additionally, the pan uses a temperature sensor to detect when it is at the correct temperature for cooking the egg. The salt uses a force sensor to detect when it is picked up. The stove uses a gas knob position sensor (a simple variable-resistor voltage converted to 0-100% with the Smart-Its A-to-D converter).

We use one steerable projector-camera system centrally mounted in the environment and one fixed projector-camera system with a view down onto the work-surface to detect the objects. When the user stands in front of the work-surface they occlude the steerable projector; hence we use the fixed projector-camera system for detection and projection in this location. Objects moving between the two systems experience a short loss of tracking, as the user typically occludes the steerable projector before the object enters the work-surface projector's field of view.

Recipe State 0

When the initial state is broadcast by the recipe book, the egg-box requests a projection of "Add egg to pan", whether it is open or closed, as shown in Figure 7.8. We use the pan's force sensor to sense when an egg is added to the pan. We assume eggs weigh over 30g (typically eggs weigh 35-75g). Consequently, the pan requests a projection of "Add egg to pan" until the detected force increases above 30g. When this occurs the projection is removed and the new recipe state 1 is broadcast.

Figure 7.8 (left) Recipe State 0 content model rules for the Egg Box and Pan, (right) the "Add egg to pan" projection inside the pan.

Recipe State 1

The projection on the egg-box is removed automatically when state 1 is broadcast; the egg-box plays no further part in this scenario. The pan now requests a projection asking the user to "Add cold water to the pan, to cover the egg". When the force decreases, we infer the object has been picked up. When the pan detects that its temperature has decreased significantly below ambient (20°C is assumed as ambient), we infer the user has added cold water to the pan. The pan projects "Place pan on stove" and the new recipe state 2 is broadcast.
Recipe State 2

The pan monitors the force sensor and the distance to the stove. When it detects a high force value again and the distance to the smart stove object is less than 0.25m, it infers it must have been placed on the stove. The pan removes the projection and the new recipe state 3 is broadcast.

Recipe State 3

The pan requests the projection "Add a pinch of salt" throughout this state, while it is on the stove. The salt object requests the projection "Add a pinch of salt" if it detects that it is put down (static) and distant from the pan. When the salt detects it has been picked up (low force value) and is within 0.25m of the pan, the new recipe state 4 is broadcast.

Recipe State 4

The salt removes the projection when it detects it has been put back down, and the new recipe state 5 is broadcast.

Recipe State 5

The "Add a pinch of salt" projection on the pan is automatically removed. The stove object requests a projection "Turn On Stove, to 100% setting" if the stove setting is less than 100%. When the stove setting is turned to 100%, the projection is removed and the new recipe state 6 is broadcast.

Recipe State 6

The pan object projects the temperature from its temperature sensor and a "Wait until water boils" message as long as the pan is on the stove and the temperature is below 100°C. When the temperature reaches 100°C the new recipe state 7 is broadcast.

Recipe State 7

The stove object requests a projection "Turn Down Stove to Simmer (40% setting)" if the stove setting is outside the range 31 to 49%. When the stove setting is within the range 31 to 49%, the projection is removed and the new recipe state 8 is broadcast.

Figure 7.9 Recipe State 8 content model rules for the Pan and Stove.

Recipe State 8

The stove regulates the pan temperature by requesting the temperature from the pan object. If the temperature drops below 91°C the simmering will stop, so the stove requests a projection "Pan too cold – turn up stove to simmer". If the temperature rises above 98°C the pan is boiling, so the stove requests a projection "Pan too hot – turn down stove to simmer". When the temperature is between 91°C and 98°C, any projection is removed, as shown in Figure 7.9. The pan object requests projection of a countdown timer for 7 minutes while the pan is simmering on the stove. The timer can take the form of a 7-minute video of a countdown. When the timer ends the projection is removed and the new recipe state 9 is broadcast.

Recipe State 9

The stove object requests a projection "Turn Off stove" while the stove setting is above 0%. When the stove is turned to 0% (off), the projection is removed and the new recipe state 10 is broadcast.
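A minimal sketch of the stove's temperature-band regulation rule from Recipe State 8 above, expressed as a single rule-evaluation function; the function and message strings are illustrative, with the 91-98°C band taken from the recipe description.

```cpp
#include <string>
#include <optional>

// Evaluate the stove's Recipe State 8 regulation rule from the pan's
// reported temperature. Returns the message to project, or nothing if
// any existing projection should be removed.
std::optional<std::string> stoveSimmerRule(double panTempCelsius) {
    if (panTempCelsius < 91.0) {
        return "Pan too cold - turn up stove to simmer";
    }
    if (panTempCelsius > 98.0) {
        return "Pan too hot - turn down stove to simmer";
    }
    return std::nullopt;   // 91-98 degrees C: simmering correctly, remove projection
}
```

The corresponding declarative form of this rule is what is shown in Figure 7.9.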
Recipe State 10

The pan object requests a projection "Empty hot water into sink" while it is static on the stove and not near the sink. When the pan detects that it has been picked up, is over 0.25m from the stove and is closer than 0.25m to the sink, the new recipe state 11 is broadcast.

Recipe State 11

When the object detects it is picked up and the temperature has decreased significantly below the simmering temperature, it infers the water has been emptied. The pan requests a projection "Enjoy your meal!" and the new recipe state 12 is broadcast.

Recipe State 12

When the pan object is placed down on a surface the projection is removed. This completes the recipe.

7.3.4 Discussion

The CounterIntelligence project [Bonanni, Lee et al. 2005] used a centralised projection approach in a smart environment, where the projected recipe appeared at one location on the work-surface in the kitchen. This provides a single point of focus for the user, who always knows where the current recipe instruction is located. Similarly, the projector can be mechanically aligned to the fixed projection surface, eliminating any geometric distortion. However, one of the downsides of this and the smart-environment approach can be illustrated by imagining a hot pan on the stove. We want to warn the user that the pan is hot (as shown in the CounterIntelligence project), but because the temperature sensor and projector are located over the stove/work-surface area, we can only project the temperature onto the pan when the pan is on the stove. If we take the pan off the stove it is still hot, but we can no longer sense this or project the warning onto the pan.

In contrast, the approach taken in this thesis with the Cooperative Augmentation framework is to project directly onto the objects and use their sensing capability. One major benefit of our approach is that we obtain knowledge of the object's geometry from the smart object itself. This is significant as it allows us to ignore surrounding surfaces and use projection whenever the object is visible to the projector, in any environment in which a projector-camera system is located - not just those with nearby surfaces suitable for projection. In the smart cooking scenario this approach allows us to provide visual feedback directly on the objects, where it belongs, whenever they are in view of a projector-camera system. In this case a hot smart pan senses its own temperature continuously and can request a warning projection whenever it is hot enough to burn. This projection will appear fixed to the object as it is moved around the environment, by using different projector-camera systems to detect, track and project on it.

7.3.4.1 Coping with Objects Unsuitable for Projection

A number of the objects in the smart cooking scenario are physically small, such as the eggs and the salt. As described in section 5.3, small objects are challenging to detect and track accurately. Similarly, small objects have little space for projection. One solution supported by our architecture is to project onto other nearby smart objects (as described in section 7.1.8).

Figure 7.10 (left) Example salt 3D model, (right) 3D model with an added base projection area.

It is also possible to extend an object's 3D model by making assumptions. For example, we can extend the 3D model of the salt to include a large rectangular area next to, or centred at the base of, the cylinder, as shown in Figure 7.10.
Similarly, the methods for geometry recovery discussed in section 2.3.6 could be used to dynamically model surfaces around the object to allow projection. However, displays created next to the object have the drawback that the display is no longer seamlessly integrated with the object. Hence, the user may find the display projected on the surface of random objects in the environment. This breaks the association with the original smart object which is implicit when projecting on its surfaces, and could cause users to become confused when displays appear in an unexpected spatial location, as reported by Sukaviriya et al. [Sukaviriya, Podlaseck et al. 2003].
7.3.4.2 Implications of Dealing with Multiple Objects
The most challenging aspect of dealing with multiple objects in this scenario was their synchronisation, so that each object was aware of what to sense, which states it could change to, and what to project at any point in time. This has an important implication for our framework, as any scenario with multiple objects and involving an aspect of time will face the same problem. Here we introduced a relatively naïve solution to synchronisation by having each object broadcast a shared global variable, updating the sequential recipe state in all objects. However, this method can only be used in scenarios where the objects are known and fixed a priori; it does not allow another object to be spontaneously incorporated in the recipe without a global re-programming. There is great scope for failure in this demonstrator, as we consider the recipe as a series of states. Any failure to sense a particular state correctly, such as false negatives where an event is not detected (e.g. adding the egg to the pan), or false positives where an event is detected even though it did not happen (e.g. the salt weight sensor detects it is picked up due to a low weight reading, but in reality it is just empty), will cause a failure in the recipe. The user may be able to proceed from a false positive, as the recipe should wait for the objects to enter the next state; however, if an event is never sensed (false negative) due to incorrect Object Model sensor threshold values, incorrect object programming or a hardware failure, then the recipe will stall and the user will not be able to proceed. In practice, the current smart cooking demonstrator is not particularly robust, but still serves as a good concept illustration.
7.3.4.3 Time
Our architecture currently does not address the concept of time, hence in this scenario when the smart pan object has to wait until the egg is cooked we project a video which lasts for the required length. This method has a disadvantage in our scenario; due to our current implementation of video playing in the architecture it cannot be paused if the pan is taken off the stove (so that cooking stops) and then re-started from the same position when the pan is replaced on the stove. Instead, in the current implementation the video would be re-started from the beginning. To explicitly support time a specific timer capability would have to be implemented either in the smart object, running in parallel with the Object Model processing, or on the network. The timer could be modelled to stop and start by event messages sent from the content model states and its state queried as a normal sensor in the framework.
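A minimal sketch of such a timer, modelled as a virtual sensor that content model states could start and stop via events and query like any other sensor reading, follows; the class and method names are hypothetical and not part of the implemented framework.

```java
// Illustrative sketch (hypothetical names): a timer modelled as a virtual sensor.
// Content model states would send start/stop events and query elapsed time like a sensor value.
public class TimerSensor {
    private long startedAt = -1;    // wall-clock time of the last start, -1 while stopped
    private long accumulatedMs = 0; // elapsed time accumulated across start/stop cycles

    /** Event from a content model state, e.g. "pan placed on stove". */
    public synchronized void start() {
        if (startedAt < 0) startedAt = System.currentTimeMillis();
    }

    /** Event from a content model state, e.g. "pan taken off stove": pauses, keeping elapsed time. */
    public synchronized void stop() {
        if (startedAt >= 0) {
            accumulatedMs += System.currentTimeMillis() - startedAt;
            startedAt = -1;
        }
    }

    /** Queried like a normal sensor reading, e.g. against a 0..7min range condition. */
    public synchronized double elapsedMinutes() {
        long ms = accumulatedMs;
        if (startedAt >= 0) ms += System.currentTimeMillis() - startedAt;
        return ms / 60000.0;
    }
}
```

With a pausable timer of this kind, the countdown projection could resume from where it stopped when the pan is replaced on the stove, rather than the video re-starting from the beginning.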
7.3.4.4 Dynamic Object Programming of Multi-Purpose Objects
The physical form and external appearance of the majority of smart objects will typically remain constant over time. Similarly, the number and type of sensors physically embedded in the object will remain constant. However, the use of an object may change frequently. For example, the salt can be used in a variety of ways in cooking (added to water to increase boiling temperature, added to sauces, on fish and chips, etc.) or possibly outside the home to spread on the ground to stop ice forming on paths. Hence, we need a method to allow multi-purpose smart objects to be automatically included in different scenarios. The cooperative augmentation framework provides a method for updating an object’s Object Model dynamically, for example, in response to new appearance knowledge extracted by the detection process. However, we are not restricted to just updating appearance knowledge. Multi-purpose objects can be dynamically programmed by updating the Content Models in the Object Model. Each of the ingredients and pieces of equipment in the kitchen scenario is a multi-purpose object. Hence we use automatic task distribution and object programming directly from the task description using the RuleCaster approach by Bischoff and Kortuem [Bischoff and Kortuem 2006]. However, as this re-programs the whole content model, this approach only allows objects to be part of a single scenario at a time. To allow an object to be truly shared dynamically among a number of uses would require a method of fusing content models that avoids conflicts, such as content models from two scenarios requesting different projections on the same surface at the same time.
7.4 Conclusion
In this chapter we presented three demonstration applications to evaluate the Cooperative Augmentation architecture implementation described in Chapter 6. A feature shared by all three demonstrators was that the knowledge was distributed in the objects, not in the environment. Our Cooperative Augmentation approach uses smart objects rather than smart environments to achieve projected displays on objects’ surfaces. In the smart cooking scenario this meant all the objects cooperated to monitor the cooking process, rather than a single controlling object or the environment. The smart chemical container actively increased its knowledge by extracting more appearance information, demonstrating re-embedding of knowledge into the object. Additionally, the smart chemical container objects supported direct interaction via our architecture, while simultaneously monitoring their workflow. For example, the projection was displayed when the user misplaced the object and stopped when the user corrected the container’s position. Here the object actively controls the projection based on the states pre-defined in its Object Model. In contrast, the smart photograph album changed the photograph displayed when user interaction was detected via finger sensing, demonstrating another interaction modality of our architecture. The use of sensing was shown to greatly benefit the detection process. In the smart chemical container demonstrator, movement sensing constrained the detection process when detecting moving objects and discriminated between objects with identical appearances.
In the smart photograph album, sensing detected user manipulation of the object geometry and allowed the object to update the projector-camera system with new appearance and geometry knowledge for uninterrupted tracking. We identified limitations in projection due to the low projection resolution achievable on the smart chemical container. Hence, we suggest objects should incorporate an adaptation mechanism to adjust what they project based on the achievable display resolution. For example, if only a low resolution is available then detailed text could be replaced with successively more abstract information with larger font size, or a pictogram or symbol. From the smart cooking demonstration we also found that some objects are very difficult to augment with embedded sensing and computation due to their size, shape, use or nature. For example, the eggs in the smart cooking demonstrator are disposable objects, hence it would not likely be economically feasible to augment every egg with embedded sensing and computation. Similarly, we found that some object surfaces are not suitable for projection (such as very small, transparent or reflective surfaces). For example, the eggs in the smart cooking demonstrator cannot easily support projection of complex information due to their size. In the case of reflective surfaces a coping method has been proposed using multiple projectors [Park, Lee et al. 2005]; however, this requires known surface geometry and explicitly tracked users to calculate the projection required to minimise reflection. These last two findings are very significant for our approach, as they suggest there is likely to be a class of everyday real-world objects which cannot be used with Cooperative Augmentation.
Chapter 8 Conclusion
The primary objective of this thesis has been to develop a means for smart objects to cooperate with projector-camera systems to achieve a display capability without adding dedicated tracking hardware to every object or changing the natural appearance, purpose and function of the object. Displays allow any smart object to deliver visual feedback to users from implicit and explicit interaction with information represented or sensed by the physical object. This feedback has the potential to provide a balanced interface, as we support physical objects as both input and output medium simultaneously. This contributes to the central vision of Ubiquitous Computing by enabling users to address tasks in physical space with direct manipulation and have feedback on the objects themselves, where it belongs in the real world.
8.1 Contributions and Conclusions
The core contribution of this thesis is the development of a new approach called Cooperative Augmentation which enables smart objects to cooperate with projector-camera systems to achieve a display capability. This framework formalises a mechanism for smart objects and multiple distributed projector-camera systems to cooperate to solve the object’s output problem and achieve interactive projected displays on their surfaces. The Cooperative Augmentation framework describes the role of the smart object, projector-camera system and the cooperative process. Specifically, the approach embeds the knowledge required for detection and projection in the smart object. This allows us to assume projector-camera systems offer generic projection services for all smart objects. To program the smart object we develop a state-machine programming method.
The smart objects were modelled to give the projector-camera system enough information about their appearance, form and capabilities to be able to detect and track them visually. Object sensors are modelled to allow the projector-camera system to make use of movement sensors for constraining the detection process and to make use of other sensors for interaction detection. In Chapter 5 we show that use of this sensing in detection provides a significant improvement in detection performance. While low-level, our programming method is lightweight, easily conceptualised and requires little information beyond which sensors we want to use, how to combine them and what sensor ranges we want to assign to a particular state. This allows smart objects to change projections in response to sensor events either sensed directly by the object, or remotely (for example, the movement of another associated object). An architecture for the Cooperative Augmentation framework was implemented and experimentally evaluated in Chapter 6 to validate the concept. We present three demonstration applications using the architecture in Chapter 7, developed to demonstrate three different aspects of the architecture. The applications showed that smart objects with different sizes, shapes and appearances could be successfully detected and tracked by the projector-camera system. Additionally, they demonstrated that using our framework the different geometries and surface colours could be corrected so the projection appeared undistorted and visible to an observer. Hence, we conclude that we achieved our original goal of enabling interactive projected displays on smart objects. A central problem in achieving displays on smart objects is their detection and tracking. We investigated vision-based object detection by performing the two experimental studies presented in Chapter 4 and Chapter 5. We present seven insights from these chapters:
1. When investigating natural appearance vision-based detection algorithms we found that the use of scale and rotation invariant algorithms provides performance benefits over single scale and rotation-variant algorithms, without loss of discrimination. Hence we use them in our example framework implementation in Chapter 6.
2. We also learned that detection performance varies when the algorithm is trained at different distances and found the best training distances for each algorithm based on our object appearance library dataset. This is important, as this knowledge both helps determine the best algorithm to use when detecting the object at runtime, and informs us of the best distances to extract extra appearance knowledge in the learning process for maximising detection performance.
3. When investigating rotation we found that an object’s appearance changes when it is rotated in 3D (i.e. not in the camera plane), and that this change can be much greater than with scaling, as whole surfaces can appear or disappear. Hence, to robustly detect an object in any 3D pose we found that we need to train our detection algorithms with multiple viewpoints.
4. The results of the cooperative detection experiment in Chapter 5 show a very clear performance gain for all four natural appearance detection cues when we added movement sensing in the object, indicating the combination of different sensing modalities is important in detection.
The use of movement sensing constrained the search space, which translates to increased robustness to clutter and distractions in the real-world environment, and helped to discriminate between objects with identical appearances. The improvement was seen for all algorithms, suggesting this can be generalised to other detection algorithms.
5. We found that with the use of movement sensing, simple algorithms achieve similar or better detection performance compared to complex detection algorithms, and that the runtime of the majority of algorithms was reduced. This has important implications as it means performance can be maintained or improved while reducing the amount of processing power required for detection, or that overall detection speed can be increased.
6. The results of the cooperative detection experiment confirm that different objects require different appearance representations and detection methods, validating our proposed multi-cue approach.
7. The use of multiple cues improved detection performance for all objects; however, little performance improvement was seen when using more than the best two cues. The percentage improvement in detection performance was also generally greater when using movement sensing.
These findings indicate that any environment using vision detection would greatly benefit from the fact that smart objects cooperate, by informing the environment about both object-specific knowledge (such as the object’s appearance) and embedded sensor readings. Appendix A provides an additional contribution by analysing the benefits and drawbacks of different designs of steerable projector and illustrating the construction of both a moving-head and a moving-mirror prototype. The two prototypes were built from components, which involved both the choice of the appropriate technology and the subsequent integration of the technology with the computer system. The objective was to develop a system enabling detection and tracking of mobile objects and interactive display of graphical user interfaces anywhere in the environment. Of the available technologies presented in Chapter 2, we consider computer vision detection combined with steerable projection best able to satisfy this objective. While our steerable projector-camera systems have similar abilities to related projects with steerable projectors [Pinhanez 2001; Borkowski, Riff et al. 2003; Butz, Schneider et al. 2004; Ehnes, Hirota et al. 2004], the equipment design is original. Specifically, for our moving-head steerable projector we use a single-arm yoke for pan and tilt, rather than twin arms, allowing easier calibration of video projectors which have horizontally offset lenses (i.e. the majority), as the projector can be attached so the projector’s centre of projection is close to the centre of rotation of the pan and tilt unit.
8.2 Benefits of our approach
With the Cooperative Augmentation approach we aim to make life easier for both the user and the smart object. By displaying information on objects, where it belongs in the real world, we can make use of human spatial memory to aid interaction [Butz and Krüger 2003]. Similarly, interfaces tightly coupled to an artefact’s surfaces can allow a direct mapping for interaction, which may reduce the level of indirection and hence the cognitive load. For example, the orientation of an artefact could be used as a direct input method while the projected display provides the user with instant feedback on manipulation of the object itself, such as the change of a highlighted option.
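A minimal sketch of this kind of direct mapping is given below, assuming the object reports a tilt angle from an embedded accelerometer; the class, the tilt range and the option labels are hypothetical, chosen purely to illustrate mapping manipulation of the object to a highlighted option that the projected display can show on the object itself.

```java
// Illustrative sketch (hypothetical names): map an object's sensed tilt angle to one of
// several menu options, so the projected display can highlight the selected option.
public class TiltMenu {
    private final String[] options;

    public TiltMenu(String... options) { this.options = options; }

    /** Map a tilt angle in degrees (e.g. -45..+45 from an embedded accelerometer)
     *  to an option index, dividing the range evenly between the options. */
    public int select(double tiltDegrees) {
        double clamped = Math.max(-45.0, Math.min(45.0, tiltDegrees));
        double fraction = (clamped + 45.0) / 90.0;   // 0..1 across the tilt range
        int index = (int) (fraction * options.length);
        return Math.min(index, options.length - 1);
    }

    /** The label the projector-camera system would render highlighted on the object. */
    public String highlighted(double tiltDegrees) {
        return options[select(tiltDegrees)];
    }

    public static void main(String[] args) {
        TiltMenu menu = new TiltMenu("Previous", "Select", "Next");
        System.out.println(menu.highlighted(-30.0)); // "Previous"
        System.out.println(menu.highlighted(0.0));   // "Select"
        System.out.println(menu.highlighted(30.0));  // "Next"
    }
}
```

Because the feedback appears on the object being manipulated, the mapping stays direct: tilting the object further immediately moves the highlighted option.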
With projected displays users are no longer constrained to a single static location displaying information about smart objects. Instead, objects are tracked and the displays follow the objects. Steerable and mobile projector-camera systems further enhance and encourage the mobility of the objects, so users no longer have to sit in a specific location to read an interactive projected book. The use of projected displays also allows multiple people to see the display and interact simultaneously. Objects can also describe themselves, either directly or indirectly. For example, they can directly label themselves with their names in a foreign language to help students learn [Intille, Lee et al. 2003], or project assembly instructions directly onto pieces of furniture to decrease difficulty and reduce errors in the assembly task for novice users, as described in section B.3. This approach provides a possible solution to one of the typical problems for users in Ubiquitous Computing environments – knowing which objects are smart and can be interacted with. As objects can label themselves, projected displays support easy discovery for users. One example application would be a searchlight application, which would detect all smart objects in the environment and reveal information about their status or supported interaction methods on their surfaces. In addition to making inherent object properties explicit (where the invisible becomes visible) we can use the displays to reveal or visualise state changes in the object. For example, a smart object configuration and debugging display could allow programmers real-time modification of the smart object’s state configurations. Similarly, displays can be used for inspection, to reveal hidden contents or relationships. For example, two objects without a built-in location system can be associated and display content linked to a particular state when they are in proximity. These displays can also visualise information stored in the object, such as a record of use, for example, how a wine bottle was stored while maturing. Our approach also has potential benefits for employers and employees as it could replace steps in work processes. For example, it has been demonstrated that smart objects can embed rules from the large rulebooks for safe handling of chemical containers [Strohbach, Gellersen et al. 2004]. Employees are currently responsible for safe handling, but may not remember all the rules in detail. By using smart containers and projection services, there is the potential to either remove some of the responsibility from the employees or make the interaction faster and simpler for the employee. Safety-critical storage locations could be monitored by the chemical containers themselves, which could warn employees by visual notification when a container is stored outside the approved storage area or near to reactive chemicals. In this case the job of the employee is made much easier if the container can identify itself to the employee visually and guide the employee to the correct location.
8.3 Limitations
While our work provides a proof of concept for the Cooperative Augmentation framework, there are a number of technical limitations which would make its direct deployment in the real world infeasible.
We aim to identify some of these limitations in this section and suggest possible solutions. Robust vision-based object detection in the real world is a hard problem and the visual detection system we implemented is not a perfect detection system. Consequently, many objects achieve less than 100% detection performance with our natural appearance methods due to problems with scale, rotation, defocus blur, fast motion blur, partial or total occlusion, lighting and shadows or distracting objects. Practically, this means an object may not be immediately detected on entry to an environment (if at all). However, as we demonstrate with the cooperative detection experiment in Chapter 5, there are ways to improve the detection performance without writing new improved detection algorithms, by making use of embedded movement sensing in the object. However, there are some limitations to an approach only involving movement. To detect static objects, the only way we can use movement sensing is to exclude moving areas in the image from the detection process. While this bears some similarity to the scenario where there is a moving object and a static background, the detection performance will vary between the performance without sensing and with sensing, depending on the size of the moving area in the image. In this case, greater knowledge of the object’s context from other sensors may help us detect the object, for example, using an embedded electronic compass or any other available location system. Similarly, if the object has external light sensors a structured light approach would also be possible, as discussed in section 2.7.2. The smart chemical container demonstration application in Chapter 7 raised another limitation of our system, which was the resolution of the projection on the chemical containers. The projector we used was an XGA (1024x768 pixel) projector, and it created a relatively low resolution 200x130 pixel display with 1mm² pixels at a distance of 3m when set to mid-zoom. To increase this resolution either a higher resolution projector can be purchased (High Definition 1920x1080 pixel projectors are now common), or the projector zoom can be used dynamically to zoom in, with the trade-off of reducing the field of view.
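As a rough, back-of-the-envelope illustration of these figures (not a measurement from the thesis experiments): if 200 pixels cover the container at 1mm per pixel, the full 1024-pixel image width at that throw and zoom spans roughly a metre, so a projector with 1920 pixels across the same image width would shrink the pixel pitch to around 0.5mm. The small sketch below simply restates this arithmetic, ignoring differences in aspect ratio and lens geometry.

```java
// Back-of-the-envelope arithmetic for the resolution figures above (an illustration only,
// assuming the HD projector fills the same image width at the same throw and zoom).
public class PixelPitch {
    public static void main(String[] args) {
        double xgaPitchMm = 1.0;                  // reported: roughly 1mm pixels on the container
        double imageWidthMm = 1024 * xgaPitchMm;  // full XGA image width at 3m, mid-zoom
        double hdPitchMm = imageWidthMm / 1920.0; // same width spread over 1920 pixels
        System.out.printf("Image width ~%.0fmm, HD pixel pitch ~%.2fmm%n", imageWidthMm, hdPitchMm);
        // Prints: Image width ~1024mm, HD pixel pitch ~0.53mm
    }
}
```

Under the same assumptions, the 200x130 pixel patch on the container would receive roughly 375x244 HD pixels.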
From the smart cooking demonstration application we found that some objects are very difficult to augment with embedded sensing and computation due to their size, shape, use or nature. Similarly, we found that some object surfaces are not suitable for projection (such as very small, transparent or reflective surfaces). These two findings illustrate a significant limitation of our approach, as they suggest there is likely to be a class of everyday real-world objects which cannot be used with Cooperative Augmentation.
8.4 Future Work
The three main challenges in our framework implementation are robust visual detection of smart objects, correcting the projection for non-planar geometry and non-ideal surface colour, and keeping the projection synchronised when the object is moved or manipulated. These are the areas where improvements to the implementation would bring the most benefit. For detection specifically, more research is required on how different combinations of training knowledge and sensing change the detection performance, on which computer vision algorithms are best suited to detecting the objects, and on what impact different movement sensor types have on the robustness of detection. A further question is how to make detection more robust. We have seen that movement sensors embedded in smart objects can provide a significant increase in performance, so perhaps this can be generalised to other types of sensors. In this case the question would be which sensors, and how many more sensors, would help. As section 2.7 illustrated, it has been shown that both inertial and light sensors are valuable in the detection and pose-calculation process, but there should be a comprehensive study of which sensors add the most value and which sensors could be best combined. For example, if the object has an embedded inclinometer or 3D accelerometers, we know this can help in pose calculation, but only if we know our camera orientation relative to the gravity vector (as discussed in section 5.3.4). The detection system in the framework implementation can optionally make use of the graphics card GPU for speeding up complex image processing tasks. Typically a performance increase of 10-15% is seen over the CPU algorithms. However, the CPU-GPU transfer time is currently a major bottleneck, especially for small amounts of data such as results of reduction operations or control data for the image processing algorithms on the GPU. Ideally, the solution is to move as much processing as possible to the GPU to optimise performance, while only performing an initial upload of a camera image and a download of the final detection result. However, only relatively few of the complex computer vision algorithms we use have been transferred to the GPU [Strzodka, Ihrke et al. 2003; Montemayor, Pantrigo et al. 2004; Cabido, Montemayor et al. 2005; Fung and Mann 2005; Klein and Murray 2006; Sinha, Frahm et al. 2006; Leiva, Sanz et al. 2007] and many must still be written or further optimised to best make use of the parallel stream processing paradigm. Both the detection approach and the projective texturing geometric distortion correction method we use rely on a known 3D model of the object. While it is not a stretch to assume that ‘newly purchased’ smart objects brought into the environment would automatically have knowledge of their appearance and form, this does not address the problem of smart object prototyping or of legacy objects which we may possess and make smart ourselves by adding computation. These objects have no 3D or appearance models. While one possible solution would be to purchase commercial 3D object scanning equipment, this is an expensive option. In contrast, by using computer vision two approaches are readily available – firstly, by using structured light with a projector-camera system, for example, the approach proposed by Borkowski et al. to detect planar surfaces in the environment with a steerable projector [Borkowski, Riff et al. 2003], or the approaches discussed in section 2.3.6. Secondly, by using structure-from-motion techniques, either with direct affine-invariant local feature modelling [Rothganger, Lazebnik et al. 2006], or by using a SLAM approach [Davison and Murray 2002; Davison 2003; Chekhlov, Gee et al. 2007; Klein and Murray 2007]. Here we can imagine a user would only need to hold the object up to the camera and rotate it for a 3D and appearance model to be dynamically created. Open questions also remain in the area concerning the location of projections on an object.
Specifically, if an object does not specify an exact projection location, or always wants the most visible, readable and usable projection, the framework would have to change the location of the projection automatically based on object visibility or orientation. The challenge here is determining the best strategy for choosing the location on the object’s surfaces that is most visible to the user; we would likely either need to track the user, or make assumptions about the user’s location. We address scalability in our architecture by using a concept of an “environment”, which is equivalent to a defined group of projectors and cameras contained in a constrained spatial area, such as a room. Similarly, for steerable projector-camera systems we introduce a concept of “system focus” to determine which object to track when multiple objects are present. However, it is still unclear how scalable these approaches are and how many objects can be seamlessly detected, tracked and augmented as they travel through space. Finally, we have achieved a realisation of the Cooperative Augmentation concept that is flexible and efficient enough that user studies to evaluate the user experience become feasible. We believe the benefit of our approach can be seen best when directly comparing traditional AR displays (e.g. traditional monitors, HMDs, handheld or mobile computers) against projected displays on the surfaces of objects.
Bibliography
Aixut, T., Y. L. d. Meneses, et al. (2003). Constraining deformable templates for shape recognition. 6th Conference on Quality Control by Artificial Vision (QCAV). Antifakos, S., F. Michahelles, et al. (2002). Proactive Instructions for Furniture Assembly. Ubiquitous Computing (Ubicomp 2002). Göteborg, Sweden, Springer Verlag. Antifakos, S., F. Michahelles, et al. (2004). Towards Situation-Aware Affordances: An Experimental Study. Pervasive Computing, Vienna, Austria, LNCS. Armstrong, M. and A. Zisserman (1995). Robust object tracking. Asian Conference on Computer Vision. Aron, M., G. Simon, et al. (2004). Handling Uncertain Sensor Data in Vision-Based Camera Tracking. Proceedings of the 3rd IEEE/ACM International Symposium on Mixed and Augmented Reality, IEEE Computer Society. Ashdown, M., M. Flagg, et al. (2004). A Flexible Projector-Camera System for Multi-Planar Displays. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) 2004, Washington D. C., IEEE. Ashdown, M. and Y. Sato (2005). Steerable Projector Calibration. Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Workshops - Volume 03, IEEE Computer Society. Azuma, R., Y. Baillot, et al. (2001). "Recent Advances in Augmented Reality." IEEE Computer Graphics and Applications 21(6): 34-47. Azuma, R. T. (1997). "A Survey of Augmented Reality." Presence: Teleoperators and Virtual Environments 6(4): 355-385. Bandyopadhyay, D., R. Raskar, et al. (2001). Dynamic Shader Lamps: Painting on Movable Objects. Proceedings of the IEEE and ACM International Symposium on Augmented Reality (ISAR'01), IEEE Computer Society. Bartelmus, C. (2007). "Linux Infrared Remote Control Webpage." Retrieved October, 2005, from http://lirc.org/. Beigl, M. and H. Gellersen (2003). Smart-Its: An Embedded Platform for Smart Objects. Smart Objects Conference (sOc). Grenoble, France. Belongie, S., J. Malik, et al. (2001). Matching Shapes. International Conference on Computer Vision (ICCV). Belongie, S., J. Malik, et al. (2002).
"Shape Matching and Object Recognition Using Shape Contexts." IEEE Transactions on Pattern Analysis and Machine Intelligence 24(4): 509-522. 162 BIBLIOGRAPHY Bhasker, E. S., P. Sinha, et al. (2006). "Asynchronous Distributed Calibration for Scalable and Reconfigurable Multi-Projector Displays." IEEE Transactions on Visualization and Computer Graphics 12(5): 1101-1108. Billinghurst, M., H. Kato, et al. (2001). "The MagicBook, a transitional AR interface." Computers & Graphics 25: 745-753. Bimber, O., F. Coriand, et al. (2005). Superimposing pictorial artwork with projected imagery. ACM SIGGRAPH 2005 Courses. Los Angeles, California, ACM. Bimber, O. and A. Emmerling (2006). "Multi-Focal Projection: A Multi-Projector Technique for Increasing Focal Depth." IEEE Transactions on Visualization and Computer Graphics (TVCG). Bimber, O., A. Emmerling, et al. (2005). Embedded entertainment with smart projectors. IEEE Computer. 38: 56-63. Bimber, O. and R. Raskar (2005). Spatial Augmented Reality: Merging Real and Virtual Worlds, A. K. Peters, Ltd. Bischoff, U. and G. Kortuem (2006). RuleCaster: A Macroprogramming System for Sensor Networks. Proceedings OOPSLA Workshop on Building Software for Sensor Networks, Portland, Oregon, USA. Block, F., A. Schmidt, et al. (2004). Towards a Playful User Interface for Home Entertainment Systems. EUSAI 2004, Springer LNCS. Bonanni, L., C. H. Lee, et al. (2005). Counter Intelligence: Augmented Reality Kitchen. Extended Abstracts of Computer Human Interaction (CHI) 2005, Portland, OR. Borkowski, S. (2006). Steerable Interfaces for Interactive Environments. laboratoire GRAVIR – IMAG. Grenoble, INRIA Rhone-Alpes. PhD. Borkowski, S., J. L. Crowley, et al. (2006). User-Centric Design of a Vision System for Interactive Applications. Proceedings of the Fourth IEEE International Conference on Computer Vision Systems, IEEE Computer Society. Borkowski, S., J. Letessier, et al. (2004). Spatial Control of Interactive Surfaces in an Augmented Environment. 9th IFIP Working Conference on Engineering for Human-Computer Interaction (EHCI'04). Hamburg, Germany, Springer Berlin / Heidelberg. Volume 3425/2005: 228-244. Borkowski, S., O. Riff, et al. (2003). Projecting rectified images in an augmented environment. ProCams Workshop. International Conference on Computer Vision, ICCV 2003 Nice, France, IEEE Computer Society Press. Bourgeois, S., H. Martinsson, et al. (2005). A Practical Guide to Marker Based and Hybrid Visual Registration for AR Industrial Applications Lecture Notes in Computer Science 3691, Computer Analysis of Images and Patterns, Springer Berlin / Heidelberg. 3691/2005: 669-676. Bradski, G. R. (1998). "Computer Vision Face Tracking For Use in a Perceptual User Interface." Intel Technology Journal Q2. Brasnett, P., L. Mihaylova, et al. (2005). Particle Filtering with Multiple Cues for Object Tracking in Video Sequences. SPIE's 17th Annual Symp. on Electronic Imaging, Science and Technology, San Jose California, USA. Brooks, F. P. (1999). What's Real About Virtual Reality? IEEE Computer Graphics and Applications IEEE Computer Society Press. 19: 16-27. Brown, M. and D. G. Lowe (2002). Invariant features from interest point groups. British Machine Vision Conference (BMVC 2002). Cardiff, Wales. Brown, M., A. Majumder, et al. (2005). "Camera-Based Calibration Techniques for Seamless Multiprojector Displays." IEEE Transactions on Visualization and Computer Graphics 11(2): 193-206. 163 BIBLIOGRAPHY Butz, A., M. Groß, et al. (2004). 
TUISTER: a tangible UI for hierarchical structures. Proceedings of the 9th international conference on Intelligent user interfaces. Funchal, Madeira, Portugal, ACM. Butz, A. and A. Krüger (2003). A Generalized Peephole Metaphor for Augmented Reality and Instrumented Environments. The International Workshop on Software Technology for Augmented Reality Systems (STARS), Tokyo, Japan. Butz, A., M. Schneider, et al. (2004). SearchLight - A Lightweight Search Function for Pervasive Environments. Pervasive2004. Vienna, Austria, Springer LNCS. Buxton, W. (1990). A Three-State Model of Graphical Input. Human-Computer Interaction - INTERACT '90. D. Diaper and e. al. Amsterdam, Elsevier Science Publishers B.V. (North-Holland): 449-456. Cabido, R., A. S. Montemayor, et al. (2005). Hardware-Accelerated Template Matching. Pattern Recognition and Image Analysis, Springer LNCS. 3522/2005: 691-698. Canny, J. (1986). "A Computational Approach To Edge Detection." IEEE Transactions on Pattern Analysis and Machine Intelligence(8): 679-714. Cao, X. and R. Balakrishnan (2006). Interacting with dynamically defined information spaces using a handheld projector and a pen. Proceedings of the 19th annual ACM symposium on User interface software and technology. Montreux, Switzerland, ACM. Cao, X., C. Forlines, et al. (2007). Multi-user interaction using handheld projectors. Proceedings of the 20th annual ACM symposium on User interface software and technology. Newport, Rhode Island, USA, ACM. Chandraker, M. K., C. Stock, et al. (2003). Real-Time Camera Pose in a Room. Computer Vision Systems: 98-110. Chekhlov, D., A. Gee, et al. (2007). Ninja on a Plane: Automatic Discovery of Physical Planes for Augmented Reality Using Visual SLAM. International Symposium on Mixed and Augmented Reality (ISMAR07). Nara, Japan. Chen, H., R. Sukthankar, et al. (2002). Scalable alignment of large-format multiprojector displays using camera homography trees. Proceedings of the conference on Visualization '02. Boston, Massachusetts, IEEE Computer Society. Cheng, K. and K. Pulo (2003). Direct Interaction with Large-Scale Display Systems using Infrared Laser Tracking Devices. Conferences in Research and Practice in Information Technology Series (CRPIT2003). Comaniciu, D. and P. Meer (2002). "Mean Shift: A Robust Approach Toward Feature Space Analysis." IEEE Transactions on Pattern Analysis and Machine Intelligence 24(5): 603-619. Dao, V. N., K. Hosoi, et al. (2007). A semi-automatic realtime calibration technique for a handheld projector. Proceedings of the 2007 ACM symposium on Virtual reality software and technology. Newport Beach, California, ACM. David, P. and D. DeMenthon (2005). Object Recognition in High Clutter Images Using Line Features. Proceedings of the Tenth IEEE International Conference on Computer Vision - Volume 2, IEEE Computer Society. David, P., D. DeMenthon, et al. (2002). SoftPOSIT: Simultaneous Pose and Correspondence Determination. Proceedings of the 7th European Conference on Computer Vision-Part III, Springer-Verlag. David, P., D. DeMenthon, et al. (2003). Simultaneous pose and correspondence determination using line features. Computer Vision and Pattern Recognition (CVPR), IEEE. 164 BIBLIOGRAPHY Davison, A. J. (2003). Real-Time Simultaneous Localisation and Mapping with a Single Camera International Conference on Computer Vision (ICCV 2003). Nice, France, IEEE. Davison, A. J. and D. W. Murray (2002). "Real-Time Simultaneous Localisation and Mapping with a Single Camera." 
IEEE Transactions on Pattern Analysis and Machine Intelligence 24(7): 865-880. Decker, C., M. Beigl, et al. (2004). eSeal - a system for enhanced electronic assertion of authenticity and integrity of sealed items. In Proceedings of the Pervasive Computing, volume 3001 of Lecture Notes in Computer Science (LNCS), Springer Verlag LNCS. Decker, C., A. Krohn, et al. (2005). The Particle Computer System. In the proceedings of the ACM/IEEE Fourth International Conference on Information Processing in Sensor Networks (IPSN05). Los Angeles. DeMenthon, D. and L.S. Davis (1995). "Model-Based Object Pose in 25 Lines of Code." International Journal of Computer Vision 15: 123-141. Deriche, R. and O. Faugeras (1990). "Tracking Line segments." Image and Vision Computing 8(4): 261-270. Dietz, P. and D. Leigh (2001). DiamondTouch: a multi-user touch technology. Proceedings of the 14th annual ACM symposium on User interface software and technology. Orlando, Florida, ACM. DisappearingComputer. (2002). "European IST. The Disappearing Computer Initiative." Retrieved 10th October, 2007, from http://www.disappearing-computer.net/. Drummond, T. and R. Cipolla (2002). "Real-time visual tracking of complex structures." IEEE Transactions on Pattern Analysis and Machine Intelligence 27: 932-946. Ehnes, J. and M. Hirose (2006). Finding the Perfect Projection System – Human Perception of Projection Quality Depending on Distance and Projection Angle. Embedded and Ubiquitous Computing. Ehnes, J., K. Hirota, et al. (2004). Projected Augmentation - Augmented Reality using Rotatable Video Projectors. Proceedings of the Third IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR'04) Volume 00, IEEE Computer Society. Ehnes, J., K. Hirota, et al. (2005). Projected Augmentation II: A Scalable Architecture for Multi Projector Based AR-Systems Based on 'Projected Applications'. Proceedings of the Fourth IEEE and ACM International Symposium on Mixed and Augmented Reality, IEEE Computer Society. F.Mokhtarian (1995). "Silhoutte based Isolated Object Recognition through Curvature Scale Space." IEEE Trans. on Pattern Analysis and Machine Intelligence 17(5): 539-544. F.Mokhtarian and S. Abbasi (2005). "Robust automatic selection of optimal views in multi-view free-form object recognition." Pattern Recognition 38(7): 1021-1031. F.Mokhtarian, N. Khalili, et al. (2001). "Multi-scale free-form 3D object recognition using 3D models " Image and Vision Computing 19(5): 271-281. Felzenszwalb, P. F. (2005). "Representation and Detection of Deformable Shapes." IEEE Trans. Pattern Anal. Mach. Intell. 27(2): 208-220. Ferrari, V., T. Tuytelaars, et al. (2001). Makerless augmented reality with a real-time affine region tracker. IEEE and ACM ISAR. Fiala, M. (2005). "ARTag Webpage." Retrieved 03/03/05, from http://www.cv.iit.nrc.ca/research/ar/artag/. 165 BIBLIOGRAPHY Fiala, M. (2005). ARTag, a Fiducial Marker System Using Digital Techniques. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) IEEE Computer Society. Fischler, M. A. and R. C. Bolles (1981). "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography." Commun. ACM 24(6): 381-395. Flagg, M., J. W. Summet, et al. (2005). Improving the Speed of Virtual Rear Projection: A GPU-Centric Architecture. IEEE International Workshop on ProjectorCamera Systems (PROCAMS 2005) held in conjunction with IEEE International Conference on Computer Vision & Pattern Recognition (CVPR 2005). 
San Diego, California, USA. Foxlin, E. and L. Naimark (2003). VIS-Tracker: A Wearable Vision-Inertial SelfTracker. Proceedings of the IEEE Virtual Reality 2003, IEEE Computer Society. Fujii, K., M. D. Grossberg, et al. (2005). A Projector-Camera System with Real-Time Photometric Adaptation for Dynamic Environments. Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Volume 1 - Volume 01, IEEE Computer Society. Fung, J. and S. Mann (2005). OpenVIDIA: parallel GPU computer vision. Proceedings of the 13th annual ACM international conference on Multimedia. Hilton, Singapore, ACM. Gee, A. P. and W. W. Mayol-Cuevas (2006). Real-Time Model-Based SLAM Using Line Segments. Advances in Visual Computing, Second International Symposium (ISVC 2006), Lake Tahoe, NV, USA, Springer LNCS. Gellersen, H.-W., M. Beigl, et al. (1999). The MediaCup: Awareness Technology Embedded in a Everyday Object. Proceedings of the 1st international symposium on Handheld and Ubiquitous Computing. Karlsruhe, Germany, Springer-Verlag. Genc, Y., S. Riedel, et al. (2002). Markerless Tracking for AR: A Learning-Based Approach. International Symposium in Mixed and Augmented Reality (ISMAR 2002). Geoffrion, J. R. (2005). "Understanding Digital Camera Resolution " Retrieved 10th March, 2005, from http://www.luminous-landscape.com/tutorials/understandingseries/res-demyst.shtml. Giebel, J., D. M. Gavrila, et al. (2004). "A Bayesian Framework for Multi-cue 3D Object Tracking " Lecture Notes in Computer Science Computer Vision ECCV2004 3024/2004: 241-252. Gold, S., A. Rangarajan, et al. (1998). "New Algorithms for 2D and 3D Point Matching: Pose Estimation and Correspondence." Pattern Recognition 31(8): 1019-1031. Gordon, I. and D. G. Lowe (2004). Scene modelling, recognition and tracking with invariant image features. International Symposium on Mixed and Augmented Reality (ISMAR2004), Arlington, VA, USA. Gordon, I. and D. G. Lowe (2006). What and where: 3D object recognition with accurate pose. Toward Category-Level Object Recognition. J. Ponce, M. Hebert, C. Schmid and A. Zisserman, Springer-Verlag: 67-82. Grossberg, M. D., H. Peri, et al. (2004). "Making one object look like another: Controlling appearance using a projector-camera system." IEEE Computer Vision and Pattern Recognition 1: 452-459. Grundhöfer, A. and O. Bimber (2008). "Real-Time Adaptive Radiometric Compensation." IEEE Transactions on Visualization & Computer Graphics (TVCG) 14(1): 97-108. 166 BIBLIOGRAPHY Grundhöfer, A., M. Seeger, et al. (2007). Dynamic Adaptation of Projected Imperceptible Codes. IEEE International Symposium on Mixed and Augmented Reality (ISMAR'07), Nara, Japan. Han, J. Y. (2005). Low-cost multi-touch sensing through frustrated total internal reflection. Proceedings of the 18th annual ACM symposium on User interface software and technology. Seattle, WA, USA, ACM. Harris, C. (1993). Tracking with rigid models. Active vision, MIT Press: 59-73. Harris, C. and M. Stephens (1988). A combined corner and edge detector. Proceedings of the 4th Alvey Vision Conference:. Hartley, R. and A. Zisserman (2003). Multiple View Geometry in Computer Vision, Cambridge University Press. Hirose, R. and H. Saito (2005). A Vision-Based AR Registration Method Utilizing Edges and Vertices of 3D Model. ICAT 2005. Christchurch, New Zealand. Holleis, P. (2005). "EiToolkit." Retrieved 03/01/07, 2007, from https://wiki.medien.ifi.lmu.de/view/HCILab/EiToolkit. Holmquist, L. E., H.-W. Gellersen, et al. (2004). 
Building Intelligent Environments with Smart-Its. IEEE Computer Graphics & Applications: 56-64. Inc, C. T. (2007). "Crossbow Technology Inc Wireless Sensor Networks." Retrieved 1st April, 2008, from http://www.xbow.com/Home/HomePage.aspx. IntelOpenCV. (2007). "Open Source Computer Vision Library (OpenCV)." Retrieved July, 2004, from http://www.intel.com/technology/computing/opencv/. Intille, S. S., V. Lee, et al. (2003). Ubiquitous Computing in the Living Room: Concept Sketches and an Implementation of a Persistent User Interface. UBICOMP 2003 Video Program, Seattle, Washington, USA, Springer LNCS. Ipiña, D. L. d., P. Mendonça, et al. (2002). "TRIP: a Low-Cost Vision-Based Location System for Ubiquitous Computing,." Personal and Ubiquitous Computing Journal 6(3): 206-219. Isard, M. and A. Blake (1998). "CONDENSATION—Conditional Density Propagation forVisual Tracking." International Journal of Computer Vision (IJCV) 29(1): 528. Isard, M. and A. Blake (1998). A Smoothing Filter for CONDENSATION. Proceedings of the 5th European Conference on Computer Vision-Volume I - Volume I, Springer-Verlag. Ishii, H. and B. Ullmer (1997). Tangible bits: towards seamless interfaces between people, bits and atoms. Proceedings of the SIGCHI conference on Human factors in computing systems. Atlanta, Georgia, United States, ACM. Johnson, T. and H. Fuchs (2007). Real-Time Projector Tracking on Complex Geometry Using Ordinary Imagery. IEEE International Workshop on Projector-Camera Systems (ProCams2007). Minneapolis, MN, USA, IEEE. Jurie, F. and M. Dhome (2002). "Hyperplane Approximation for Template Matching." IEEE Trans. Pattern Anal. Mach. Intell. 24(7): 996-1000. Kale, A., K. Kwan, et al. (2004). Epipolar Constrained User Push Button Selection in Projected Interfaces. 1st CVPR workshop on Real time Vision for Human Computer Interaction. Washington DC, USA. Kalman, R. E. (1960). "A New Approach to Linear Filtering and Prediction Problems." Transactions of the ASME - Journal of Basic Engineering 82: 35-45. Karitsuka, T. and K. Sato (2003). A Wearable Mixed Reality with an On-Board Projector. Proceedings of the The 2nd IEEE and ACM International Symposium on Mixed and Augmented Reality, IEEE Computer Society. 167 BIBLIOGRAPHY Kato, H. and M. Billinghurst (1999). Marker Tracking and HMD Calibration for a Video-Based Augmented Reality Conferencing System. Proceedings of the 2nd IEEE and ACM International Workshop on Augmented Reality, IEEE Computer Society. Kirstein, C. and H. Müller (1998). Interaction with a Projection Screen Using a CameraTracked Laser Pointer. The International Conference on Multimedia Modeling (MMM'98), IEEE Computer Society Press. Kjeldsen, R. (2005). Exploiting the Flexibility of Vision-Based User Interactions, IBM Technical Report. Kjeldsen, R., C. Pinhanez, et al. (2002). Interacting with Steerable Projected Displays. Proceedings of the Fifth IEEE International Conference on Automatic Face and Gesture Recognition, IEEE Computer Society. Klein, G. and T. Drummond (2003). Robust visual tracking for non-instrumental augmented reality. The Second IEEE and ACM International Symposium on Mixed and Augmented Reality. Klein, G. and T. Drummond (2004). Sensor Fusion and Occlusion Refinement for Tablet-Based AR. Proceedings of the 3rd IEEE/ACM International Symposium on Mixed and Augmented Reality, IEEE Computer Society. Klein, G. and D. Murray (2006). Full-3D Edge Tracking with a Particle Filter. British Machine Vision Conference (BMVC06), Edinburgh, UK. Klein, G. and D. Murray (2007). 
Parallel Tracking and Mapping for Small AR Workspaces. International Symposium on Mixed and Augmented Reality (ISMAR'07). Nara, Japan. Koike, H., Y. Sato, et al. (2001). "Integrating paper and digital information on EnhancedDesk: a method for realtime finger tracking on an augmented desk system." ACM Trans. Comput.-Hum. Interact. 8(4): 307-322. Koller, D., K. Danilidis, et al. (1993). "Model-based object tracking in monocular image sequences of road traffic scenes." Int. J. Comput. Vision 10(3): 257-281. Kölsch, M. and M. Turk (2002). Keyboards without Keyboards: A Survey of Virtual Keyboards. Sensing and Input for Media-centric Systems (SIMS 02). Kortuem, G., D. Alford, et al. (2007). Sensor Networks or Smart Artifacts? An Exploration of Organizational Issues of An Industrial Health and Safety Monitoring System. The Ninth International Conference on Ubiquitous Computing (Ubicomp 2007). Innsbruck, Austria, Springer. Kortuem, G., C. Kray, et al. (2005). Sensing and visualizing spatial relations of mobile devices. Proceedings of the 18th annual ACM symposium on User interface software and technology. Seattle, WA, USA, ACM. Kotake, D., K. Satoh, et al. (2005). A Hybrid and Linear Registration Method Utilizing Inclination Constraint. Proceedings of the 4th IEEE/ACM International Symposium on Mixed and Augmented Reality, IEEE Computer Society. Kotake, D., K. Satoh, et al. (2007). A Fast Initialization Method for Edge-based Registration Using an Inclination Constraint. Proceedings of the 6th IEEE/ACM International Symposium on Mixed and Augmented Reality. Nara, Japan, IEEE Computer Society. Krohn, A., M. Beigl, et al. (2005). Inexpensive and Automatic Calibration for Acceleration Sensors Ubiquitous Computing Systems: 245-258. Laberge, D. and J.-F. Lapointe (2003). An Auto-Calibrated Laser-Pointing Interface for Collaborative Environments. Ninth International Conference on Virtual Systems and Multimedia (VSMM 2003), Montréal, Québec, Canada. 168 BIBLIOGRAPHY Lamdan, Y. and H. J. Wolfson (1998). Geometric Hashing: A General and Efficient Model-Based Recognition Scheme. International Conference on Computer Vision (ICCV). Lamming, M. and D. Bohm (2003). SPECs: Personal Pervasive Systems. Computer. 36: 109-111. Lampe, M. and M. Strassner (2003). The potential of RFID for moveable asset management. Workshop on Ubiquitous Commerce at Ubicomp 2003. Lee, J. C., P. H. Dietz, et al. (2004). Automatic projector calibration with embedded light sensors. Proceedings of the 17th annual ACM symposium on User interface software and technology. Santa Fe, NM, USA, ACM. Lee, J. C., S. E. Hudson, et al. (2005). Moveable interactive projected displays using projector based tracking. Proceedings of the 18th annual ACM symposium on User interface software and technology. Seattle, WA, USA, ACM. Leibe, B. and B. Schiele (2004). Scale Invariant Object Categorization Using a ScaleAdaptive Mean-Shift Search. DAGM'04 Annual Pattern Recognition Symposium, Tuebingen, Germany, Springer LNCS. Leiva, L. A., A. Sanz, et al. (2007). Planar tracking using the GPU for augmented reality and games. ACM SIGGRAPH 2007 posters. San Diego, California, ACM. Lepetit, V. and P. Fua (2005). "Monocular model-based 3D tracking of rigid objects." Found. Trends. Comput. Graph. Vis. 1(1): 1-89. Letessier, J. and F. Bérard (2004). Visual tracking of bare fingers for interactive surfaces. Proceedings of the 17th annual ACM symposium on User interface software and technology. Santa Fe, NM, USA, ACM. Levas, A., C. Pinhanez, et al. (2003). 
An Architecture and Framework for Steerable Interface Systems. UbiComp 2003: Ubiquitous Computing. Seattle, Washington, USA, Springer LNCS. Levy, B., S. Petitjean, et al. (2002). "Least Squares Conformal Maps for Automatic Texture Atlas Generation." ACM Transactions on Graphics 21(3): 162-170. Lewis, J. and K. Potosnack (1997). Keys and Keyboards. Handbook of HumanComputer Interaction. M. Helander, T. Landauer and P. Prabhu. Amsterdam, North-Holland: 1285-1316. Li, P. and F. Chaumette (2004). Image Cues Fusion for Object Tracking Based on Particle Filter. International Workshop on Articulated Motion and Deformable Objects (AMDO04), Palma de Mallorca, Spain, LNCS Lindeberg, T. (1990). "Scale-Space for Discrete Signals." IEEE Trans. Pattern Anal. Mach. Intell. 12(3): 234-254. Lowe, D. G. (1992). "Robust model-based motion tracking through the integration of search and estimation." Int. J. Comput. Vision 8(2): 113-122. Lowe, D. G. (2004). "Distinctive Image Features from Scale-Invariant Keypoints." International Journal of Computer Vision 60(2): 91-110. Majumder, A. and G. Welch (2001). Computer Graphics Optique: Optical Superposition of Projected Computer Graphics. Eurographics Workshop on Virtual Environment/ Immersive Projection Technology. Malbezin, P., W. Piekarski, et al. (2002). Measuring ARToolkit Accuracy in Long Distance Tracking Experiments. 1st Int'l Augmented Reality Toolkit Workshop, IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR 2002). Darmstadt, Germany. Matas, J., O. Chum, et al. (2002). Robust wide baseline stereo from maximally stable extremal regions. British Machine Vision Conference, London, UK. 169 BIBLIOGRAPHY Matsumoto, T., D. Horiguchi, et al. (2006). Z-agon: mobile multi-display browser cube. CHI '06 extended abstracts on Human factors in computing systems. Montréal, Québec, Canada, ACM. Mikolajczyk, K. and C. Schmid (2002). An Affine Invariant Interest Point Detector. Proceedings of the 7th European Conference on Computer Vision-Part I, Springer-Verlag. Mikolajczyk, K. and C. Schmid (2004). "Scale & Affine Invariant Interest Point Detectors." Int. J. Comput. Vision 60(1): 63-86. Mikolajczyk, K. and C. Schmid (2005). "A Performance Evaluation of Local Descriptors." IEEE Trans. Pattern Anal. Mach. Intell. 27(10): 1615-1630. Mikolajczyk, K., T. Tuytelaars, et al. (2005). "A Comparison of Affine Region Detectors." Int. J. Comput. Vision 65(1-2): 43-72. Milgram, P. and F. Kishino (1994). "A Taxonomy of Mixed Reality Visual Displays." IEICE Transactions on Information Systems, E77-D(12). MitsubishiElectricCorporation. (2008). "Mitsubishi PK20 Pocket Projector Website." Retrieved 19th March, 2008, from http://global.mitsubishielectric.com/bu/projectors/products/home/pk20.html. Molyneaux, D. and H. Gellersen (2006). Cooperatively Augmenting Smart Objects with Projector-Camera Systems. 3rd IEEE International Workshop on ProjectorCamera systems (ProCams 2006) New York, USA. Molyneaux, D., H. Gellersen, et al. (2007). Cooperative Augmentation of Smart Objects with Projector-Camera Systems. UbiComp 2007: Ubiquitous Computing. Innsbruck, Austria, Springer LNCS. Molyneaux, D., H. Gellersen, et al. (2008). Vision-Based Detection of Mobile Smart Objects. 3rd IEEE European Conference on Smart Sensing and Context (EuroSSC). Zurich, Switzerland, Springer LNCS (to appear). Montemayor, A. S., J. J. Pantrigo, et al. (2004). Particle filter on GPUs for real-time tracking. ACM SIGGRAPH 2004 Posters. Los Angeles, California, ACM. Morishima, S., T. 
Yotsukura, et al. (2000). HyperMask: Talking Head Projected onto Real Object. International Conference on Multimedia Modeling (MMM 2000), Nagano, Japan. Murase, H. and S. K. Nayar (1995). "Visual Learning and Recognition of 3D Objects from Appearance." International Journal on Computer Vision 14(1): 5-24. Nachman, L., R. Kling, et al. (2005). The Intel Mote platform: a Bluetooth-based sensor network for industrial monitoring. Proceedings of the 4th international symposium on Information processing in sensor networks. Los Angeles, California, IEEE Press. Naimark and Foxlin (2002). Circular Data Matrix Fiducial Systems and Robust Image Processing for a Wearable Vision-Inertial self Tracker. IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR 2002). Darmstadt, Germany, September-October. Nakamura, N. and R. Hiraike (2002). Active Projector: Image correction for moving image over uneven screens. 15th Annual ACM Symposium on User Interface Software and Technology (UIST2002). Nakazato, Y., M. Kanabara, et al. (2004). Discreet Markers for User Localisation. Eighth International Symposium on Wearable computers (ISWC 2004). Arlington, VA, USA. Nayar, S. K., H. Peri, et al. (2003). A projection system with radiometric compensation for screen imperfections. ICCV Workshop on Projector-Camera Systems (PROCAMS). Nice, France. 170 BIBLIOGRAPHY Newman, J., A. Bornik, et al. (2007). Tracking for distributed mixed reality environments. IEEE VR 2007 Workshop on Trends and Issues in Tracking for Virtual Environments. Charlotte, NC, USA, IEEE Shaker Verlag. Norman, D. A. (1988). The Design of Everyday Things. Boston, Massachusets, MIT Press. Oberkampf, D., D. F. DeMenthon, et al. (1996). "Iterative pose estimation using coplanar feature points." Comput. Vis. Image Underst. 63(3): 495-511. Oh, J. Y. and W. Stuerzlinger (2002). Laser Pointers as Collaborative Pointing Devices. Graphics Interface 2002. Stürzlinger and McCool, AK Peters and CHCCS: 141149. OpenSG. (2007). "OpenSG Website." Retrieved January, 2007, from http://www.opensg.org/. Park, H., M.-H. Lee, et al. (2005). Specularity-Free Projection on Nonplanar Surface. Advances in Mulitmedia Information Processing - PCM 2005, Springer Berlin / Heidelberg. Park, H., M.-H. Lee, et al. (2006). "Surface-independent direct-projected augmented reality." Lecture Notes in Computer Science 3852, Computer Vision: 892-901. Park, H., M.-H. Lee, et al. (2007). Content Adaptive Embedding of Complementary Patterns for Nonintrusive Direct-Projected Augmented Reality. International Conference on Human-Computer Interaction (HCI International'07) Beijing, China. Park, H., M.-H. Lee, et al. (2006). "Undistorted Projection onto Dynamic Surface " Springer LNCS - Advances in Image and Video Technology 4319/2006: 582590. Park, H. and J. Park (2004). Invisible Marker Tracking for AR. Third IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR'04). Arlington, VA, USA. Pingali, G., C. Pinhanez, et al. (2003). Steerable Interfaces for Pervasive Computing Spaces. Proceedings of the First IEEE International Conference on Pervasive Computing and Communications, IEEE Computer Society. Pingali, G., C. Pinhanez, et al. (2002). User-Following Displays. IEEE International Conference on Multimedia and Expo 2002 (ICME'02). Lausanne, Switzerland. Pinhanez, C., R. Kjeldsen, et al. (2001). Transforming Surfaces into Touch-Screens, IBM Research Report Pinhanez, C., R. Kjeldsen, et al. (2002). Ubiquitous Interactive Graphics. 
Pinhanez, C., F. Nielsen, et al. (1999). Projecting computer graphics on moving surfaces: a simple calibration and tracking method. ACM SIGGRAPH 99 Conference abstracts and applications. Los Angeles, California, United States, ACM Press.
Pinhanez, C. and M. Podlaseck (2005). To Frame or Not to Frame: The Role and Design of Frameless Displays in Ubiquitous Applications. Ubicomp 2005. Tokyo, Japan.
Pinhanez, C. S. (2001). The Everywhere Displays Projector: A Device to Create Ubiquitous Graphical Interfaces. Proceedings of the 3rd international conference on Ubiquitous Computing. Atlanta, Georgia, USA, Springer-Verlag.
Podlaseck, M., C. Pinhanez, et al. (2003). On interfaces projected onto real-world objects. CHI '03 extended abstracts on Human factors in computing systems. Ft. Lauderdale, Florida, USA, ACM.
Pupilli, M. and A. Calway (2006). Real-Time Camera Tracking Using Known 3D Models and a Particle Filter. Proceedings of the 18th International Conference on Pattern Recognition - Volume 01, IEEE Computer Society.
Raij, A. and M. Pollefeys (2004). Auto-calibration of multi-projector display walls. International Conference on Pattern Recognition, Cambridge, UK, IEEE Computer Society.
Rapp, S., G. Michelitsch, et al. (2004). Spotlight Navigation: Interaction with a Handheld Projection Device. International Conference on Pervasive Computing 2004 Video Proceedings.
Raskar, R., P. Beardsley, et al. (2004). RFIG lamps: interacting with a self-describing world via photosensing wireless tags and projectors. ACM SIGGRAPH 2004. Boston, Massachusetts, ACM.
Raskar, R., P. Beardsley, et al. (2006). RFIG lamps: interacting with a self-describing world via photosensing wireless tags and projectors. ACM SIGGRAPH 2006 Courses. Boston, Massachusetts, ACM.
Raskar, R., H. Nii, et al. (2007). "Prakash: lighting aware motion capture using photosensing markers and multiplexed illuminators." ACM Transactions on Graphics 26(3): 36.
Raskar, R., J. VanBaar, et al. (2005). iLamps: geometrically aware and self-configuring projectors. ACM SIGGRAPH 2005 Courses. Los Angeles, California, ACM.
Raskar, R., G. Welch, et al. (1999). Spatially augmented reality. Proceedings of the international workshop on Augmented reality: placing artificial objects in real scenes. Bellevue, Washington, United States, A. K. Peters, Ltd.
Rekimoto, J. (2002). SmartSkin: an infrastructure for freehand manipulation on interactive surfaces. Proceedings of the SIGCHI conference on Human factors in computing systems: Changing our world, changing ourselves. Minneapolis, Minnesota, USA, ACM.
Rensink, R. A. (2000). "When Good Observers Go Bad: Change Blindness, Inattentional Blindness, and Visual Experience." Psyche 6(09).
Robertson, C. and J. Robinson (1999). Live paper: video augmentation to simulate interactive paper. Proceedings of the seventh ACM international conference on Multimedia (Part 2). Orlando, Florida, United States, ACM.
Rolland, J. P., Y. Baillot, et al. (2001). A Survey of Tracking Technology for Virtual Environments. Fundamentals of Wearable Computers and Augmented Reality. Mahwah, NJ, USA, Lawrence Erlbaum Assoc. Inc.: 67-112.
Rosten, E. and T. Drummond (2005). Fusing points and lines for high performance tracking. IEEE International Conference on Computer Vision, IEEE.
Rosten, E. and T. Drummond (2006). Machine learning for high-speed corner detection. European Conference on Computer Vision.
Rothganger, F., S. Lazebnik, et al. (2006). "3D Object Modeling and Recognition Using Local Affine-Invariant Image Descriptors and Multi-View Spatial Constraints." Int. J. Comput. Vision 66(3): 231-259.
(2006). "3D Object Modeling and Recognition Using Local Affine-Invariant Image Descriptors and Multi-View Spatial Constraints." Int. J. Comput. Vision 66(3): 231-259. Salvi, J., J. Pages, et al. (2004). "Pattern Codification Strategies in Structured Light Systems." Pattern Recognition 37(4): 827-849. Santos, P., A. Stork, et al. (2006). Innovative geometric pose reconstruction for markerbased single camera tracking. Proceedings of the 2006 ACM international conference on Virtual reality continuum and its applications. Hong Kong, China, ACM. 172 BIBLIOGRAPHY Schaffalitzky, F. and A. Zisserman (2001). Viewpoint invariant texture matching and wide baseline stereo. International Conference on Computer Vision (ICCV), Vancouver, BC, Canada, IEEE. Schiele, B. and J. L. Crowley (2000). "Recognition without Correspondence using Multidimensional Receptive Field Histograms." International Journal of Computer Vision (IJCV) 36(1): 31-50. Schmalstieg, D. and D. Wagner (2007). Experiences with Handheld Augmented Reality. The Sixth IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR 2007), Nara, Japan, IEEE and ACM. Schmidt, A., M. Kranz, et al. (2005). Interacting with the ubiquitous computer: towards embedding interaction. Proceedings of the 2005 joint conference on Smart objects and ambient intelligence: innovative context-aware services: usages and technologies. Grenoble, France, ACM. Schmidt, A., M. Strohbach, et al. (2002). Ubiquitous Interaction - Using Surfaces in Everyday Environments as Pointing Devices. 7th ERCIM Workshop "User Interfaces For All" Springer Verlag LNCS. Sheridan, J. G., B. W. Short, et al. (2003). Exploring Cube Affordance: Towards A Classification Of Non-Verbal Dynamics Of Physical Interfaces For Wearable Computing. IEE Eurowearable 2003, IEE Press. Birmingham, UK. Siegemund, F. and C. Flörkemeier (2003). Interaction in Pervasive Computing Settings Using Bluetooth-Enabled Active Tags and Passive RFID Technology Together with Mobile Phones. First IEEE International Conference on Pervasive Computing and Communications (PerCom), IEEE Computer Society. Sinha, S. N., J.-M. Frahm, et al. (2006). GPU-Based Video Feature Tracking and Matching. EDGE 2006, workshop on Edge Computing Using New Commodity Architectures. Chapel Hill. Sinha, S. N. and M. Pollefeys (2006). "Pan-tilt-zoom camera calibration and highresolution mosaic generation." Comput. Vis. Image Underst. 103(3): 170-183. Spassova, L. (2004). Fluid Beam -- Konzeption und Realisierung eines DisplayKontinuums mittels einer steuerbaren Projektor-Kamera-Einheit. Fakultät für Informatik. Saarbrücken, Germany, Universität des Saarlandes. http://w5.cs.unisb.de/publication/file/244/Mira_Spassova_Diplom_Fluid_Beam.pdf. Spengler, M. and B. Schiele (2001). Towards Robust Multi-cue Integration for Visual Tracking. Proceedings of the Second International Workshop on Computer Vision Systems, Springer-Verlag. Stauffer, C. and W. E. L. Grimson (1999). Adaptive background mixture models for real-time tracking. Computer Vision Pattern Recognition (CVPR). Strohbach, M. (2004). The Smart-its Platform for Embedded Context-Aware Systems. First International Workshop on Wearable and Implantable Body Sensor Networks. London. Strohbach, M., H.-W. Gellersen, et al. (2004). Cooperative Artefacts: Assessing Real World Situations with Embedded Technology. Ubicomp2004. Nottingham, England, Springer LNCS. Strzodka, R., I. Ihrke, et al. (2003). 
Sukaviriya, N., M. Podlaseck, et al. (2003). Embedding Interactions in a Retail Store Environment: The Design and Lessons Learned. Ninth IFIP International Conference on Human-Computer Interaction (INTERACT'03). Zurich, Switzerland.
Sukthankar, R., R. Stockton, et al. (2001). Smarter Presentations: Exploiting Homography in Camera-Projector Systems. International Conference on Computer Vision. Vancouver, Canada.
Summet, J., M. Flagg, et al. (2007). "Shadow Elimination and Blinding Light Suppression for Interactive Projected Displays." IEEE Transactions on Visualization & Computer Graphics (TVCG) 13(3).
Summet, J. and R. Sukthankar (2005). Tracking Locations of Moving Hand-held Displays Using Projected Light. Proceedings of Pervasive 2005. Munich, Germany, Springer LNCS.
Swain, M. J. and D. H. Ballard (1991). "Color indexing." International Journal of Computer Vision 7(1): 11-32.
Terrenghi, L., M. Kranz, et al. (2006). "A cube to learn: a tangible user interface for the design of a learning appliance." Personal and Ubiquitous Computing Journal 10(2-3): 153-158.
TexasInstruments. (2005). "Texas Instruments DLP Website." Retrieved 10th March, 2005, from www.dlp.com.
Turk, M. and A. Pentland (1991). "Eigenfaces for Recognition." Journal of Cognitive Neuroscience 3(1): 71-86.
Underkoffler, J., B. Ullmer, et al. (1999). Emancipated pixels: real-world graphics in the luminous room. Proceedings of the 26th annual conference on Computer graphics and interactive techniques, ACM Press/Addison-Wesley Publishing Co.
UPnP(TM)Forum, C. M. o. t. (2003). "Welcome to the UPnP(TM) Forum." Retrieved 27/04/05, 2005, from http://www.upnp.org/.
Vacchetti, L. and V. Lepetit (2004). "Stable Real-Time 3D Tracking Using Online and Offline Information." IEEE Trans. Pattern Anal. Mach. Intell. 26(10): 1385-1391.
Vacchetti, L., V. Lepetit, et al. (2004). Combining Edge and Texture Information for Real-Time Accurate 3D Camera Tracking. Proceedings of the 3rd IEEE/ACM International Symposium on Mixed and Augmented Reality, IEEE Computer Society.
Valli, A. (2005). "Notes on natural interaction." Retrieved 17th March, 2005, from www.naturalinteraction.org.
VanLaerhoven, K. (2006). "CommonSense Project." Retrieved 01/01/08, from http://eis.comp.lancs.ac.uk/index.php?id=commonsense.
VanLaerhoven, K., N. Villar, et al. (2003). Using an autonomous cube for basic navigation and input. Proceedings of the 5th international conference on Multimodal interfaces. Vancouver, British Columbia, Canada, ACM.
VanRhijn, A. and J. D. Mulder (2005). An Analysis of Orientation Prediction and Filtering Methods for VR/AR. Proceedings of the 2005 IEEE Conference on Virtual Reality, IEEE Computer Society.
Wagner, D. and D. Schmalstieg (2006). Handheld Augmented Reality Displays. Proceedings of the IEEE conference on Virtual Reality, IEEE Computer Society.
Weiser, M. (1991). The computer for the 21st century. Scientific American. 3: 94-104.
Weiser, M. (1996). "Ubiquitous Computing." Retrieved 1st April, 2005, from http://www.ubiq.com/hypertext/weiser/UbiHome.html.
Weiser, M. and J. S. Brown (1996). "Designing Calm Technology." PowerGrid Journal 1(1).
Welch, G. and G. Bishop (1997). SCAAT: incremental tracking with incomplete information. Proceedings of the 24th annual conference on Computer graphics and interactive techniques, ACM Press/Addison-Wesley Publishing Co.
Wellner, P. (1993). "Interacting with paper on the DigitalDesk." Commun. ACM 36(7): 87-96.
Wilson, A. D. (2004). TouchLight: an imaging touch screen and display for gesture-based interaction. Proceedings of the 6th international conference on Multimodal interfaces. State College, PA, USA, ACM.
Wilson, A. D. (2005). PlayAnywhere: a compact interactive tabletop projection-vision system. Proceedings of the 18th annual ACM symposium on User interface software and technology. Seattle, WA, USA, ACM.
Wolfson, H. J. (1990). Model-based object recognition by geometric hashing. Proceedings of the European Conference on Computer Vision (ECCV'90), Springer LNCS.
Yang, R., D. Gotz, et al. (2001). PixelFlex: a reconfigurable multi-projector display system. Proceedings of the conference on Visualization '01. San Diego, California, IEEE Computer Society.
Yang, R. and G. Welch (2001). Automatic Projector Display Surface Estimation Using Every-Day Imagery. 9th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision. Plzen, Czech Republic.
You, S. and U. Neumann (2001). Fusion of Vision and Gyro Tracking for Robust Augmented Reality Registration. IEEE Virtual Reality Conference 2001 (VR 2001).
You, S., U. Neumann, et al. (1999). Hybrid Inertial and Vision Tracking for Augmented Reality Registration. Proceedings of the IEEE Virtual Reality, IEEE Computer Society.
Zhang, X., S. Fronz, et al. (2002). Visual Marker Detection and Decoding in AR Systems: A Comparative Study. International Symposium on Mixed and Augmented Reality (ISMAR'02). Darmstadt, Germany.
Zhang, Z. (2000). "A Flexible New Technique for Camera Calibration." IEEE Trans. Pattern Anal. Mach. Intell. 22(11): 1330-1334.

APPENDICES

Appendix A Steerable Projector-Camera System Construction

This appendix describes the design and assembly of two steerable projector-camera systems. We provide an overview of the design characteristics of typical systems and present recommendations for those wanting to build similar equipment. The hardware design used in this research is inspired by four closely related projects: the Everywhere Display by Pinhanez et al., the FLUIDbeam project by Butz et al., the Projected Augmentation projector by Ehnes et al. and the PRIMA project by Borkowski et al. [Pinhanez 2001; Borkowski, Riff et al. 2003; Butz, Schneider et al. 2004; Ehnes, Hirota et al. 2004]. Consequently, we present an example characterisation of one of the steerable projector systems and compare it to other commercial and research systems.

A.1 Steerable Projector-Camera System Design

Steerable projector-camera systems share three common classes of components from which the system is constructed:
1. Steering Mechanism
2. Video Projector
3. Camera
The major characteristics of these components are identified in the sections below to allow objective measurement of the performance of steerable projectors and to allow comparison of strengths and weaknesses in different approaches and implementations.

A.2 Steering Mechanism

The two most popular methods to create a steerable projector from a fixed projector are to use either a moving mirror in front of the projector lens or to mount the projector in a moving head yoke. The steering mechanisms currently used in steerable projectors have their roots in the display lighting industry.
The mechanisms are usually contained in moving head lights and moving mirror scanners used for dynamic lighting on stage, in discos, or in retail and entertainment environments. These lights are typically controlled by the unidirectional DMX512/1990 serial protocol (often shortened to DMX).

Figure A.1 (left) Moving mirror display light, (right) Moving head display light [SteinigkeShowtechnicGmbH 2005]

Both types of steering mechanism provide two degrees of freedom - pan and tilt. This freedom is implemented in both systems as rotation of an object (mirror or projector) around the respective axes. The world axis around which this rotation occurs changes depending on the orientation of device mounting, so when referring to pan and tilt in this document we assume it is relative to the local steering mechanism mounting. Typically the steering mechanism is mounted so that pan is movement in the horizontal direction and tilt is movement in the vertical plane.

The moving mirror approach was first implemented by IBM for their Everywhere Display prototype in 2000 [Pinhanez 2001] and later by UNC in their PixelFlex reconfigurable multi-projector display [Yang, Gotz et al. 2001]. The moving head approach is used by the majority of steerable projector implementations. First proposed by Nakamura and Hiraike [Nakamura and Hiraike 2002], this steering method is implemented in the FLUIDbeam project by Butz et al., the Projected Augmentation projector by Ehnes et al. and the PRIMA project by Borkowski et al. [Pinhanez 2001; Borkowski, Riff et al. 2003; Butz, Schneider et al. 2004; Ehnes, Hirota et al. 2004]. Purely optical beam steering mechanisms, such as lenses, spatial light modulators or Holographic Optical Elements (HOE), are all possible. However, unlike the mechanical steering methods, such systems are generally not commercially available and hence will not be discussed further.

A.2.1 Steering Mechanism Characteristics

We identified 12 important characteristics of steering mechanisms:

1. Mechanism Rotation Speed
Faster rotation capability is better, allowing rapid tracking of objects at close range and fast movement between spatially distant objects. However, high speed is not an absolute requirement, as the continuous small movements in different directions made once actively tracking an object may never reach full rotation speed due to the acceleration and deceleration time required. These types of movement are governed more by the mechanism acceleration and inertia characteristics. We measure both moving mirror and moving head angular rotation speed in degrees per second.

2. Mechanism Acceleration and Inertia
System inertia is dependent on the mass that needs to be rotated - in the case of moving head systems a projector is much more massive than a mirror, so will have higher inertia. High inertia systems are more likely to exhibit control problems such as overshoot when attempting to position accurately. Lower inertia systems are more responsive, as they allow faster acceleration and more rapid changes in direction. We measure angular acceleration in degrees per second squared, but this is a difficult quantity to measure in practice without accurate position feedback from the steering mechanism or accelerometers mounted on it.
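To illustrate how acceleration limits dominate the small corrective movements made while tracking, the following Python sketch estimates movement time under a simple trapezoidal velocity profile. The maximum speed and acceleration values used are illustrative assumptions only, not measured properties of any particular yoke.

def rotation_time(angle_deg, max_speed=180.0, accel=360.0):
    """Estimate the time (s) to rotate through angle_deg using a trapezoidal
    velocity profile: accelerate at accel (deg/s^2) up to max_speed (deg/s),
    cruise, then decelerate at the same rate."""
    ramp_angle = max_speed ** 2 / accel  # angle consumed accelerating plus braking
    if angle_deg < ramp_angle:
        # Small movement: full speed is never reached (triangular profile)
        return 2.0 * (angle_deg / accel) ** 0.5
    return 2.0 * max_speed / accel + (angle_deg - ramp_angle) / max_speed

for angle in (5, 45, 180, 360):
    print(f"{angle:3d} deg move: ~{rotation_time(angle):.2f} s")

With these example values a 5° correction still takes roughly a quarter of a second even though a 360° pan takes only a few seconds, which is why acceleration and inertia, rather than peak speed, govern tracking movements.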
3. Mechanism Field of View
From a fixed location, many moving head systems are capable of panning and tilting to cover an area greater than a hemisphere. However, moving mirror systems have a more limited field of view due to the optical arrangement of the mirror and projector, as shown in Figure A.2 (left). Here, the rotation in pan is relatively unconstrained except for physical obstruction by the mechanism mounting bracket. Rotation in tilt is much more limited - both by the requirement to keep the reflective mirror surface towards the projection beam when tilting up, and by the projector body itself occluding the light when tilted down. Field Of View (FOV) is measured in degrees of coverage for each axis separately (pan and tilt). A larger field of view is preferable, as it allows the flexibility for a single steerable projector system to create displays in a larger spatial area.

Figure A.2 (left) The IBM Everywhere Display Steerable Projector Design, (right) Everywhere Display Projection Cone [Pinhanez 2001]

4. Mechanism Positioning Accuracy
The steering mechanism will have a mechanical accuracy determined by the mechanism FOV angles in pan and tilt and the control interface resolution. On display lighting the DMX control interface is either 8-bit or 16-bit, allowing either 256 or 65536 discrete positions to be resolved respectively. These discrete positions are the number of possible angular positions the pan and tilt unit can be commanded to occupy in each axis. For systems with a high mechanical FOV and a low resolution control interface the angle between discrete positions will be large, and small movements between the commanded positions will appear jerky. Conversely, with a low mechanical FOV and high control resolution, the angular resolution is much higher, hence changes in display position will look smooth. We measure the angular positioning resolution in degrees by dividing the mechanical FOV by the number of addressable control positions (e.g. 360° FOV / 256 steps = 1.4° per step). High accuracy positioning (a lower numerical figure) is recommended, as it allows displays to be located more accurately and gives less jerky display movement when projecting at distance.
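The angular resolution calculation above is straightforward to script; the following sketch reproduces the 8-bit example from the text and, for comparison, a 16-bit interface over a 540° pan range similar to the yoke described later in this appendix.

def degrees_per_step(fov_deg, control_bits):
    """Angular positioning resolution: mechanical FOV divided by the number
    of addressable positions of the control interface."""
    return fov_deg / (2 ** control_bits)

print(degrees_per_step(360, 8))    # 8-bit control over 360 deg: ~1.4 deg per step
print(degrees_per_step(540, 16))   # 16-bit control over 540 deg: ~0.008 deg per step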
5. Mechanism Positioning Repeatability
While the positioning accuracy measures how accurately a position can be specified, the repeatability measures how closely the mechanism returns to the same commanded position following a movement. Over time mechanical systems exhibit wear, hence repeatability is determined both by the mechanical properties of the steering mechanism and by the ability of the built-in position feedback sensing system to correctly measure its rotation. The control interface uses a closed-loop control method, so will command the mechanism to move to a specific position and monitor the movement until it reaches the correct position as determined by the position feedback sensors. These sensors are typically optical break-beam sensors for moving head systems and resistive sensors for the servos in moving mirror systems. The optical sensors use a circular disk with cut holes to create pulses as they pass through the break-beam sensor. By dividing the number of detected holes by the known angular range (or required numerical control range) between the end stops, the yoke system can move to an absolute position by simply moving until it reaches the correct number of holes from the end stop. The resistive sensors are typically potentiometers attached to the rotating spindle which encode position by a change in resistance. This is converted to a change in voltage using a voltage divider circuit and compared with the input voltage used to control the servos. For a repeatable position the mechanism rotation motors and position feedback sensors need to be accurate enough to resolve to the same accuracy as the control system. The control system must also be able to dynamically manage under- and overshoot as rotation stops; otherwise it exhibits oscillation around the required position (hunting). A steering mechanism with high repeatability is preferable, as it allows displays to return to the same spatial locations consistently.

6. Position Stability
Positional stability measures how well a system can hold an exact position without un-commanded movement occurring. This characteristic is determined partly by the torque characteristics of the positioning motors, partly by the mass of the load the motors must hold (projector or mirror) and partly by whether any other external forces are applied. For stability, any jitter in the control and position feedback systems must be eliminated. However, using this definition, stability is difficult to measure quantitatively - a steering mechanism is generally either stable or not (for example, if a projector is too heavy for a moving head steering mechanism). Any useful system is required to have positional stability in all possible orientations. Positional stability could also be defined as the steering mechanism and mounting method's response to rotational movement - for example, whether a panning projector using the moving head method causes an unwanted structural response when it starts and stops due to its inertia. Again this is difficult to quantify except as the magnitude of the structural response (which would require mounting accelerometers), or possibly as the time taken for the response to settle.

7. Size
For a moving mirror mechanism, the size of a steerable projector is determined mainly by the size of the projector. Smaller projectors have smaller lenses, allowing correspondingly smaller mirrors and mechanisms to steer the beam. Moving head systems are necessarily larger than the projector itself due to their design, which encapsulates the projector at the centre of rotation. Smaller is generally better, as it allows easier mounting and better portability. Size is best measured as the volume occupied by the system in cubic metres.

8. Weight
The weight of a steering mechanism, measured in kilograms (kg), will determine how it can be mounted and how portable it will be. Due to their small size, moving mirror systems can be very light (hundreds of grams), whereas moving head systems are generally in the tens of kilograms range. Lighter is better, as it increases the portability and mounting options.

9. Mounting Options
For fixed steerable projectors, the type of mounting required will depend heavily on the size, weight and portability of the steering mechanism and projector. The environment will also play a large role, as there is a big structural difference between mounting directly to a concrete ceiling and mounting to a false ceiling. Generally, a lighter weight system will be more flexible in mounting location than a heavier one.

10. Portability
Although current steerable projectors generally have fixed mountings, portable steerable projector systems would share characteristics with the handheld projectors and wearable projectors described in related work section 2.3.2.

11. Image Distortion
Although not easily quantifiable, different beam steering methods cause different amounts and types of distortion in a projected image.
For example, moving mirror systems will exhibit more rotation distortion of the uncorrected image than moving head systems, due to the optical arrangement caused by the fixed projector and moving mirror, as seen in Figure A.2 (right). Less distortion is always preferred, which is accomplished by projecting as orthogonally to the display surface as possible.

12. Cost
There is a price-performance trade-off, as more expensive systems generally have better positioning accuracy, stability, repeatability and build quality. An acceptable balance must always be found between these specifications and price.

A.2.2 Moving Mirror and Moving Head Comparison

Moving Mirror
Benefits
• Very fast rotation speed possible, with fast acceleration due to low inertia
• Small size and volume
• Lightweight unit, easily mounted - possible to "bolt on" to the projector
• Can be used with a portable stand or with the new generation of small portable projectors, as mirror size scales with projector lens size
• Low cost
• Camera has a separate pan and tilt unit, allowing multitasking (e.g. object searching) while not required for interaction or calibration
Problems
• Restricted field of view (typically around 230° pan, 50° tilt [Pinhanez 2001]), requiring location high in a corner or high along the centre of a wall for largest coverage
• Only limited 8-bit accuracy pan/tilt positioning units available
• Increased image distortion due to the angled mirror surface in the light path (primarily causing image rotation distortion)
• Camera requires a separate pan and tilt unit, leading to calibration, registration and coordination problems with the projector

Moving Head
Benefits
• Fast rotation speed is possible
• Good for a central location in large environments, as many pan and tilt units cover greater than a hemisphere (e.g. 360° coverage in pan and 270° in tilt)
• Less distortion than moving mirror due to direct projection on surfaces
• Camera can be directly attached to the projector, so projector-camera calibration can be performed once, with no further registration and coordination problems
Problems
• High inertia can cause poor acceleration, precluding very fast object tracking
• Large steering mechanism volume
• Heavy unit, requiring professional installation on a wall or ceiling
• No portability
• High cost
• Camera is attached to the projector, so cannot be used for anything else (e.g. searching) while the projector is static

A.3 Video Projector

The video projector is required to display information on the surfaces of objects as visibly as possible. We identified 10 projector characteristics to take into account:

1. Brightness
The brightness of a projector is typically measured in American National Standards Institute (ANSI) Lumens - a measurement of the total amount of light a projector can produce when projecting a white image. To be clearly visible in ambient light the projector must output a greater amount of light to a given surface area than the ambient lighting. The perceived brightness of a projected image will vary depending on the size of the projection and the reflective properties of the projection surface. The larger the projected image, the darker it will appear, as the fixed lamp light output is spread over the increasingly larger area of projection surface.
The reflectivity of the projection surface will also have an impact, as projection on low reflectivity surfaces (dark, diffusing, matt surfaces) will appear substantially darker than on surfaces with higher reflectivity (bright, reflecting, glossy surfaces). However, there is a trade-off, as very reflective surfaces will also increase the probability of "hotspots", where the projector and other light sources are clearly visible as bright reflection spots on the surface. This reflectivity measurement is expressed as "gain" for commercial projection screens; however, as steerable projectors just use everyday surfaces in the environment, no surface gain measurements are typically available.

2. Contrast
Contrast (also known as dynamic range) is the apparent difference between projected white light and the absence of projected light (black). In the presence of normal office or home illumination an absolute black (i.e. total darkness) cannot be attained by front projection, as there is always ambient light reflecting from the display surface. Ambient light reduces the difference between the projector's maximum white and black (i.e. no projection), reducing its contrast and making the image look "washed out". Higher contrast ratios are preferable for better readability of the display. One method of increasing both the brightness and contrast of projected displays was demonstrated by Majumder and Welch [Majumder and Welch 2001], who superimposed two identical projector displays to double the maximum brightness. By changing the display addressing system this also allowed high dynamic range colour images to be displayed.
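To relate an ANSI lumen rating to perceived display brightness, a rough illuminance estimate can be made by spreading the light output over the projected area and comparing it with the ambient level. The sketch below ignores surface reflectivity and lens losses, and the ambient figure is an assumed typical office value, not a measurement.

def projected_illuminance(ansi_lumens, width_m, height_m):
    """Approximate illuminance (lux) on the surface for a full white image:
    luminous flux divided by projected area (ignores surface gain)."""
    return ansi_lumens / (width_m * height_m)

ambient_office_lux = 400  # assumed typical office lighting level
lux = projected_illuminance(2800, 1.0, 0.75)  # 2800 Lumen projector, ~1 m wide image
print(f"~{lux:.0f} lux projected vs ~{ambient_office_lux} lux ambient")

On this estimate the projected white level is roughly an order of magnitude above office ambient for a 1 m wide image, but the margin shrinks quickly as the image grows.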
3. Resolution
As steerable projectors are frequently projecting off-axis or onto rotated objects, there can be significant amounts of geometric distortion in the projected image. The correction methods discussed in related work section 2.3.6 distort the projected image, which reduces the effective resolution we are projecting. Pinhanez et al. claimed their effective resolution was typically reduced from 1024x768 pixels to 640x480 pixels following geometric correction [Pinhanez, Kjeldsen et al. 2002] in their moving mirror Everywhere Display system. Consequently, the greater the resolution of the projector we start with, the better the final image will look.

4. Projector Technology - LCD versus DLP
Liquid Crystal Displays (LCD) are the traditional technology used in projectors - they are relatively cheap, widely available and create a stable image. Projectors typically contain three LCD panels - one for each colour (Red, Green, Blue), each around 1-2 inches in size. Texas Instruments developed a single chip Digital Light Processor (DLP) engine for projectors using a chip less than an inch square with the surface covered in millions of miniature reflecting mirrors, allowing projector size to be reduced substantially. Without the need for three separate colour panels and integrating optics, smaller and more lightweight projectors are possible with brightness equivalent to an LCD projector. However, there is one major drawback with consumer DLP projectors - they possess only a single DLP chip, so a rotating colour wheel must be used to display frames of red, green and blue colour time-sequentially, as shown in Figure A.3 (left). In many models a fourth transparent area is also used on the colour wheel to boost light output at the cost of colour resolution, as humans are more sensitive to changes in luminance than colour. These frames are integrated and seen as a full colour image by the human eye.

Figure A.3 (left) Single chip DLP projector optics [TexasInstruments 2005], (right) Mitsubishi Pocket Projector [MitsubishiElectricCorporation 2008]

However, cameras can have an exposure shorter than the integration period, so may see only a single colour frame or partial frame. To see the whole frame a camera must either be explicitly synchronised with the projector, or the exposure increased to integrate all the colour components (16.7ms for a 60Hz projector refresh). This problem mainly exists when the camera auto-exposes to the projector illumination - typically in darker environments where the projector illumination is brighter than the ambient illumination, or where the projected image fills the majority of the camera frame. Hence, the problem can be reduced by manually setting the camera exposure rather than relying on auto-exposure.

5. Lamp Life
The use of a traditional video projector in steerable projectors raises issues with the limited lamp life (generally in the region of a few thousand hours), the high replacement cost and large size. Companies are now producing Light Emitting Diode (LED) projectors and LASER based projectors small enough to fit in the palm of a hand, with light sources that last over 20,000 hours (i.e. the design lifetime of the projector). Although the current projectors are only low brightness and resolution, this technology should be expected to replace metal halide lamps in video projectors. An example LED projector can be seen in Figure A.3 (right).

6. Lens Location
Lens location is another consideration for steerable projector systems. If using a moving head system then the optical Centre of Projection (COP) should ideally be at the Centre Of Rotation (COR) of the pan and tilt unit to make calibration easier. In this spatial arrangement only the rotation component of the projector to world coordinate system transformation changes with the rotation of the projector, rather than a combined rotation and translation for offset lenses. Consequently, a centre lens projector design should use a dual fork yoke, and an offset lens projector should ideally use a single arm yoke so the COP is always at the COR.

7. Powered Focus
Projectors generally exhibit a good depth of field, with Pinhanez finding that surfaces up to 30° inclination relative to the projector axis generally remain in focus [Pinhanez, Kjeldsen et al. 2002]. However, computer controlled powered focus allows projection at dynamically varying depths, which is useful when tracking mobile objects. When available, LASER based projectors will not require any focus control. Their coherent and collimated light output remains in focus at all depths, allowing projected displays completely in focus on objects at extremely oblique angles and simultaneous projection on foreground and background objects with large differences in distance.

8. Powered Zoom
Computer controlled powered zoom allows different sized displays to be projected at different distances. Projector zoom is measured as a ratio between the largest and smallest image produced with the zoom at wide and telephoto respectively. For example, 1:1.3 indicates the image can be resized by 30%. Higher zoom ratios enable projection of smaller images at longer distances (giving higher resolution) or larger images at close distances (but with lower resolution). It is useful to know what display size and resolution can be expected from a projector at each distance, as this will determine what size of objects can be projected on. The actual achieved display resolution will vary with the projector FOV, projector zoom, distance to the display surface, and its size and orientation (due to geometric correction), but can be calculated with known object geometry, pose and projector intrinsic parameters.
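A first-order version of this calculation (ignoring surface orientation and geometric correction) is sketched below: the projected image width and pixel density follow from the throw distance and the throw ratio. The throw ratios used are illustrative assumptions for the wide and telephoto ends of a 1:1.3 zoom, not the specification of any projector used in this work.

def image_width_m(throw_distance_m, throw_ratio):
    """Projected image width (m) for a given throw distance and throw ratio
    (throw ratio = throw distance / image width)."""
    return throw_distance_m / throw_ratio

def pixels_per_cm(throw_distance_m, throw_ratio, horizontal_pixels=1024):
    """Approximate horizontal pixel density on a fronto-parallel surface."""
    return horizontal_pixels / (image_width_m(throw_distance_m, throw_ratio) * 100.0)

for ratio in (1.5, 1.95):  # assumed wide and telephoto throw ratios
    print(f"throw ratio {ratio}: {image_width_m(2.8, ratio):.2f} m wide image, "
          f"{pixels_per_cm(2.8, ratio):.1f} px/cm at 2.8 m")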
9. Weight
The weight of a projector is generally proportional to its brightness and cost. Projectors over 3500 Lumens typically cost significantly more and weigh more, as they are designed for large displays and may incorporate two lamps. The projector weight will partly determine which steering method is used - heavier projectors are difficult to use with a moving head steering mechanism, as they have a high mass and hence a high inertia. Conversely, high weight is not a problem for moving mirror systems, as the projector itself is statically mounted and only the mirror moves.

10. Cost
The cost of consumer projectors is generally proportional to their resolution and brightness. All other things being equal, a lower cost projector is preferable.

A.4 Camera

We identified 7 relevant characteristics of typical cameras:

1. Camera Type and Mounting
Steerable projector-camera systems use the camera for display calibration and to detect user interaction, hence the steerable projector camera requires a view of the display surface. An identical view can be provided by making the projector and camera co-axial using a partially reflective optical beam-splitter, as discussed by Fujii et al. [Fujii, Grossberg et al. 2005] and seen in Figure A.4 (left). However, use of a beam-splitter reduces the light output from the projector, and the co-axial optics prevent use of the camera with structured light geometry calibration methods, which require an offset camera to recover the surface geometry, as discussed in related work section 2.3.6 and section 6.4. Instead, moving head steering systems can have a camera mounted directly to the projector, close to the projector lens. This location gives an on-axis view similar to that from a co-axial camera, but avoids the requirement for additional optical elements. The lens offset also typically provides a large enough baseline to be used in structured light calibration. As it is difficult to mount a camera to a moving mirror without adversely affecting the steering performance, in this case a separate pan and tilt camera must be used. This has the additional benefit of allowing the camera to be used separately from the projector, for example, by scanning an environment for objects or for activity detection. However, this approach does introduce registration and coordination problems between the projector display and camera view.

2. Camera Resolution
The spatial resolution of a camera is dependent upon the resolving power of the lens, the Field Of View (FOV) of the lens and the pixel resolution of the image sensor. Traditionally, the measure of resolution is described by the camera's resolving power, expressed in line pairs per millimetre (lp/mm). This is evaluated as the ability to distinguish distinct line patterns from standard test charts, such as that shown in Figure A.4 (centre).

Figure A.4 (left) A Co-axial projector-camera system [Fujii, Grossberg et al. 2005], (centre) Camera Resolution Test Chart [Geoffrion 2005], (right) Bayer Pattern Colour Filter Array [Geoffrion 2005]
A higher resolution camera (i.e. higher quality lens, smaller lens FOV, or higher pixel resolution sensor) will allow detection of objects at greater distances and gives a generally "clearer" picture of the environment. However, large objects may be too big to fit in the frame when close to the camera with a smaller lens FOV, and increasing resolution with a higher pixel resolution sensor incurs an increased processing cost. For steerable projectors a pixel resolution of 640x480 pixels or higher is recommended, while the lens FOV required is dependent on the environment dimensions and detection scenario.

3. Video Frame Rates
Typical video frame rates are either 25fps for PAL video in Europe or 30fps for NTSC video in the USA and Canada. The use of cameras capable of video frame rates is important in dynamic environments, as it allows applications such as object tracking and interaction detection based on movement. However, the actual frame rate achieved is both a function of the exposure length (long exposures for darker environments cause low frame rates) and the data transfer time to the computer. Similarly, while it is possible to purchase very high resolution machine vision cameras, due to bandwidth restrictions most are either only capable of very low frame rates at full resolution or very low resolution at video frame rates. For example, interfaces such as USB 2.0 (480 Mbit/s) and IEEE-1394 FireWire 400 (400 Mbit/s) can only capture a resolution of 1280x1024 at a maximum of 27fps. Commercial machine vision cameras present another solution to increase frame rates through binning pixels or region of interest limiting. Binning achieves a higher frame rate by allowing a sensor to capture information at full resolution then typically halving the resolution transferred to the computer by spatially averaging 4 pixels into one. Despite the averaging, this approach still gives a higher quality result than using a lower resolution sensor camera. In contrast, region of interest limiting allows a small rectangular sub-area of a CMOS camera sensor to be chosen for transfer to the computer, again limiting the final image resolution, but increasing the frame rate. The chosen area can also be dynamically shifted around on the sensor, allowing the potential for very high speed object tracking using small detection windows within the camera's field of view (e.g. 100fps for a 100x100 window). However, both these solutions typically require use of a camera manufacturer's proprietary software API.
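The bandwidth restriction can be checked directly: the sketch below gives an upper bound on the raw frame rate for a given resolution and bus bandwidth, assuming 8 bits per pixel of raw (Bayer) data and ignoring protocol overhead.

def max_fps(width, height, bus_mbit_per_s, bits_per_pixel=8):
    """Upper bound on frame rate for raw image transfer over a bus with the
    given nominal bandwidth (Mbit/s), ignoring protocol overhead."""
    return (bus_mbit_per_s * 1_000_000) / (width * height * bits_per_pixel)

print(f"1280x1024 over FireWire 400: {max_fps(1280, 1024, 400):.0f} fps upper bound")
print(f"640x480 over FireWire 400: {max_fps(640, 480, 400):.0f} fps upper bound")

The 27fps quoted above for 1280x1024 capture is consistent with this bound of roughly 38fps once isochronous transfer and packetisation overheads are taken into account.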
4. Colour or Grayscale
Photodiodes in camera sensors are not sensitive to colour, only luminance. To enable colour sensitivity either one image sensor per colour is used, or for single sensor cameras a colour filter array (CFA) is placed over the pixel array of the sensor. The most common CFA uses the Bayer pattern, which uses twice as many green filters as red or blue (as the human eye is more sensitive to green light), as shown in Figure A.4 (right). When processing the raw information provided by each photodiode, the camera analyses each pixel colour value in a neighbourhood and spatially interpolates to create an approximation of the original full-colour image. However, due to this interpolation colour errors can be introduced and the effective image resolution reduced. The use of a colour sensor is recommended though, as it allows object detection and tracking based on colour (which is a very useful cue, as shown in Chapter 5).

5. Dynamic Range
Photodiodes are only sensitive to a certain range of luminance, called the dynamic range (or contrast range). Anything outside this range is clipped to either pure black or pure white. Hence, a camera sensor capable of resolving a higher dynamic range will capture more information in an image. One factor affecting the dynamic range is the size of the camera sensor, as larger photodiodes can integrate light over a greater area and hence capture greater differences in the total amount of light reaching the sensor. Consequently, cameras with small ¼ inch square sensors (such as cheap web cameras) will not exhibit as high a dynamic range as digital SLR cameras with one inch square sensors. Some commercial CMOS machine vision cameras also allow programmable sensitivity, where the camera has a non-linear response to light. Here a camera can be set to have a linear response in darker areas (or even a boosted response to increase shadow brightness) and a reduced response in highlight areas, allowing detail to be captured over a much larger dynamic range.

6. Optical Distortion
All optical imaging systems introduce distortion into a captured image. The most common distortions in lenses are barrel distortion or pincushion distortion (where straight lines appear curved) and chromatic aberration (where light of different colours converges at different spatial locations). Vignetting is possible in wider angle lenses, causing a noticeable darkening of the image towards its corners. Similarly, in colour digital imaging systems there is also the possibility of purple fringing of objects against high contrast backgrounds, which is a combination of chromatic aberration and blooming on the camera sensor where the photodiodes are overloaded. Many of these distortions can be modelled mathematically and corrections applied to the image, as discussed in related work section 2.6.4. Although higher quality optics cost more, they generally exhibit less distortion, hence require less calibration, modelling and correction.

7. Lens Field of View and Zoom
The Field Of View (FOV) of a camera lens will partially determine its spatial resolution, as described above. A larger FOV from wide-angle lenses allows a system to detect and track objects over a greater area, but at the cost of a lower spatial resolution. Many cameras have fixed focal length lenses, requiring the lens to be physically replaced to change the FOV to fit the requirement of the environment - for example, using a wide-angle lens in small environments or a telephoto lens when the steerable projector is located far from the objects of interest. One solution is to use a computer controlled zoom lens to extend the camera's useful range. Zoom is beneficial as it allows a single lens to be used for both wide field of view, low spatial resolution applications (tracking of objects close to the camera), and small field of view, high spatial resolution applications (tracking objects at great distances). However, although flexible, typical powered zoom lenses are much larger and heavier than their fixed focal length counterparts.
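To connect lens FOV and sensor resolution to detection range, the following sketch estimates how many pixels an object spans at a given distance using a simple pinhole model. The 1280-pixel, 40° configuration matches the camera selected in section A.7.3; the 20cm object size is an arbitrary example.

import math

def pixels_on_object(object_size_m, distance_m, sensor_pixels=1280, fov_deg=40.0):
    """Approximate number of pixels spanned by a fronto-parallel object
    centred in the view (pinhole camera model)."""
    focal_px = (sensor_pixels / 2.0) / math.tan(math.radians(fov_deg / 2.0))
    return focal_px * object_size_m / distance_m

for distance in (1.0, 2.5, 5.0):
    print(f"20 cm object at {distance} m: ~{pixels_on_object(0.2, distance):.0f} px across")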
A.5 Commercial Steerable Projectors

Although there is a range of thousands of moving display lights, there are currently only four commercial steerable projector systems known to be in production. These systems are all designed to be used as display lights; hence the majority do not include a camera with the system.

Table A.1 Four commercial steerable projectors
• Publitec Beammover - £9,500 for a 4100 Lumen steerable projector (based on the Sanyo XP-46 projector). Publitec are based in Germany. http://www.beamover.com/
• High-end Digital Light Range DL.1 to DL.3 - £15,000 for a 4100 Lumen steerable projector (based on the Sanyo XP-46 projector). The DL.2 and DL.3 projectors include an attached camera (768x494 pixel resolution with powered focus and zoom). High-end are based in the USA. http://www.highend.com/products/digital_lighting/
• High-end "Orbital One" Steerable Mirror Head - $14,000. A "bolt-on" moving mirror attachment for projectors. http://www.highend.com/products/digital_lighting/orbitalhead.asp
• Active Vision System AV-4 - based on a 3000 Lumen projector. Active Vision are based in Japan. http://www.activevision.jp/english/index.html

A.6 Construction of Steerable Projector Systems

An explicit list of goals and constraints for steerable projector systems was developed, based on the dimensions of the lab environment described in section 6.3 and the important characteristics identified in sections A.2-A.4:
• Fixed mounting location with a field of view at least equivalent to a hemisphere around the steerable projector, able to project an interface on any surface specified in the area's volume.
• Fast positioning system, able to move a projected interface 180° in both pan and tilt axes in 1 second or less.
• Stable positioning system.
• Angular positioning accuracy of 0.2° or higher.
• Accurately and repeatably position projected interfaces to within 1cm of the specified location in the area, at a range of 2.8m (the average distance from the centre of the area to the walls).
• Able to project a focussed image to all locations in the area, visible under normal room illumination levels.
• Able to capture colour images of the projected image at a minimum of 640x480 pixel resolution and at video frame rates.
• Cost under £10,000.
These goals are a combination of hardware and software characteristics which, when combined, it was hoped would create a viable steerable projector system.

A.7 Steerable Projector Hardware

To achieve these goals, steerable projector components meeting the criteria above were specified and purchased. As part of the specification process, we used the characteristics identified in sections A.2-A.4.

A.7.1 Projector Selection

We required the projectors to be visible under typical office fluorescent lighting. An average brightness of around 3000 Lumens was found to work well in everyday room environments by related work [Butz, Schneider et al. 2004; Ehnes, Hirota et al. 2004]; however, due to the ambient light levels that would be encountered, only projectors with a contrast of 500:1 or greater were short-listed. XGA projectors (1024x768 pixels) were selected as this is the most common resolution for business graphics and is available at 3000 Lumens. Higher resolution projectors are much more expensive, and lower end models generally exhibit less than 2000 Lumens brightness. Computer controlled focus and zoom capability was specified as a requirement due to the limited depth range in which a projector image appears in focus. This allows projection on varying sizes of objects and at different distances. The lens location was a major consideration, as it impacts the selection of the type of pan and tilt yoke for the moving head steerable projector system. There was only one projector with a centre lens and the required brightness and resolution specifications - the Sanyo PLC-XP45. In contrast, the choice of projectors with an offset lens was much greater, hence these were short-listed.
For moving head steerable projectors the projector weight is a major concern, as pan and tilt platforms have a limited maximum mass they can rotate. Current LCD projector optical designs trade off brightness against weight; DLP technology, however, enables a smaller optical path and hence bright projectors in a small, light form factor. The cost of a projector was limited to a maximum of £4,000 to allow for the purchase of other items such as the pan and tilt yoke and cabling. At the time of purchase, two projectors met the specifications defined above. These specifications are summarised in Table A.2.

Table A.2 Projector Specifications
Projector      LCD/DLP   Brightness (Lumens)   Contrast   Resolution   Powered Zoom and Focus   Weight (kg)   Cost
NEC MT1065     LCD       3200                  800:1      XGA          Yes                      5.9           £3,500
Casio XJ-450   DLP       2800                  1000:1     XGA          Yes                      2.7           £3,000

Both projectors were eventually purchased, with the heavier NEC MT1065 used for the moving mirror steerable projector and the Casio XJ-450 used for the moving head design.

A.7.2 Steering Mechanism Selection

Both single arm and dual fork yokes (as shown in Figure A.5) have built-in stepper motors to pan and tilt the moving head. The yokes are capable of absolute positioning via the DMX control protocol. The centre of rotation of a dual fork yoke is at the centre of the yoke arms, whereas for the single arm yoke it is in line with the central bearing. Borkowski et al. provide construction details for a dual fork pan and tilt mounting with an offset pan rotation location [Borkowski 2006], allowing offset lens projectors to be used. However, this requires workshop facilities, and Borkowski admits with hindsight that he would recommend purchasing an off-the-shelf unit rather than building one. In contrast, while the single arm moving head solution allows offset lens projectors to be mounted, the single mounting point places a greater mechanical strain on the arm and tilt stepper motor. Consequently, the projector must be kept as light as possible.

Figure A.5 (left) Futurelight PHS150 Single arm Yoke, (right) Compulite "Luna" dual fork Yoke

The specifications of the single-arm Futurelight PHS-150 (purchased to be used with the Casio XJ-450 projector) are summarised in Table A.3.

Table A.3 Yoke Selection Specifications
Yoke: Futurelight PHS-150
Field of View/Coverage: 540° Pan, 265° Tilt
Rotation Speed: Not published
Control Resolution: 16-bit
Angular Resolution: 0.008° Pan, 0.004° Tilt
Size, Weight: 0.051 m³, 13 kg
Cost: £570

A.7.3 Camera Selection

Industrial machine vision cameras offer good performance and high quality output due to large sensors, which gather a lot of light, and high quality lenses with low optical distortion. For the moving head steerable projector we selected a PixeLink A742 colour CMOS camera with 1280x1024 pixel resolution, capable of 27 frames per second (fps) at full resolution and up to 104fps with a 640x480 region of interest enabled. The camera uses FireWire for data transfer and has a fixed focal-length 12mm C-Mount lens with a 40x30° field of view (FOV). For the demonstration applications in this work, the camera is used near its maximum aperture (f1.8) with the focus fixed at 2.5m. This gives acceptable focus between 1.5m and 5m from the camera, while typically allowing exposure lengths under 40ms with the office lighting.
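As a sanity check of this lens choice, the sketch below estimates the area covered by the 40x30° lens at the 2.5m fixed focus distance, again using a pinhole model.

import math

def view_extent_m(distance_m, fov_deg):
    """Linear extent covered by one axis of the camera field of view at a
    given distance (pinhole model)."""
    return 2.0 * distance_m * math.tan(math.radians(fov_deg / 2.0))

print(f"Coverage at 2.5 m: {view_extent_m(2.5, 40):.2f} m x {view_extent_m(2.5, 30):.2f} m")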
Additionally, a Logitech Sphere pan and tilt web camera was selected for use with the moving mirror system. This has 320x240 pixel resolution with a built-in lens and is capable of 30fps. The computer controlled mechanical pan and tilt design has a 180° pan Field Of View (FOV) and a 60° tilt FOV. No aperture or focus control is available.

A.8 Moving Head Steerable Projector Construction

Following purchase of the steerable projector hardware components, there were six steps required to create an operational moving head steerable projector system:
1. Remove the disco light head from the pan and tilt yoke.
2. Remove the light control circuitry and wiring from the yoke arm and base.
3. Re-wire the yoke with power, data and camera cables.
4. Fabricate a bracket to attach the projector and camera to the yoke.
5. Attach the bracket and connect the cabling.
6. Mount the whole system in the space where it is intended to be used.

Re-wiring the Yoke

Figure A.6 The pan and tilt yoke uncovered

When purchased, the yoke had control wiring running up from the base, through the arm, to the head for controlling the disco light lamp, colour filters and motors. All control wiring for the head was removed with the exception of five wires, retained for controlling the projector. Power and data cables for the projector and the camera FireWire cable were cut then threaded through the yoke body and arm in place of the removed wiring. The cut cables were then re-soldered inside the yoke base. As there are no slip-rings for transferring the power and data through the rotating bearings, the cables pass through central holes in the bearings to the projector attachment point. Consequently, as the cables will twist as the yoke pans and tilts, a loose loop of each cable was left in the base of the unit to reduce the strain from large movements (i.e. winding up). An overview of the steerable projector cabling can be seen in Figure A.7.

Figure A.7 Steerable Projector Cabling Layout

Projector Bracket Manufacture

The original pan and tilt yoke had the moving lamp head attached by five machine screws. When this was removed, a bracket had to be fabricated to attach the projector in place of the light. First, a paper template of the underside of the projector was created, which had the projector mounting screw holes and cooling air intake holes marked. Then 3mm thick aluminium sheet was purchased and cut to size based on the template. Small holes were drilled in the bracket to aid airflow to the ventilation holes on the underside of the projector. The bracket side was then folded to an exact 90° angle using a metal forming machine in the Lancaster University Engineering Department. Finally the bracket was attached to the pan and tilt yoke and the projector mounted to the bracket in its upside-down ceiling projection mode. An extra strip of aluminium was added running around and below the projector on which to mount the camera, and to give the bracket more strength. During mounted trials of the bracket and projector, the bracket appeared more flexible along the 90° fold line than was envisioned, so a second bracket was designed with a back plate to ensure the length of the bracket stayed at a 90° angle.
The rear lip of the bracket was further turned up to provide increased stiffness and reduce the possibility of the projector weight bending the bracket down along its length. This redesigned bracket performs better; however, the projector still appears to "droop" slightly, with the side of the projector furthest from the yoke attachment point a couple of millimetres below the attachment side. In practice, the slightly rotated projector image this causes is not particularly visible, and the rotation does not affect object detection or projection accuracy due to the rigidly attached camera. A dual fork system would remove this problem altogether by supporting both ends of the projector equally.

Figure A.8 (left) The fabricated projector bracket attached to the yoke, (centre) FLUIDUM room projector mounting direct to concrete [Butz, Schneider et al. 2004], (right) Wooden bracing of the four threaded rods from which the projector is hung

Installation Site Constraints

As the projector was required to have a full 360 degree view of the lab environment, ceiling mounting in the centre of the area was the best mounting location. Two mounting possibilities for the steerable projector system presented themselves:
1. Mount the system directly on the concrete ceiling, which would be the most stable option.
2. Mount the system hung lower, so that only the moving arm hangs below the false ceiling.
The first option (while preferable for stability) would have placed the projector above the false ceiling, requiring a very large hole to be cut in the ceiling to give the projector a clear line of sight to the walls, so it was decided to mount the body of the pan and tilt yoke just above the false ceiling and cut a small neat hole in the tiles for the arm to hang down through. Typical mounting methods for pipes and ducting above false ceilings rely on attachment to threaded rods hanging down from the concrete ceiling, fixed with shield anchors into the concrete, into which the threaded rod screws. This mounting method was also chosen for the steerable projector, with four rods - one at each corner of the pan/tilt base. The completed and mounted steerable projector can be seen in Figure A.9 (left).

Bracing

The projector and yoke are a dynamic system when the projector is moving, with projector inertia exerting a large force via the arm and stepper motor on the mountings. The steerable projector system approximates a heavy object on the end of long thin rods, acting similarly to a pendulum. As the projector was mounted horizontally in landscape mode, moving in the pan axis applies a large torsional force to the mountings due to a large portion of the projector weight being off-centre. Replacing the rods with a metal box-like structure with diagonal bracing struts would make the mounting more stable, but would require commercial fabrication (most likely at great expense). Consequently, an attempt was made to brace the whole structure with wood to provide more stability and reduce movement of the mounting structure in response to projector movement, as shown in Figure A.8 (right).
A.9 Moving Mirror Steerable Projector Construction

A moving mirror steerable projector has a more limited optical design than a moving head projector, with a smaller field of view (FOV) and increased optical distortion caused by rotation in the plane of the image when the mirror is panned. The closer the mirror is to the projector lens, the smaller it can be while still reflecting a full image. However, the closer it is to the lens, the smaller the attainable field of view, as the projector body itself blocks large areas that the mirror could otherwise point to. Hence, a balance must be found between mirror size, distance to the projector and attainable field of view. The motion mechanism is required to rotate the mirror independently in the two dimensions of pan and tilt; consequently, systems similar to the dual-fork yoke design (above) are often used. By mounting a mirror inside the yoke, these designs typically reach a 180° pan and 60° tilt FOV. A first-surface mirror (with the silvering on the front of the glass rather than the rear) can also be used, giving less distortion and avoiding the ghost images seen with ordinary rear-surface mirrors. However, front surfaces are extremely fragile, as any contact with other objects can easily scratch the silvering (even cleaning with a lens cloth).

The construction of the moving mirror steerable projector required seven steps:

1. Fabricate a bracket to attach the mirror pan and tilt yoke to the projector.
2. Fabricate a dual-fork pan and tilt yoke incorporating pan and tilt servos.
3. Cut the first-surface mirror to size, to allow pan rotation without fouling.
4. Mount the mirror onto the yoke.
5. Connect the projector, servo control and USB camera cables, as shown in Figure A.10.
6. Mount the camera on the projector.
7. Mount the whole system in the space where it is intended to be used.
The constructed system was mounted on a display light tripod, with the projector 1.8m from the ground, as seen in Figure A.9 (centre). This allows some portability, as the tripod can easily be moved around in the environment.

Figure A.10 Moving Mirror Steerable Projector Cabling (the controlling PC connects via an RS-232 serial cable to a servo controller and servo cables to the pan-tilt servos, via an analogue VGA cable carrying a 60Hz XGA image to the NEC MT-1065 projector, and via a USB 2.0 cable carrying a 30Hz 320x240 image from the Logitech Sphere camera)

A.10 Characterisation of Moving Head Mechanism

The basic specifications of the moving head steerable projector components were known from published literature. However, following construction, the mechanical performance of the combined system was unknown. We use the moving head steerable projector for the experiments and demonstration applications presented in this thesis; hence, the steering mechanism and projector characteristics (A.2-A.3) are evaluated. To characterise the moving head system, a series of six experiments was performed to evaluate movement speed, field of view, positioning accuracy, positioning repeatability, positional stability and mechanism acceleration response.

A.10.1 Design

The Movement Speed evaluation consists of two experiments. The purchased yoke has 220 possible rotational speed settings; however, no published information could be found on the corresponding angular speeds of rotation. Five separate maximum-speed 360° rotations in the pan axis and five separate 180° rotations in the tilt axis were timed and mean averaged. This experiment was then repeated with the yoke set to minimum speed, and the five recorded times were again mean averaged.

The addressable Field of View (FOV) was measured by moving the projector at minimum speed through five separate 360° rotations in the pan axis and five separate 180° rotations in the tilt axis, while measuring the difference in DMX control values. The formula used to calculate the addressable field of view is:

FOV Range (°) = (Total decimal DMX control range / Measured difference in DMX control values) × Angle moved    (A.1)

The Positioning Accuracy of the yoke was measured as angular resolution, calculated as the angular field of view measured above divided by the addressable DMX range.

To measure Positioning Repeatability, three separate pan and tilt values were chosen, equivalent to aiming the projector at the centre of the North, East and South walls in the Experimental Systems Lab. A white cross was projected in the centre of the projector's image, and lines were drawn on the three walls matching the cross. The projector was then moved five times, panning 90° in a random direction and tilting 180° down (it could not be tilted up, as the yoke cannot physically move that far), then returned to the original values aimed at the centre of the wall.

The Positional Stability was measured as the time it took for the whole steerable projector image to cease moving following arrival of the yoke at the commanded position after a 180° rotation (chosen to allow the yoke time to reach full speed). Ideally this figure would be close to zero, indicating a stable system with stiff mountings. Five 180° pans were performed at the yoke's slowest movement speed, and the time taken for movement to cease on arrival at a pre-marked location was recorded. Similarly, five 180° tilt movements were performed at the yoke's slowest speed.

The Mechanism Acceleration Response is difficult to measure without position feedback or accelerometers mounted on the hardware. Similarly, inertia is difficult to calculate, due to the difficulty of integrating the masses of the projector, bracket and arm in three dimensions. Hence, the projector movement and structural response were observed visually during the 90°, 180° and 360° pan rotations and 180° tilt rotations performed for the characterisations above.
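To make the field-of-view calculation concrete, the following is a minimal sketch of Equation A.1 and the angular resolution derived from it. The input values are placeholders chosen for illustration, not the actual measured DMX differences.

    /**
     * Sketch of the addressable field-of-view calculation of Equation A.1 and
     * the angular resolution derived from it. Input values are placeholders.
     */
    public class FovCalculation {

        /** Equation A.1: scale the angle physically moved up to the full DMX control range. */
        static double addressableFov(double totalDmxRange, double measuredDmxDifference, double angleMovedDegrees) {
            return (totalDmxRange / measuredDmxDifference) * angleMovedDegrees;
        }

        public static void main(String[] args) {
            double totalDmxRange = 65536.0;      // combined 16-bit high/low byte position channel
            double measuredDifference = 43700.0; // placeholder DMX difference observed over the move
            double angleMoved = 360.0;           // one full pan rotation of the projected cross

            double fovDegrees = addressableFov(totalDmxRange, measuredDifference, angleMoved);
            double resolution = fovDegrees / totalDmxRange; // degrees per addressable DMX step

            System.out.printf("Addressable pan FOV: %.1f degrees%n", fovDegrees);
            System.out.printf("Angular resolution: %.4f degrees/step%n", resolution);
        }
    }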
A.10.2 Procedure

For measurements of 90°, 180° or 360° rotations around the pan axis, the following procedure was used:

1. The projector was used to project a large cross, 1 pixel wide, in the centre of its display.
2. The projector was rotated until horizontal, and perpendicular to the centre of the South wall for 180° and 360° movements, or the East wall for 90° movements.
3. The initial DMX setting was noted.
4. The wall was marked in pencil with a line at the location of the vertical line of the projected cross.
5. A destination line was drawn in the centre of another wall at the required rotation. No mark was required for 360° movements.
6. The projector was rotated until the projected line was aligned with the destination mark.
7. The final DMX setting was noted.

For measurements of 180° rotations around the tilt axis, the following procedure was used:

1. The projector was used to project a large cross, 1 pixel wide, in the centre of its display.
2. The projector was rotated until horizontal, and perpendicular to the South wall.
3. The initial DMX setting was noted.
4. The height from the ceiling to the horizontal line of the cross was measured.
5. The North wall opposite was marked with a pencil line at the same height below the ceiling (we assume the ceiling is horizontal).
6. The projector was rotated until it had moved through a full 180° and the projected line was aligned with the mark on the North wall.
7. The final DMX setting was noted.

A.10.3 Results

Table A.4 Moving Head Steerable Projector Characterisation

Characteristic                               Result
Minimum Movement Speed                       19°/s Pan, 13°/s Tilt
Maximum Movement Speed                       180°/s Pan, 180°/s Tilt
Field of View                                539.5° Pan, 253° Tilt
Positioning Accuracy (Angular Resolution)    0.016° Pan, 0.004° Tilt
Positioning Repeatability                    <1mm error at 2.84m
Positional Stability for Pan                 6s at min speed, 5s at mid-speed, 5s at max speed
Positional Stability for Tilt                <1s at min speed, <1s at mid-speed, 1s at max speed

A.10.4 Discussion

Field of View / Coverage

Although the published FOV values were 540° pan and 265° tilt, the mean averages of the calculated values were 539.5° and 253.0° respectively (to the nearest 0.5°), as can be seen in Table A.4.

Positioning Accuracy

The yoke control system maps a 0° rotation to a 16-bit DMX control value of 0 high byte, 0 low byte. In the pan axis the measured 539.5° FOV is mapped to values of 255 high byte, 255 low byte, while in the tilt axis the measured 253.0° is mapped to values of 255 high byte, 255 low byte. While performing this characterisation of the positioning accuracy, it was discovered that, in reality, the claimed 16-bit pan positioning accuracy of the yoke only resolves physically to 15 bits. On the low byte, a decimal value greater than 131 overlaps into the next high byte range: for example, the position of the yoke effectively moves from (0, 131) to (1, 0) instead of (0, 132).
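A minimal sketch of the resulting angle-to-DMX mapping is given below. It assumes the linear mapping just described (0° at high/low bytes 0,0 and the full measured FOV at 255,255); the method names are illustrative and this is not the thesis implementation.

    /**
     * Sketch of converting a desired angle into the 16-bit DMX position value
     * (high/low byte pair), assuming a linear mapping from 0 degrees at (0,0)
     * to the full measured FOV at (255,255).
     */
    public class DmxPositionMapping {

        /** Convert an angle in degrees to a 16-bit DMX position value for an axis with the given FOV. */
        static int angleToDmx(double angleDegrees, double axisFovDegrees) {
            double clamped = Math.max(0.0, Math.min(axisFovDegrees, angleDegrees));
            return (int) Math.round((clamped / axisFovDegrees) * 65535.0);
        }

        public static void main(String[] args) {
            double panFov = 539.5; // measured pan FOV from Table A.4
            int dmx = angleToDmx(270.0, panFov);
            int highByte = (dmx >> 8) & 0xFF;
            int lowByte = dmx & 0xFF;
            // Note: as discussed above, the pan axis was found to resolve physically to only
            // 15 bits, so consecutive 16-bit values do not always produce distinct positions.
            System.out.printf("270.0 degrees -> DMX %d (high %d, low %d)%n", dmx, highByte, lowByte);
        }
    }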
Positional Stability

As shown in Table A.4, at the slowest pan speed the stabilisation time was found to be an average of 6 seconds, due to the projector inertia causing oscillation in the steering arm and mountings. Interestingly, however, the time required for stabilisation decreased when the yoke was panned faster. This suggests either that some structural resonance is created by the slower-speed pans, or that the pan stepper motor overshoots and hunts for the correct position. For tilt, in contrast, there was little oscillation at any movement speed; in fact, at the slowest and half tilt speeds the stabilisation time could not be accurately measured, as it was less than a second.

Mechanism Acceleration, Inertia and Response Time

By subjective observation of the system, the current mounting of the projector appears to exhibit high inertia, as there is a large structural response to start and stop movement events in pan. This is reflected in the positional stability results as the long time taken for the projector to stop oscillating following a pan movement.

A.11 Characterisation of Projector

Although the published specifications give most relevant detail about a model of projector, individual projectors can vary due to manufacturing tolerances. Consequently, the purchased projector was characterised.

Useful zoom range

Printed text typically has a resolution between 150 and 600 dots per inch (dpi), whereas a computer monitor is around 72 dpi. With a 1024x768 pixel resolution projector, a 72dpi projection would give a tiny final image of approximately 14.2 by 10.7 inches. However, projectors are used primarily for their ability to display large images, rather than high resolution images. Acceptable resolution depends on viewing distance: displays viewed from larger distances subtend a smaller angle at the eye, and hence subjectively appear higher resolution than the same display viewed at close range. Ways to boost the resolution of projected displays (apart from using a higher resolution projector), using multiple superimposed projected displays and super-resolution techniques, are demonstrated by Majumder and Welch [Majumder and Welch 2001].

The Casio XJ-450 projector has power zoom range settings of 0-40. When projecting perpendicular to a display surface at a distance of 2.89m, zoom settings above 20 produced images larger than 1.3 by 1m, but at low resolution with pixel sizes greater than 1mm². This pixelation was very visible and occasionally distracting when viewed from a 5m distance. As the majority of displays would be viewed from less than 5m in the demonstrations in Chapter 7, it was decided to limit the projector to zoom settings of 20 and below for interactive use. This provides a reasonable balance in the trade-off between keeping projected pixel sizes as small as possible (hence, display dpi as high as possible) and maximising projection area.

Display Size versus Distance

It is useful to know what size and resolution can be expected from a display at a given distance, as this determines what size objects can be projected on. Two zoom settings were measured: zoom 0 (smallest image possible) and zoom 20 (the largest image used interactively). The width and height of an image displayed on a planar surface perpendicular to the projector were measured at a number of distances for both zoom settings.
The results shown in Figure A.11 can be used in our Cooperative Augmentation architecture to calculate which objects are in the projector FOV, as discussed in section 6.3.

Figure A.11 (left) Image Size versus Distance at Zoom 0, (right) Image Size versus Distance at Zoom 20 (mid-zoom). Linear fits to the measurements (x = distance in m from the centre of rotation to the perpendicular planar surface, y = image size in mm): at zoom 0, width y = 358.27x - 16.893 and height y = 269.47x - 13.787; at zoom 20, width y = 488.87x - 23.732 and height y = 369.47x - 22.376.

Image resolution at the centre of a display perpendicular to the projector was also measured at varying distances for both zoom settings, as shown in Figure A.12.

Figure A.12 (left) Display resolution versus distance to projection surface at Zoom 0, (right) Display resolution versus distance to projection surface at Zoom 20. Power-law fits to the measurements (x = distance in m, y = resolution in dpi): y = 75.914x^-1.0277 at zoom 0 and y = 55.581x^-1.0259 at zoom 20.

Focus versus Distance

Most video projectors have a good depth of field. It was found empirically that the Casio XJ-450 has an approximately 1m depth of field, within which an acceptably focussed display will appear. For display on the walls of the eastern end of the Experimental Systems Lab, the focus can be set to a distance of 3m and the display will appear acceptably in focus everywhere except in the very corners of the area. However, for applications which are required to track moving objects or project on very oblique surfaces in front of the walls, dynamic focussing is required. The projector focus settings were therefore measured, again at the two zoom settings (zoom 0 and 20). The results shown in Figure A.13 are used in our Cooperative Augmentation architecture for dynamic focus control, as discussed in section 6.3.2.

Figure A.13 (left) Focus steps versus distance to display surface at Zoom 0, (right) Focus steps versus distance to display surface at Zoom 20. Cubic fits to the measurements (x = distance in m, y = focus steps): y = 1.605x³ - 16.072x² + 53.237x - 24.373 and y = 1.23x³ - 13.226x² + 47.42x - 22.467 for the two zoom settings.
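A minimal sketch of how these fitted curves might be used for the display-size, resolution and focus calculations referred to in sections 6.3 and 6.3.2 is given below. The coefficients are those read off Figures A.11 and A.12 for zoom 0, plus one of the two fitted focus curves from Figure A.13; the class and method names are assumptions for illustration rather than the architecture's actual interface.

    /**
     * Sketch of using the fitted projector models above to estimate projected
     * display size, resolution and focus setting at a given distance.
     */
    public class ProjectorModel {

        /** Projected image width in mm at zoom 0 (linear fit from Figure A.11). */
        static double imageWidthMm(double distanceM) {
            return 358.27 * distanceM - 16.893;
        }

        /** Projected image height in mm at zoom 0 (linear fit from Figure A.11). */
        static double imageHeightMm(double distanceM) {
            return 269.47 * distanceM - 13.787;
        }

        /** Display resolution in dpi at zoom 0 (power-law fit from Figure A.12). */
        static double displayDpi(double distanceM) {
            return 75.914 * Math.pow(distanceM, -1.0277);
        }

        /** Focus step setting (one of the two cubic fits shown in Figure A.13). */
        static int focusSteps(double distanceM) {
            double x = distanceM;
            double steps = 1.605 * x * x * x - 16.072 * x * x + 53.237 * x - 24.373;
            return (int) Math.round(steps);
        }

        public static void main(String[] args) {
            double d = 2.5; // distance from the projector's centre of rotation in metres
            System.out.printf("At %.1fm: %.0f x %.0f mm, %.0f dpi, focus step %d%n",
                    d, imageWidthMm(d), imageHeightMm(d), displayDpi(d), focusSteps(d));
        }
    }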
A.12 Steerable Projector System Comparison

The constructed moving head steerable projector system was compared to other steerable projector-camera systems known to exist [Pinhanez 2001; Yang, Gotz et al. 2001; Borkowski, Riff et al. 2003; Butz, Schneider et al. 2004; Ehnes, Hirota et al. 2004]. However, many steerable projector systems do not have published performance measurements; hence characteristics have only been compared where data is available for a minimum of two systems. As can be seen in comparison Tables A.5 to A.7, the three aspects of steering mechanism, projector and camera were all compared. For the steering mechanism, our pan and tilt yoke compares favourably with the other systems. The greater pan FOV has benefits, as it potentially allows object tracking beyond the 360° position without the requirement for "unwinding". However, it is this large range, combined with the reduced 15-bit pan positioning accuracy (discussed in A.10.4), which gives a much lower pan angular resolution than the FLUIDUM and Hirose Lab systems.

Table A.5 Steerable Projector Steering Mechanism Comparison

Lancaster: Moving Head; DIY construction; FOV 540° pan, 265° tilt; maximum speed 180°/s pan, 180°/s tilt; positioning accuracy (angular resolution) 0.016° pan, 0.004° tilt; fixed, ceiling mounted; size (volume) 0.051m³; weight 13kg.
IBM: Moving Mirror; DIY construction; FOV 230° pan, 50° tilt; maximum speed not published; positioning accuracy not published; portable or fixed mounting; size and weight not published.
FLUIDUM, Saarland + Munich: Moving Head; purchased (Publitec Beammover); FOV 333° pan, 270° tilt; maximum speed not published; positioning accuracy 0.005° pan, 0.004° tilt; fixed, ceiling mounted; size and weight not published.
INRIA, Grenoble: Moving Head; DIY construction; FOV 354° pan, 90° tilt; maximum speed 146°/s pan, 80°/s tilt; positioning accuracy 0.11° pan, 0.18° tilt; fixed, ceiling mounted; size and weight not published.
Hirose Lab, Japan: Moving Head; purchased (Active Vision AV-4); FOV 360° pan, 240° tilt; maximum speed 80°/s pan, 80°/s tilt; positioning accuracy 0.01° pan, 0.01° tilt; portable, desktop mounted; size (volume) 0.13m³; weight 32kg.
PixelFlex, UNC: Moving Mirror; DIY construction; FOV, maximum speed and positioning accuracy not published; fixed, ceiling mounted; size and weight not published.

On paper the INRIA steering system does not appear particularly impressive, with its relatively low speed and effective 12-bit pan and 9-bit tilt positioning accuracy. However, the applications demonstrated by Borkowski et al. [Borkowski 2006] and the associated videos show it to be a very capable system. This suggests that a viable steerable mechanism need not be faster or more accurate than the INRIA system when used for relatively close-range projection applications. The Hirose Lab AV-4 system was classified as portable because Ehnes et al. demonstrated applications with the steerable projector sat on a desktop [Ehnes, Hirota et al. 2004]; despite this, their system is more than twice as heavy as the Lancaster system, so portability is subjective in this case.

Table A.6 Steerable Projector Video Projector Comparison

Lancaster: Casio XJ-450; brightness 2800; contrast 1000:1; XGA resolution; 2.4kg; auto focus and zoom.
IBM: Sharp XG-P10; brightness 3000; contrast 250:1; XGA resolution; 7.6kg; auto focus and zoom.
FLUIDUM, Saarland + Munich: Sanyo PLC-XP45; brightness 3300; contrast 800:1; XGA resolution; 8.4kg; auto focus and zoom.
INRIA, Grenoble: Epson Powerlite 730c; brightness 2000; contrast 400:1; XGA resolution; 2.0kg; no auto focus and zoom.
Hirose Lab, Japan: projector model not published; brightness 3000; contrast not published; XGA resolution; weight not published; auto focus and zoom.
PixelFlex, UNC: Proxima DP6850; brightness 1500; contrast 100:1; XGA resolution; 6.0kg; auto focus and zoom.

It can be seen in Table A.6 that the projector brightness, contrast and weight in steerable systems vary considerably. All systems have been built and installed in relatively small room-sized environments with (theoretically) controllable or limited natural light. Consequently, the INRIA and PixelFlex systems should not be handicapped at all by their lower brightness. In contrast, while the Lancaster projector is suitable for use, it could still benefit from higher brightness, as sunny days cause the projected display to be washed out and difficult to read in the lab environment. However, increasing brightness would increase projector weight and require a new pan and tilt yoke, or a change to a moving mirror steering mechanism. When comparing the camera characteristics in Table A.7, it can be seen that while the Lancaster camera system offers a higher resolution than the other systems, it is limited to a maximum frame rate of 27fps.
All other systems are capable of 30fps, with the exception of the INRIA system, which uses a 25fps PAL camera. Although the FLUIDUM camera is capable of very high resolution still images, it is handicapped for dynamic applications by its low 320x240 video resolution. However, this low resolution is offset somewhat by the 3x optical zoom present on the camera, which allows dynamic zooming to achieve reasonable video performance for small field of view applications (such as projected button interaction detection). The PixelFlex camera was only used as a room camera, with no steerability. However, it was positioned so that it had a full view of the possible projection cones of all the steerable projectors in the multi-projector array [Yang, Gotz et al. 2001].

Table A.7 Steerable Projector Camera Comparison

Lancaster: PixeLink A742; 1280x1024; no zoom; video capable at a maximum of 27fps; mounted on the steerable projector; not used with room cameras.
IBM: Sony EVI-D100; 768x494 NTSC; zoom; video capable; not mounted on the steerable projector (a separate pan and tilt camera); used with room cameras (user-following displays).
FLUIDUM, Saarland + Munich: Canon S40 digital camera; 4 Megapixel still images; zoom; video capable only at 320x240; mounted on the steerable projector.
INRIA, Grenoble: camera model not published; 720x576 PAL; video capable; mounted on the steerable projector.
Hirose Lab, Japan: Sony DFW-VL500; 640x480; zoom; video capable; mounted on the steerable projector.
PixelFlex, UNC (North Carolina): camera model not published; 720x480 NTSC; video capable; not mounted on a steerable projector (this was a room camera).

A zoom capability on the Lancaster camera would also increase flexibility, allowing both a close-up view for interaction detection with projected images and a large FOV for tracking of large objects in the lab area. Currently the projector image covers only a small portion of the camera image, further reducing the usable resolution of the image.

A.13 Conclusion

Two steerable projector-camera systems were successfully constructed: a moving mirror and a moving head system. In this appendix we identified characteristics that are important to consider when purchasing or constructing a steerable projector-camera system. Finally, we evaluated the characteristics of the steering mechanism and projector in the constructed moving head system. This steerable projector hardware is used for the experiments and demonstration applications presented in this thesis.

Appendix B Smart Object Programming Examples

The state machine model used in the smart object enables a developer to define both relationships between objects and the states of objects in which events should occur. The three examples below illustrate the considerations required when programming the state machine to use the object input modalities defined in section 6.2.4 for interaction.

B.1 Rough Handling Detection

Objects in a warehouse use embedded accelerometers to detect rough handling and warn employees of possible damage. If rough handling is detected, they request a projection on their surfaces asking an employee to check them for damage. In this case objects can use either the 1st input modality (location and orientation) or the 4th modality (other physical manipulation of the object) to determine when to project the message. The programmer must decide here whether to use camera-based location information, which may detect rough handling less accurately and is only available when the object is visible, or embedded sensors such as accelerometers, which may be more accurate at detecting rough handling but consume more power. Figure B.1 shows a diagrammatical representation of this program, designed using four states, an embedded accelerometer to detect rough handling, and two projections.
An employee is required to confirm that they have seen the warning message and checked the object for damage by clicking an interactive button in the second projection. The typical state transition sequence is shown by the arrows; however, the state can change to "During Rough Handling" at any point, as the limits are set to include all possible values for the button and projection variables. To test for state transitions in this application, the results of all sensor operations are combined using a boolean AND.

Figure B.1 State Machine Program for Detecting Rough Handling of a smart object (states: No Rough Handling, During Rough Handling, Employee Checks and Confirms, Rough Handling Stops)

B.2 Smart Book

A smart book object allows mixing of traditional fixed printed media with dynamic projected content augmentations at runtime. The book itself is articulated and can be physically opened. Either force sensors or simple binary contact sensors (modelled as a switch) on each page sense when it is opened and which page is open. When the book is opened, it changes in both appearance and geometry. Tracking while object geometries change is a significant challenge for traditional vision-based detection systems. However, using the 2nd input modality (manipulating the geometry), the object detects each page-opening event and automatically updates the projector-camera system with a new appearance and geometry, so that tracking can continue uninterrupted. Figure B.2 shows a diagrammatical representation of this simple program. In this case force sensors are used on each page, and a simple boolean AND combines the sensor information and decides which state to enter. For a book, the typical state transitions are either being opened or closed at a random page, or flipping forwards or backwards through the pages. These typical transitions are again shown by the arrows. This type of smart book program is typical of many other objects which have either a single sensor or a large number of the same sensor. For example, the smart cup [Gellersen, Beigl et al. 1999] contains a single temperature sensor. If we want to display this temperature on the cup itself, we could use a similar program, where each value range of temperature (for example, each degree) is a separate state.

Figure B.2 State Machine Program for Articulated Smart Book with Force Sensors on Each Page (states: Book Closed, Book Open Page 1 to Book Open Page n)
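As an illustration of how such a state table might be expressed in code, the following is a minimal sketch loosely following the rough handling example: each state lists sensor conditions with minimum and maximum limits, combined with a boolean AND, and an action to perform on entry. The sensor names, thresholds and API are assumptions for illustration, not the actual smart object implementation.

    import java.util.List;
    import java.util.Map;

    /**
     * Minimal sketch of a sensor/operation/min/max/action state table in the
     * style of the examples above. Names and thresholds are illustrative.
     */
    public class RoughHandlingStateMachine {

        /** One row of a state table: a sensor reading must fall within [min, max]. */
        record Condition(String sensor, String operation, double min, double max) {
            boolean holds(double value) {
                return value >= min && value <= max;
            }
        }

        /** A state with its conditions (combined with AND) and the action requested on entry. */
        record State(String name, List<Condition> conditions, String action) {
            boolean matches(Map<String, Double> readings) {
                return conditions.stream()
                        .allMatch(c -> c.holds(readings.getOrDefault(c.sensor(), 0.0)));
            }
        }

        public static void main(String[] args) {
            List<State> states = List.of(
                    new State("No Rough Handling",
                            List.of(new Condition("accelerometer", "variance", 0, 800)),
                            "none"),
                    new State("During Rough Handling",
                            List.of(new Condition("accelerometer", "variance", 801, 1000)),
                            "request warning projection"),
                    new State("Employee Checks and Confirms",
                            List.of(new Condition("accelerometer", "variance", 0, 800),
                                    new Condition("button", "equals", 1, 1)),
                            "remove projections"));

            // Evaluate the table against a set of current sensor readings.
            Map<String, Double> readings = Map.of("accelerometer", 900.0, "button", 0.0);
            states.stream().filter(s -> s.matches(readings)).findFirst()
                    .ifPresent(s -> System.out.println("Enter state: " + s.name() + " -> " + s.action()));
        }
    }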
B.3 Smart Furniture Assembly

The smart furniture described in section 2.2.2 has embedded sensing to guide the purchaser through the task of assembly [Antifakos, Michahelles et al. 2002; Holmquist, Gellersen et al. 2004]. For this example we replace the signalling LEDs with projected displays to notify the user of the next task in the assembly sequence. A message about how to assemble each piece is projected in the right order when the pieces are moved together into the same location. Binary contact sensors (modelled as a switch) detect when each part is correctly assembled. In this example each object uses the 1st (location and orientation), 4th (other sensors on an object) and 7th (requesting information about other objects) input modalities to calculate its location and orientation relative to other objects and to determine with its sensors whether it is correctly assembled. Relative location detection is possible as all objects share the common world coordinate system. The objects request projections to guide the user in sequential assembly based on the pre-defined series of states shown in Figure B.3. The program requires that the later-assembled objects have explicit knowledge of the identity and sensor values of earlier objects, to ensure the correct assembly sequence by monitoring their sensors. For example, object 3 can monitor the switch states on objects 1 and 2 to know when they are assembled and hence when to project its display. However, in constrained scenarios such as this, the identity and capabilities of each part are fixed and known a priori by a designer; hence this requirement is not a limitation. In this case our furniture consists of three separate pieces which need to be assembled in sequential order. These pieces are each a smart object and use a simple boolean AND to combine the sensor information when deciding which state to enter. The typical state transitions are again shown by the arrows.
Figure B.3 Smart Furniture Assembly Program for 3 Individual Pieces

B.4 Smart Cooking Object Model

This section accompanies the Smart Cooking demonstration presented in Chapter 7. Here we present the content model states for each of the objects (egg box, pan, salt and stove), grouped by the recipe state variable and arranged in recipe state order 0-12 in Figures B.4 to B.8.
Figure B.4 Hard-Boiled Egg Recipe States 0-2 (egg box and pan)

Figure B.5 Hard-Boiled Egg Recipe States 3-5 (salt, pan and stove)

Figure B.6 Hard-Boiled Egg Recipe States 6-8 (pan and stove)
Figure B.7 Hard-Boiled Egg Recipe States 8-10 (pan and stove)

Figure B.8 Hard-Boiled Egg Recipe States 10-12 (pan)
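As an illustration of how one of these recipe states might be encoded in the same sensor/min/max/action style, the following is a minimal sketch based on the pan's "Boil Water" and "Boiling Water" states from Figure B.6. The representation, the sensor names and the API are assumptions for illustration, not the thesis implementation, and the question of which object senses the boiling temperature is simplified here.

    import java.util.List;
    import java.util.Map;

    /**
     * Sketch of two of the pan's recipe states from Figure B.6, expressed in the
     * same condition-table style as the other Appendix B examples.
     */
    public class PanRecipeStates {

        record Condition(String sensor, double min, double max) {
            boolean holds(double value) { return value >= min && value <= max; }
        }

        record RecipeState(String name, List<Condition> conditions, String action) {
            boolean matches(Map<String, Double> readings) {
                // All conditions are combined with a boolean AND, as in the figures.
                return conditions.stream()
                        .allMatch(c -> c.holds(readings.getOrDefault(c.sensor(), Double.NaN)));
            }
        }

        public static void main(String[] args) {
            RecipeState boilWater = new RecipeState("8 - Boil Water",
                    List.of(new Condition("recipe", 6, 6),              // recipe step variable
                            new Condition("force", 30, 9999),           // pan contains at least 30g
                            new Condition("distanceToStove", 0, 0.25)), // pan is on or near the stove
                    "project water temperature, wait for boil");

            RecipeState boiling = new RecipeState("9 - Boiling Water",
                    List.of(new Condition("recipe", 6, 6),
                            new Condition("force", 30, 9999),
                            new Condition("distanceToStove", 0, 0.25),
                            new Condition("temperature", 100, 100)),    // boiling temperature reached
                    "project water temperature, send recipe step 7");

            Map<String, Double> readings = Map.of(
                    "recipe", 6.0, "force", 850.0, "distanceToStove", 0.1, "temperature", 100.0);

            // Check the more specific state first, since its conditions are a superset.
            for (RecipeState s : List.of(boiling, boilWater)) {
                if (s.matches(readings)) {
                    System.out.println("Active state: " + s.name() + " -> " + s.action());
                    break;
                }
            }
        }
    }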