AIX 5L Internals
Student Guide
Version 20001015
IBM Web Server
Knowledge Channel
Trademarks
IBM® is a registered trademark of International Business Machines Corporation.
UNIX is a registered trademark in the United States, other countries, or both, and is licensed
exclusively through X/Open Company Limited.
<<< list any other Trademarks used in the course materials >>>
July 2000 Edition
The information contained in this document has not been submitted to any formal IBM test and is distributed on
an “as is” basis without any warranty either express or implied. The use of this information or the
implementation of any of these techniques is a customer responsibility and depends on the customer’s ability
to evaluate and integrate them into the customer’s operational environment. While each item may have been
reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will
result elsewhere. Customers attempting to adapt these techniques to their own environments do so at their
own risk.
© Copyright International Business Machines Corporation 2000. All rights reserved. This document may not be
reproduced in whole or in part without the prior written permission from IBM. Information in this course is
subject to change without notice.
Web Server Knowledge Channel
Technical Education
Contents
Kernel Overview
Kernel Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-2
Kernel states . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-8
Kernel exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-10
Kernel Limits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-12
Kernel Limits Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-16
64-bit Kernel base enablement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-17
64-bit Kernel stack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-24
CPU big- and little-endian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-26
Multi Processor dependent designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-28
Command and Utility compatibility for 32-bit and 64-bit kernels . . . . . . . . . . . . . . . . 1-29
Exceptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-30
Interrupts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-33
Interrupt handling in AIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-35
Handling CPU state information at interrupt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-36
Handling CPU state information at interrupt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-37
IA-64 Hardware Overview
IA-64 Hardware Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-2
IA-64 formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-3
IA-64 memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-5
IA-64 Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-8
IA-64 Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-12
IA-64 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-19
Power Hardware Overview
Power Hardware Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-2
Power CPU Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-8
64 bit CPU Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-15
SMP Hardware Overview
SMP Hardware Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-2
Configuring System Dumps on AIX 5L
About This Lesson . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-3
System Dump Facility in AIX 5L . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-5
Configuring for System Dumps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-7
Obtaining a Crash Dump . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-16
Dump Status and completion codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-17
dumpcheck utility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-19
Verify the dump . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-21
Packaging the dump . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-26
Introduction to Dump Analysis Tools
About This Lesson . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-2
System Dump Analysis Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-6
dump components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-7
Dump creation process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-8
Component dump routines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-9
bosdebug command . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-10
Memory Overlay Detection System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-11
System Hang Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-14
truss command . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-20
KDB kernel debugger . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-23
kdb command . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-25
KDB miscellaneous sub commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-26
KDB dump/display/decode sub commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-29
KDB modify memory sub commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-33
KDB trace sub commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-36
KDB break point and step sub commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-38
KDB name list/symbol sub commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-42
KDB watch break point sub commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-43
KDB machine status sub commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-45
KDB kernel extension loader sub commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-47
KDB address translation sub commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-49
KDB process/thread sub commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-50
KDB Kernel stack sub commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-58
KDB LVM sub commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-60
KDB SCSI sub commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-62
KDB memory allocator sub commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-65
KDB file system sub commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-69
KDB system table sub commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-72
KDB network sub commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-77
KDB VMM sub commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-80
KDB SMP sub commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-86
KDB data and instruction block address translation sub commands . . . . . . . . . . . . 6-87
KDB bat/brat sub commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-89
IADB kernel debugger . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-90
iadb command . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-92
IADB break point and step sub commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-93
IADB dump/display/decode sub commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-96
IADB modify memory sub commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-100
IADB name list/symbol sub commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-105
IADB watch break point sub commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-106
IADB machine status sub commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-108
IADB kernel extension loader sub commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-110
IADB address translation sub commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-111
IADB process/thread sub commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-112
IADB LVM sub commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-114
IADB SCSI sub commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-115
IADB memory allocator sub commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-116
IADB file system sub commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-117
IADB system table sub commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-118
IADB network sub commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-119
IADB VMM sub commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-120
IADB SMP sub commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-122
IADB block address translation sub commands . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-123
IADB bat/brat sub commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-124
IADB miscellaneous sub commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-125
Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-127
Process Management
Process Management Fundamentals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-2
Process operations fork() system call . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-3
Process operations exec() system call . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-8
Process operations exec() system call . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-10
Process operations exit system call . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-12
Process operations, wait() system call . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-13
Kernel Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-16
Thread Fundamentals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-17
AIX Thread . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-19
Thread Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-21
Threads Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-22
Thread states . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-25
Thread Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-27
Process swapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-28
Thread Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-29
The Dispatcher . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-33
AIX run queues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-36
Process and Threads data structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-39
Process and Threads data structures addresses . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-43
What is new in AIX 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-48
Signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-50
Signal handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-51
Signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-53
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-57
Memory Management
Overview of Virtual Memory Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-2
Memory Management Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-3
Demand Paging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-5
Memory Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-7
Memory Object types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-8
Page Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-9
Page Not In Hardware Frame Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-11
Page on Paging Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-13
Loading Pages From The Filesystem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-17
Filesystem I/O . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-18
Free Memory and Page Replacement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-19
vmtune . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-21
Fatal Memory Exceptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-22
Memory Objects (Segments) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-23
Shared Memory segments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-25
shmat Memory Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-26
Memory Mapped Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-28
IA-64 Virtual Memory Manager
IA-64 Addressing Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-2
Regions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-3
Region Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-4
Address Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-5
Single vs. Multiple Address Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-7
AIX 5L Region Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-8
Memory Protection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-10
LP64 Address Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-14
ILP32 Address Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-15
Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-16
LVM
Logical Volume Manager overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-3
Data Integrity and LVM Mirroring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-12
LVM Striping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-15
LVM Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-17
Physical disk layout Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-21
VGSA structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-30
Physical disk layout IA-64 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-31
LVM Passive Mirror Write Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-36
AIX 5 LVM Hot Spare Disk in a Volume group. . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-40
LVM Hot spot management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-42
LVM split mirror AIX 4.3.3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-45
LVM Variable logical track group (LTG) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-46
LVM command overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-47
LVM Problem Determination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-48
Trace LVM commands with the trace command . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-51
LVM Library calls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-56
logical volume device driver LVMDD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-57
Disk Device Calls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-58
Disk low level Device Calls such as SCSI calls . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-60
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-61
Enhanced Journaled File System
J2 - Enhanced Journaled File System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-2
Aggregate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-3
Allocation Groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-7
Filesets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-8
Extents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-10
Binary Trees of Extents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-12
inodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-15
File Data Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-19
fsdb Utility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-23
Exercise 1 - fsdb . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-24
Directory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-27
Directory Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-31
Exercise 2 - Directories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-35
Logical and Virtual File Systems
General File System Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-2
Logical File System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-4
User File Descriptor Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-6
System File Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-7
Virtual File System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-9
Vnode/vfs interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-10
Vnodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-11
vfs and vmount . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-12
File and Filesystem Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-14
gfs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-15
vnodeops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-16
vfsops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-17
The Gnode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-18
Exercise 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-20
Lab Exercise 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-21
Lab Exercise 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-26
AIX 5L boot
What is boot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-2
Various Types of boot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-3
Systems types and Kernel images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-5
RAMFS and prototype files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-6
Boot Image Creation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-8
AIX 5L Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-10
Checkpoint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-11
Instructor Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-12
The Power Boot Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-13
Power boot disk layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-14
AIX 5L Power boot record . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-16
Instructor Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-20
Power boot images structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-21
RSPC boot image hints header . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-22
CHRP Boot image ELF structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-24
CHRP boot image ELF structure - Continued . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-25
CHRP boot image ELF structure - Continued . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-26
Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-27
Instructor Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-28
Power ROS and Softros . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-30
IPLCB on Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-31
Checkpoint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-33
Instructor Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-34
The IA-64 Boot Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-35
IA-64 boot disk layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-37
EFI boot manager and boot maintenance manager overview . . . . . . . . . . . . . . . . 14-39
EFI Shell Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-40
IA-64 Boot Loader . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-43
IA-64 Initial Program Load Control Block . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-44
Checkpoint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-45
Instructor Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-46
Hard Disk Boot process (rc.boot Phase I) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-47
Hard Disk Boot process (rc.boot Phase II) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-48
Hard Disk Boot process (rc.boot Phase III) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-49
CDROM Boot process (rc.boot Phases I, II and III) . . . . . . . . . . . . . . . . . . . . . . . . 14-50
Tape Boot process (rc.boot Phases I, II and III) . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-51
Network Boot process (rc.boot Phases I, II and III) . . . . . . . . . . . . . . . . . . . . . . . . 14-52
Common Boot process (rc.boot Phase III) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-53
Network boot $RC_CONFIG files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-54
The init process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-56
ODM Structure and usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-57
boot and installation logging facilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-63
Debugging boot problems using KDB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-65
Debugging boot problems using IADB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-67
Packaging Changes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-69
Checkpoint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-71
Instructor Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-72
/proc Filesystem Support
/proc Filesystem Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-2
Types of Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-4
The as File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-5
The ctl File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-6
The status File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-7
The psinfo file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-10
The map File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-11
The cred File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-13
The sigact File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-14
lwp/lwpctl file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-15
The lwp/lwpstatus File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-16
The lwp/lwpsinfo File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-19
Control Messages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-20
PCSTOP, PCDSTOP, and PCWSTOP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-21
PCRUN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-23
PCSTRACE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-25
PCCSIG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-26
PCKILL, PCUNKILL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-27
PCSHOLD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-28
PCSFAULT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-29
Directories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-34
Code Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-35
Unit 1. Kernel Overview
This overview describes the concepts used in the AIX 5L kernel.
What You Should Be Able to Do
After completing this unit, you should be able to:
• Identify major components of the kernel.
• Identify the major differences between AIX 5L and previous versions
of AIX.
• Determine what kernel to use.
• Determine what the kernel limits are.
• Find out if a thread is in user or kernel mode.
• Define the kernel address layout.
• Describe the steps the kernel takes in handling an interrupt or
exception.
Kernel Overview
Introduction
Up until AIX 5L, the kernel was a 32-bit kernel for the Power architecture only.
AIX Version 4.3 introduced 64-bit application enablement on Power, which
meant there was still a 32-bit kernel, but a 64-bit environment was
available through a kernel extension that performed the appropriate
remapping of 64-bit system calls. Now AIX 5L features both a 32-bit and a
64-bit kernel on Power systems, and a 64-bit kernel on the IA-64 architecture.
This overview describes the concepts used in the kernel in general and in
the 64-bit kernel specifically.
Kernel description
The kernel is the base program of the computer. It is an intermediary
between the applications and the computer hardware, so applications need
no specific knowledge of any kind of hardware. Processes, that is,
programs in execution or running programs, simply ask for a generic task
to be completed (like 'give me this file') and the kernel will go out and get it.
The kernel is the first and most important program on the computer. It can
access things other programs cannot. It can create and destroy processes,
and it controls the way programs run. The kernel balances resource usage
in order to keep every process and user happy.
Functions of the kernel
The kernel provides the system with the following functions:
• Create, manage and delete processes.
• Schedule and balance resources.
• Provide access to devices.
• Handle asynchronous events.
The kernel manages resources so they can be shared simultaneously
among many processes and users. Resources can be physical, like the
CPU, memory, or an adapter, or virtual, like a lock or a slot in the
process table.
Uniprocessor support
The 64-bit kernel is aimed at the high-end server environment and
multiprocessor hardware. As a result, it is optimized strictly for the
multiprocessor environment and no separate uniprocessor version is
provided.
Kernel Overview -- continued

64-bit vs. 32-bit kernel
The primary purpose of the 64-bit AIX kernel is to address the fundamental
need for workload scalability. This is achieved through a kernel address
space which is large enough to support increases in software resources.
The demands placed on the system software by customer applications will
soon outstrip the existing AIX 32-bit kernel because of the 32-bit kernel’s
limited address space. At 4GB, this address space is simply too small to
efficiently and/or effectively handle the amount of software resources
needed to support the projected 2001 workloads and hardware. In fact, a
number of software resource pools within the 32-bit kernel are now under
pressure from today's application workloads.
32-bit kernel lifetime
Customers have made and will continue to make significant investments in
32-bit RS/6000 hardware systems and need system software that protects
this investment. Thus, AIX also offers a 32-bit kernel. The RS/6000
software plan is to eventually drop support for the 32-bit kernel. However,
support will not be withdrawn before 2002, and only after the initial 64-bit
kernel release. This process is driven by end-of-life plans for 32-bit hardware
systems, as well as the fact that customers require a bridge period during
which both the 32-bit and 64-bit kernels are available for 64-bit hardware
systems and offer the same basic functionality. This period is needed to
ease migration to the 64-bit kernel.
Compatibility
Customers need system software that protects their investment in existing
applications and provides binary and source compatibility. AIX 5L will
therefore maintain support for existing 32-bit applications.
Kernel Overview -- continued

Kernels supported by hardware platform
The table below shows which kernels are supported on different systems.
In general, a 64-bit kernel and application can only run on 64-bit hardware,
but 64-bit hardware can execute 32- and 64-bit kernels and applications.
                 32-bit Power           64-bit Power          Intel IA-64
 32-bit Kernel   32-bit applications    32-bit applications   Not supported
 64-bit Kernel   Not supported (the     32-bit applications   32-bit applications
                 64-bit kernel is not   64-bit applications   64-bit applications
                 supported on 32-bit
                 CPUs)
Currently, there are four different CPU types in RS/6000 systems
(only the PowerPC 604e CPU is 32-bit).

 CPU            Type
 PowerPC 604e   32-bit
 Power3-II      64-bit
 RS64 II        64-bit
 RS64 III       64-bit

Binary compatibility and limitations
The 64-bit kernel offers binary compatibility for existing 32-bit and 64-bit
applications. However, this does not extend to the minority of applications
that are built non-shared or have intimate knowledge of internal details,
such as programs accessing /dev/kmem or /dev/mem. This is consistent
with the general AIX policy for these two classes of applications.
In addition, binary compatibility will not be provided to applications that are
dependent on existing kernel extensions that are not ported to the 64-bit
kernel environment. Only 64-bit kernel extensions will be supported. This
direction is taken to avoid the significant cost of providing 32-bit kernel
extension support under the 64-bit kernel, and is consistent with the
directions taken by other UNIX vendors such as SUN, HP, DEC and SGI.
On the plus side, this direction also forces kernel extensions to migrate to
the more scalable and strategic 64-bit environment (to better face the next
century).
Kernel Overview -- continued
Compatibility for kernel extensions
There is no change to the compatibility provided for 32-bit kernel
extensions under the 32-bit kernel. 64-bit kernel extensions will not be
supported under the 32-bit kernel.
Compatibility for system calls
One important aspect of binary compatibility involves the required
functional behavior of system call APIs when supplied invalid user
addresses. Under today's 32-bit kernel, this behavior differs in many ways
for 32-bit and 64-bit applications. For 32-bit applications, APIs return errors
(that is, the EFAULT errno) to the application if presented with an invalid
address. This behavior is due to the fact that all user space accesses
made under an API occur inside the kernel, under the protection of
kernel exception handling. For 64-bit applications, an invalid user address
causes a signal (SIGSEGV) to be sent to the application. This occurs
because structure reshaping is done in the supporting API libraries, and it
is the user mode library routine that accesses the invalid user (structure)
address.
Today's kernel behavior is preserved by the 64-bit kernel for 32-bit
applications but not for 64-bit applications, because the behavior for
64-bit applications under the 32-bit kernel will be changed and made
consistent with that now provided for 32-bit applications. This is done for a
number of reasons.
First, it is difficult to fully preserve the present behavior for 64-bit
applications. Reshaping is not required for these applications under the
64-bit kernel, so there will be no library accesses. Signals could be sent as
part of kernel exception handling, but it would be hard to produce the same
signal context as is seen under the 32-bit kernel.
Next, the functional behaviors of 32-bit and 64-bit applications should only
differ in places where there are fundamental application differences, like
address space layout. Introducing different behaviors in other places only
complicates matters for application writers.
Finally, both the errno and signal behaviors are allowable under the
standards, but the errno behavior offers a more friendly application
programming model.
In order to provide a consistent behavior across kernels and applications,
all structure reshaping is performed inside both kernels for both application
types.
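To make the errno behavior concrete, here is a minimal user-level sketch (an
illustration, not taken from the course materials) that hands a system call a
deliberately invalid buffer address; under the errno model described above, the
kernel's exception handling turns the bad access into a return code rather than
a SIGSEGV:

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char *bad = (char *)0x1;   /* deliberately invalid user address */
        ssize_t rc;

        /* write() must read the buffer from user space; with the errno
         * behavior described above, the kernel catches the bad access
         * and fails the call instead of signalling the process. */
        rc = write(STDOUT_FILENO, bad, 1);
        if (rc == -1 && errno == EFAULT)
            printf("write() failed with EFAULT: %s\n", strerror(errno));
        return 0;
    }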
Kernel Overview -- continued
Source compatibility
Source code compatibility is preserved for applications and 32-bit kernel
extensions. Consistent with general AIX policy, this extends to makefiles
(build mechanisms), but not to the small set of applications that rely upon
shipped header file contents that are provided only for use by the kernel.
Programs accessing /dev/mem or /dev/kmem serve as an example of such
applications.
32-bit vs. 64-bit kernel performance on Power
The 64-bit kernel is intended to increase scalability of the RS/6000 product
family and is optimized for running 64-bit applications on the upcoming
Gigaprocessor systems (Power4, which will be announced in 2001). The
performance of 64-bit applications running on the 64-bit kernel on
Gigaprocessor-based systems is better than if the same application was
running on the same hardware with the 32-bit kernel. This is because the
64-bit kernel allows 64-bit applications to be supported without requiring
system call parameters to be remapped or reshaped. The 64-bit kernel
may also be compiler-optimized specifically for the Gigaprocessor system,
whereas the 32-bit kernel may be optimized to a more general platform.
32-bit application performance on 32-bit and 64-bit kernels
The 64-bit kernel will also be optimized for 32-bit applications (to the extent
possible), because 32-bit applications now dominate the application
space and will continue to do so for some time. In fact, performance
trade-offs involving 32-bit versus 64-bit applications should be made in favor of
32-bit applications. However, 32-bit applications on the 64-bit kernel will
typically perform worse than on the 32-bit kernel, because system call
parameter reshaping is required for 32-bit applications on the 64-bit kernel.
64-bit application and 64-bit kernel performance on non-Gigaprocessor systems
The performance of 64-bit applications under the 64-bit kernel on
non-Gigaprocessor systems may be less than that of the same applications on
the same hardware under the 32-bit kernel. This is due to the fact that the
non-Gigaprocessor systems are intended as a bridge to Gigaprocessor
systems and lack some of the support that is needed for optimal 64-bit kernel
performance. In addition, efforts should be made to optimize 64-bit kernel
performance for non-Gigaprocessor systems, but performance trade-offs
are made in favor of the Gigaprocessor.
Kernel Overview -- continued
32-bit and 64-bit kernel extension performance on Gigaprocessor systems
The performance of 64-bit kernel extensions on Gigaprocessor systems
should be the same as or better than that of their 32-bit counterparts on the
same hardware. However, the performance of 64-bit kernel extensions on
non-Gigaprocessor machines may be less than that of 32-bit kernel extensions
on the same hardware. This follows from the fact that the 64-bit kernel is
optimized for Gigaprocessor systems.
Kernel characteristics
Since the kernel is a program itself, it behaves almost like any other
program. Its features are:
• Preemptable
• Pageable
• Segmented
• 64-bit
• Dynamically loadable
Preemptable means that the kernel can be in the middle of a system call
and be interrupted by a more important task. The preemption causes a
context switch to another thread inside the kernel.
Some parts of the kernel are pageable, which means they are not needed
in memory all the time and can be paged out to paging space.
Both the 32-bit kernel and the 64-bit kernel implement virtual address
translation by using segments. In previous versions of AIX, segment
registers were used to map segments to thread contexts; now segment
tables are used.
The kernel can be dynamically extended with extra functionality, as the
sketch below illustrates.
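As a hedged illustration of dynamic loading, the sketch below uses the AIX
sysconfig() interface to load a kernel extension. The extension path
/usr/lib/drivers/sample is hypothetical, the call requires root authority, and
this is a minimal sketch rather than the course's prescribed procedure:

    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/sysconfig.h>

    int main(void)
    {
        struct cfg_load load;

        memset(&load, 0, sizeof(load));
        load.path = "/usr/lib/drivers/sample";   /* hypothetical kernel extension */

        /* SYS_SINGLELOAD loads the extension only if it is not already loaded */
        if (sysconfig(SYS_SINGLELOAD, &load, sizeof(load)) != 0) {
            perror("sysconfig");
            return 1;
        }
        printf("extension loaded, kernel module id 0x%lx\n",
               (unsigned long)load.kmid);
        return 0;
    }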
Kernel states
Kernel system diagram
[Diagram: kernel system layers. At the user level, user programs and libraries
enter the kernel through the system call interface (a trap on Power). At the
kernel level, the file subsystem, with its buffer cache and block and character
device drivers, sits alongside the process control subsystem (inter-process
communication, scheduler, and memory management), both resting on the
hardware control layer. Below that is the hardware level: the hardware itself.]
Roughly there are three distinct layers:
• The user level
• The kernel level
• The hardware level
This diagram shows how the kernel is the interface between the user level
and the hardware. Applications live at the user level, and they can only
access hardware, like a disk or printer, through the kernel.
Process execution modes
Processes can run in two different execution modes: kernel mode and user
mode. These modes are also referred to as Supervisor State and Problem
State.
Kernel states -- continued

User mode protection domain
A process running in user mode can only affect its own execution
environment and runs in the processor's unprivileged state. In user mode,
a process has read/write access to the user data in the process private
segment and to the shared library data segment. It also has access to
shared memory segments using the shared memory functions. A process in
user mode has read access to the user text and shared library text
segments.
User mode processes can still use kernel functions by means of a system
call. Access to functions that directly or indirectly invoke system calls is
typically provided by programming libraries, which give access to
operating system functions.
Kernel mode protection domain
Code running in this mode has read/write access to the global kernel space
and access to kernel data in the process private segment when running
within the process context. Code in interrupt handlers, the base kernel, and
kernel extensions runs in kernel mode. If a program running in kernel mode
needs to access user data, a kernel service is used to do so. Programs
running in kernel mode can use kernel services, can access global system
data, are exempt from all security restraints, and run in the processor's
privileged state.
In short:
User mode or problem state:
• User programs and applications run in this mode.
• Kernel data and global structures are protected from access/
modification.
Kernel mode or supervisor state:
• Kernel and kernel extensions run in this mode.
• Can access or modify anything.
• Certain instructions limited to supervisor state only.
The kernel state is part of the thread state, so this information is typically
kept in the thread's Machine State (MST) area.
Kernel exercise
Exercise: figuring out thread state on Power
Look at the value of the Machine State Register (MSR) for the thread of
interest:

# echo "mst <thread slot>" | kdb | grep msr
iar : 0000000000009444    msr : A0000000000010B2    cr : 31384935

From /usr/include/sys/machine.h:

#define MSR_PR  0x4000  /* Problem state */
This means that if bit 15 of the MSR (counting from 1 at the least
significant bit) is set, the thread is running in user mode; that is, when the
fourth nibble from the right is 4, 5, 6, 7, C, D, E, or F.
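The same MSR_PR test can be applied programmatically. The following minimal
sketch (an illustration, not part of the original exercise) decodes the MSR
value captured above; its fourth nibble from the right is 1, so MSR_PR is
clear and the thread was in kernel mode:

    #include <stdio.h>

    #define MSR_PR 0x4000ULL        /* Problem state bit, from <sys/machine.h> */

    int main(void)
    {
        /* MSR value taken from the kdb output above */
        unsigned long long msr = 0xA0000000000010B2ULL;

        if (msr & MSR_PR)
            printf("MSR_PR set: user mode (problem state)\n");
        else
            printf("MSR_PR clear: kernel mode (supervisor state)\n");
        return 0;
    }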
Kernel exercise -- continued

Exercise: figuring out thread state on IA-64
Look at the value of the Interruption Processor Status Register (IPSR) for
the thread of interest.
On an interruption, if PSR.ic (Interruption Collection) is 1, the IPSR receives
the value of the PSR. The IPSR, IIP, and IFS are used to restore the
processor state on a Return From Interrupt (rfi). The IPSR has the same
format as the PSR. IPSR.ri is set to 0 after any interruption from the IA-32
instruction set.
# iadb
(0)> ut -t <thread-ID>
*ut_save: 0x0003ff002ff3b400 *ut_rsesave: 0x0003ff002ff3bf50
System call state:
ut_psr: 0x00001053080ee030
... more stuff...
(0)>mst 0x0003ff002ff3b400
mst at address 0003FF002FF3B400
prev : 0000000000000000
intpri : INTBASE
stackfix : 0000000000000000
backt :
kjmpbuf : 0000000000000000
emulator : NO
excbranch : E000000000020A80
excp_type : EXTINT(10)
ipsr : 00001010080AE030
isr : 0000000000000000
iip : E00000000000B970
ifa : E000009729F4F22A
iipa : E00000000000B960
ifs : 8000000000000716
iim : 00000000000000F4
fpsr : 0009804C0270033F
fpowner : LOW/HIGH
fpeu : YES
... tons of more stuff ...
(0)> q
From /usr/include/sys/machine.h:

#define PSR_PK  15

00001010080AE030 (hex) =
100000001000000001000000010101110000000110000 (binary)

Bit 15 is set, which means that the thread has the Protection Key set, and
hence is in problem state.
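Again, the bit test can be done programmatically. This minimal sketch (an
illustration, not part of the original exercise) checks bit 15 of the IPSR
value shown above:

    #include <stdio.h>

    #define PSR_PK 15               /* bit position, from <sys/machine.h> */

    int main(void)
    {
        /* IPSR value taken from the iadb output above */
        unsigned long long ipsr = 0x00001010080AE030ULL;

        if ((ipsr >> PSR_PK) & 1ULL)
            printf("PSR bit 15 set: problem state\n");
        else
            printf("PSR bit 15 clear: supervisor state\n");
        return 0;
    }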
Kernel Limits
Most of the settings in the kernel are dynamic and do not need to be tuned.
Their maximum values are chosen in such a way that they should never be
reached during normal system usage. Some limits chosen as a maximum
could technically be even higher.
The following tables list kernel system limits as of AIX 5L Version 5.0.
 Semaphores                              32-bit kernel   64-bit kernel
 Maximum number of semaphore IDs         131072          131072
 Maximum semaphores per semaphore ID     65535           65535
 Maximum operations per semop call       1024            1024
 Maximum undo entries per process        1024            1024
 Size in bytes of undo structure         8208            8216
 Semaphore maximum value                 32767           32767
 Adjust on exit maximum value            16384           16384
 Message Queues                          32-bit kernel   64-bit kernel
 Maximum message size                    4 MB            4 MB
 Maximum bytes on queue                  4 MB            4 MB
 Maximum number of message queue IDs     131072          131072
 Maximum messages per queue ID           524288          524288
Kernel Limits -- continued
 Shared Memory                           32-bit kernel   64-bit kernel
 Maximum region size                     2 GB            2 GB
 Minimum segment size                    1               1
 Maximum number of shared memory IDs     131072          131072
 Maximum number of segments per process  11              268435465
Several kernel parameters affect the availability of semaphores (semaem,
semmap, semmni, semmns, semmnu, semume). Check their values on the
running system, and keep in mind that other applications can also affect
the availability of semaphores.
 LVM                                       32-bit kernel   64-bit kernel
 Maximum number of VGs                     255             4095
 Maximum number of PPs per hdisk           1016            1016
 Maximum number of LVs                     256             512
 Maximum number of major numbers
 (see note 1)                              65535           1073741823
 Maximum number of VMM-mapped devices
 (see note 2)                              512             1024
 Maximum number of disks per VG            32              128
Kernel Limits -- continued
 Filesystems                               JFS      JFS2
 Maximum file system size (see note 3)     1 TB     32 PB
 Maximum file size (see note 4)            64 GB    32 PB
 Maximum size of log device                256 MB   32 PB
 Maximum number of file system inodes      2^24     Unlimited
 Maximum number of file system fragments   2^28     N/A
 Maximum number of hard links              32767    32767
 Miscellaneous                             32-bit kernel   64-bit kernel
 Maximum number of processes per system    131072          131072
 Maximum number of threads per system      262143          262143
 Maximum number of open files per system   1000000         Unlimited (resource bound)
 Maximum number of open files per process  32767           32767
 Maximum number of threads per process     32767           32767
 Maximum number of processes per user      131072          131072
 Maximum physical memory size              4 GB            1 TB
 Minimum physical memory size              32 MB           256 MB
 Maximum value for the wall                1 GB            4 GB
Kernel Limits -- continued
Notes:
1. Each volume group takes one major number; some are reserved for
the OS and for other device drivers. Run "lvlstmajor" to see the range
of free major numbers; rootvg always uses 10.
2. VMM-mapped devices are mounted JFS/CDRFS file systems, open
JFS log devices, and paging spaces. Of the 512, 16 are pre-reserved for
paging spaces. These devices are indexed through the kernel's
Page Device Table (PDT), which is a fixed-size array.
3. To achieve 1 TB, the file system must be created with nbpi=65536 or
higher and frag=4096.
4. To achieve around 64 GB files, the file system must be created with the
-a bf=true flag AND the application must support files greater than 2
GB.
Kernel Limits Exercises
Checking kernel values
The purpose of this exercise is to find the actual limits or settings in a running kernel. From the file /usr/include/sys/msginfo.h, we obtain the structure msginfo, which holds four integers. To list their contents in the running kernel, we use kdb on the Power platform and iadb on the IA-64 platform. On both systems, we display 16 bytes, equal to four integers.
/*
 * Message information structure.
 */
struct msginfo {
        int     msgmax,         /* max message size */
                msgmnb,         /* max # bytes on queue */
                msgmni,         /* # of message queue identifiers */
                msgmnm;         /* max # messages per queue identifier */
};
Power:
# kdb
(0)> d msginfo
msginfo+000000: 0040 0000 0040 0000 0002 0000 0008 0000
                msgmax    msgmnb    msgmni    msgmnm

IA-64:
# iadb
> d msginfo 4 4
e00000000415cfb0: 00400000 00400000 00020000 00080000
                  msgmax   msgmnb   msgmni   msgmnm
64-bit Kernel base enablement
Several components of base enablement support are provided to make it possible for kernel subsystems and kernel extensions to run in 64-bit mode and use a large address space.
State management support
Support is provided for saving and restoring 64-bit kernel context, including the full 64-bit GPR contents. This support also extends to kernel exception handling, where setjmpx() and longjmpx() must deal with 64-bit kernel context. In addition, state management is extended to include the 64-bit kernel address space as part of the kernel context.
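The shape of kernel exception handling with these services is sketched below. This is a minimal illustration, not code from the course: setjmpx(), clrjmpx(), and longjmpx() are documented AIX kernel services, but the exact headers vary and the risky_operation() routine is a hypothetical stand-in.

#include <sys/types.h>          /* label_t (assumed to come from here) */
#include <sys/errno.h>          /* EFAULT */

/* Prototypes normally come from the kernel headers; verify on a real
 * build machine. */
extern int  setjmpx(label_t *jump_buffer);
extern void clrjmpx(label_t *jump_buffer);

void risky_operation(void);     /* hypothetical routine that may except */

int
guarded_call(void)
{
    label_t jb;

    if (setjmpx(&jb) == 0) {
        /* Normal path: under the 64-bit kernel the saved context
         * includes the full 64-bit GPR contents and the kernel
         * address space state described above. */
        risky_operation();
        clrjmpx(&jb);           /* pop the handler when no exception hit */
        return 0;
    }

    /* Control arrives here via longjmpx() if an exception occurred. */
    return EFAULT;
}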
Temporary attachment
The 64-bit kernel provides kernel subsystems and kernel extensions with the capability to change the contents of the kernel address space. This includes the capability to change segments within the address space temporarily for a specific thread of execution, and is consistent with the segmented virtual memory architecture of the hardware and the legacy 32-bit kernel programming model.
A total of four concurrent temporary attachments will be supported under a
single thread of execution. This limitation is consistent with the limitation
imposed by the 32-bit kernel and is made to restrict the amount of kernel
state that must be saved and restored at context switch.
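Below is a minimal sketch of what a temporary attachment looks like in a kernel extension. The vm_att()/vm_det() kernel services and the vmhandle_t type are as documented for AIX; the header name and the surrounding routine are assumptions for illustration.

#include <sys/types.h>
#include <sys/vmuser.h>  /* vmhandle_t, vm_att(), vm_det() (assumed header) */

/* Copy len bytes starting at offset off of a virtual memory segment. */
void
copy_from_segment(vmhandle_t seg, size_t off, char *dst, size_t len)
{
    size_t i;

    /* Temporarily attach the segment to this thread's kernel address
     * space; at most four such attachments may be active per thread. */
    char *base = vm_att(seg, (caddr_t)0);

    for (i = 0; i < len; i++)
        dst[i] = base[off + i];

    /* Detach as soon as the access is complete. */
    vm_det(base);
}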
Global attachment
While the temporary attachment model is maintained, the 64-bit kernel
also provides a model under which subsystem data is placed within the
global kernel address space and made visible to all kernel code for the
entire life of its usefulness, rather than temporarily attaching segments as
needed and in the context of a single thread.
This global attachment model does more than allow the 64-bit kernel to
provide sufficient space for subsystems to place their data in the global
kernel heap. Rather, it includes the capability to place subsystem
segments within the global address space. This capability is needed for
two reasons:
• Different memory characteristics
• Data organized around segment
Some subsystems require virtual memory characteristics that are different
from those of the kernel heap. For the most part, these characteristics are
defined at the segment level and typically must be reflected by segment
types that are different from those used for the kernel heap. Also, some
subsystems organize their data around segments and require sizes and
alignments that are inappropriate for the kernel heap.
The global attachment model is of importance for a number of reasons.
First, it is more scalable than the temporary attachment model. This is
particularly true for subsystems that require large portions of their data to
be accessible at the same time for a single operation. As the volume of this
data increases to meet workload or hardware requirements, the temporary
attachment model proves impractical for these subsystems, as increasing
numbers of segments must be attached and detached. An example of
such a subsystem is the VMM, where page fault resolution and virtual
memory kernel services require access to all page frames and segment
descriptors.
The global attachment model is also of value in cases where only a small number of subsystem segments are involved. Segments are attached to the global kernel address space only once, typically at subsystem initialization, and are accessible from then on without requiring individual subsystem operations to incur the path length cost of segment attachment.
This is not to say that the global attachment model is without its own path length costs; specifically, use of this model may result in more segment lookaside buffer (SLB) reloads. This is because it provides no opportunity to prime the SLB table with virtual segment IDs (VSIDs) for soon-to-be-accessed segments. Rather, it relies upon the caching nature of the SLB table and updates SLBs with new VSIDs only when satisfying reload faults. This differs from the temporary attachment model, where VSIDs are placed in the SLB as part of segment attachment.
Finally, this model simplifies the general kernel programming model.
Subsystems are not required to deal with the complexity of segments,
segment offsets or segment attachments in accessing their data. Rather,
data accesses are made simply and naturally using addresses within the
flat kernel address space.
The specific subsystem segments that will be placed in the kernel address
space under the global attachment model include:
• Kernel Heap
Although traditionally part of the global address space, the
kernel heap segments will be placed in this space through
global attachment.
• File System Segments
The global segments used to hold the file and inode tables will
be provided through global attachment.
• mbuf Segments
The mbuf pool has long been a part of global space and this will
continue under the 64-bit kernel.
• VMM Segments
These segments are privately attached in the 32-bit kernel
legacy and hold the software page frame table, segment control
blocks, paging device table, file system lockwords, external
page tables, and address space map entries.
• Process and Thread Tables
Global attachment is used for the segments required for the
globally addressable process and thread tables.
All segments added to the global kernel address space through global
attachment will be strictly read/write for the kernel and no-access for users.
In addition, unaligned accesses to these segments will not be supported
and will result in a protection exception.
Data isolation
While placing subsystem data in the global kernel address space provides
significant benefits, it eliminates the data isolation that is provided by the
temporary attachment model. Under this model, data is typically made
accessible only while running subsystem code and is not generally
exposed to other subsystems. Unrelated interrupt handlers may gain
accessibility to data by interrupting subsystem code. However, this
exposure is more limited than that which occurs by placing data in global
space where all kernel code has accessibility.
Isolation is critical for some classes of subsystem data. As a result, not all subsystem data should be placed in the global kernel address space. In particular, file systems should continue to use temporary attachments to provide isolation for user data.
Kernel address space layout
The kernel address space layout preserves the existing 32-bit and 64-bit user address layouts that are now found under the legacy 32-bit kernel. In addition, a common global kernel and per-process user address space is provided. This is required for a number of performance reasons:
• Efficient transition between kernel and user mode
• Preservation of SLBs
• Reduced complexity
• Single per-process segment table
To begin, a common address space improves the efficiency of transition between kernel and user mode, since there is no need to switch address spaces. Next, it preserves SLBs. This is because the segments within the user and kernel address space are common, so there is no need to use separate SLBs or perform SLB invalidation at user/kernel transitions. Also, a common address space reduces the complexity and path length associated with kernel access to user space. There is no longer a need for the kernel to gain addressability to segments from a separate user address space in performing accesses, or to serialize accesses against changes in the user address space. Rather, user segments are already in place and properly serialized in the common address space. Finally, the common address space supports the efficiency of a single per-process segment table.
Temporary attachments are not included as part of the common address
space. This is for a number of reasons. First, data isolation would be
impacted for temporary attachments if they were placed in the common
address space. This is because the attached data would be accessible in
the kernel by all threads of a process rather than only by the thread that
performed the temporary attachment. Second, it would be inefficient for the
common address space to include temporary attachments. This is due to
the fact that changes to the common address space would have to be
serialized among all threads of a process.
I/O space mapping
The 64-bit kernel supports I/O space at locations below and above 4 GB within the hardware system memory map. Under the 64-bit kernel, I/O space is virtually mapped through the page translation hardware and made accessible through segments on all supported hardware system implementations. In the legacy 32-bit kernel on current hardware systems, I/O space virtual access is achieved through block address translation (BAT) registers, but this capability is not provided by the Gigaprocessor hardware.
Performance when accessing I/O addresses
The capability to place portions of I/O space within the global kernel address space must be provided to allow temporary attachment overhead to be avoided. This capability is built upon the global attachment model. Along with services to support this, other services are provided that allow portions of I/O space to be temporarily attached. However, these services form an I/O space temporary attachment model that is slightly different from the one now found under the 32-bit kernel. Specifically, I/O space mappings must be created prior to any temporary attachments and destroyed once all temporary attachments are complete. These mapping operations are performed by individual device drivers through new services and typically occur at the time of device configuration and de-configuration. Compare this to the existing model under the 32-bit kernel, where no separate mapping operations are present.
I/O mapping in 64-bit kernel mode
The mapping operations are provided under the 64-bit kernel model for a number of reasons. The first is performance. While the 32-bit kernel model does not require I/O space to be mapped before it is attached, it does require each temporary attachment to perform some level of mapping. Under the 64-bit kernel model, each device driver maps its portion of I/O space once at initialization time and incurs no additional mapping overhead in performing temporary attachments. Next, the presence of the mapping operations provides efficient use of system resources. I/O space is mapped in virtual memory through the page table and segments under the 64-bit kernel, and these system resources are only consumed for portions of I/O space that are actually in use. In the absence of mapping operations, the 64-bit kernel itself would have to map all of I/O space into virtual memory and possibly waste resources for unused portions. In addition to potentially wasting resources, arming the kernel with the responsibility of mapping I/O space would lead to arbitrary layouts of I/O space in virtual memory and would not support data isolation. Finally, the interfaces for performing temporary attachments are simplified, as no I/O mapping information must be specified. This implies new interfaces for attaching and detaching from I/O space.
The new I/O space temporary attachment model and supporting services are provided not only under the 64-bit kernel but under the 32-bit kernel as well. This is required to ease the migration of 32-bit device drivers to the 64-bit kernel environment and to make it simpler to maintain 32-bit and 64-bit versions of a single device driver.
Rather than placing their respective portions of I/O space in the global
kernel address space, most device drivers should continue to access I/O
space through temporary attachments. This is because a large proportion
of these accesses occur under interrupts and would more than likely miss
the SLB table if the accesses were performed using the global attachment
model. While the temporary attachment model adds overhead to I/O space
accesses, it typically avoids the SLB miss performance penalty by priming
the SLB table.
LP64 C language data model
The 64-bit kernel uses the LP64 (Long Pointer 64-bit) C language data model. This data model was chosen for a number of reasons. First, the LP64 data model is also used by 64-bit AIX applications, and this allows the 64-bit kernel to support these applications in a straightforward manner. Of the prevailing 64-bit data models, including ILP64 and LLP64, the LP64 data model is most consistent with the ILP32 data model used by 32-bit applications. This consistency simplifies 32-bit application support under the 64-bit kernel and allows 32-bit and 64-bit applications to be supported in fairly common ways. Next, LP64 has been chosen as the data model for the 64-bit kernel implementations provided by key UNIX vendors, including SGI, Sun, and HP. Use of a common data model simplifies matters for ISVs, and enables AIX to use industry-wide solutions to some problems. Finally, the 64-bit kernel requires no new compiler functionality and can use the existing 64-bit mode compiler.
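The practical difference between the data models is easy to see with a small program (an illustration, not part of the course). Compiled in 32-bit mode (for example, xlc -q32) it reports the ILP32 sizes; compiled in 64-bit mode (xlc -q64) it reports the LP64 sizes, where long and pointer grow to 8 bytes while int stays at 4.

#include <stdio.h>

int main(void)
{
    /* ILP32: int = long = pointer = 4 bytes.
     * LP64:  int = 4 bytes; long and pointer = 8 bytes. */
    printf("sizeof(int)    = %u\n", (unsigned)sizeof(int));
    printf("sizeof(long)   = %u\n", (unsigned)sizeof(long));
    printf("sizeof(void *) = %u\n", (unsigned)sizeof(void *));
    return 0;
}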
Register conventions
The register conventions used in the 64-bit kernel environment are the same as those used in the 64-bit application environment. This means that general purpose register 13 is reserved for operating system use.
64-bit Kernel stack
Kernel stack
64-bit code has greater stack requirements than 32-bit code. This is for two reasons. First, the amount of stack space required to hold subroutine linkage information increases for 64-bit code, since this information is made up of register and pointer values, and these values are larger 64-bit quantities. Second, long and pointer values are 64-bit quantities for 64-bit code and consume more space when maintained as stack variables.
The larger stack requirements of 64-bit code also mean that stack-related sizes under the 64-bit kernel are increased over those of the 32-bit kernel. In fact, most existing stack sizes will double.
Minimum stack size
Under the 64-bit kernel, the components of the common subroutine linkage, such as the link register and TOC pointer, are 64-bit quantities. As a result, the minimum stack frame size is 112 bytes.
Process context stack size
Consistent with the 32-bit kernel, the kernel stacks for use in process context are 96 KB in size. This size should prove to be sufficient for the 64-bit kernel, since it has been found to be twice what is actually needed for the 32-bit kernel.
Interrupt stack size
The interrupt stack will be 8 KB in size under the 64-bit kernel. This size is clearly warranted, since some interrupt handlers find the 4 KB interrupt stack size of the 32-bit kernel to be insufficient.
Dynamic resource pools
To allow scalability, resource pools are allocated dynamically from the kernel heap and through separately created segments intended for this purpose. This means that some existing resource pools, like the shared memory, message queue, and semaphore ID pools, are relocated from the kernel BSS.
Kernel heap
The kernel heap is the home of most kernel data structures, and is
sufficiently large to allow subsystems to scale fixed resource pools, while
at the same time, providing adequate space for dynamically allocated
resources. To provide this, the kernel heap is expanded to encompass a
larger number of segments and placed above 4 GB within the global kernel
address space to accommodate its larger size.
While the kernel heap is extended and moved above 4 GB, the interfaces provided for allocation and freeing from this heap are the same as those provided under the 32-bit kernel. The use of these interfaces is pervasive, so common interfaces ease the 64-bit kernel porting effort for kernel subsystems and kernel extensions and make it simpler to support both kernels.
The kernel heap is now expanded to 16 segments, for a total of about 4 GB of allocatable space. This is more than eight times larger than the space available under the 32-bit kernel.
Allocation requests are only limited in size by the amount of available heap space, rather than by some arbitrary limit. This means that the segments that make up the kernel heap are laid out contiguously within the address space, and requests for more than a segment's worth of data are granted if sufficient free space is available. It also means that a request can be satisfied with space that crosses segment boundaries.
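The allocation interfaces in question are the xmalloc() and xmfree() kernel services, used roughly as sketched below. The service names and the kernel_heap symbol are as documented for the 32-bit kernel, and per the text the same interfaces apply under the 64-bit kernel; the header name and the wrapper routines are assumptions for illustration.

#include <sys/types.h>
#include <sys/malloc.h>       /* xmalloc(), xmfree(), kernel_heap (assumed) */

/* Allocate a zeroed, 16-byte-aligned buffer from the kernel heap.
 * The second xmalloc() argument is the log2 of the required alignment,
 * so 4 means 2^4 = 16-byte alignment. */
void *
alloc_table(size_t nbytes)
{
    void *p = xmalloc(nbytes, 4, kernel_heap);

    if (p != NULL)
        bzero(p, nbytes);     /* heap memory is not zeroed for the caller */
    return p;
}

void
free_table(void *p)
{
    xmfree(p, kernel_heap);
}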
A separate global heap reserved for the loader is provided in segment zero
(that is, the kernel segment). This heap is used to hold the system call
table and svc_instructions code for 32-bit applications and must be placed
in segment zero, because it is the only global segment that is mapped into
the 32-bit user address space. This heap is also used to hold the system
call table for 64-bit applications and loader sections for kernel extensions.
This data is located in the loader heap because it must be readable in user
mode. This type of access is not supported for the kernel heap.
CPU big- and little-endian
Memory view for big- and little-endian systems
Although both the Power and IA-64 architectures support big-endian and little-endian implementations, the endianness of AIX 5L running on IA-64 and of AIX 5L on PowerPC is different. AIX 5L for IA-64 is little-endian, and AIX 5L for PowerPC is big-endian.
Logically, in multi-digit numbers, the leftmost digits are more significant, and the rightmost less. For example, in the four-digit number 8472, the 4 is more significant than the 7.
System memory can be viewed in two ways. The figure below shows a 100-byte memory seen both ways. Try to write the number 1234567890 at addresses 0-9 in both figures. What is the digit in the byte at address two?
[Figure: two views of the same 100-byte memory, ten bytes per row. In the first view, addresses increase from right to left, with address 0 at the bottom right and address 99 at the top left. In the second view, addresses increase from left to right, with address 00 at the top left and address 99 at the bottom right.]
Register and memory byte order
Computers address memory in bytes while manipulating data in words (of
multiple bytes). When a word is placed in memory, starting from the lowest
address, there are only two options: Either place the least significant byte
first (known as little-endian) or place the most significant byte first (known
as big-endian).
register (bit 63 ... bit 0):         a b c d e f g h

big-endian memory:     address:      0 1 2 3 4 5 6 7
                       contents:     a b c d e f g h

little-endian memory:  address:      0 1 2 3 4 5 6 7
                       contents:     h g f e d c b a
In the register layout shown in the figure above, “a” is the most significant
byte, and “h” is the least significant byte. The figure also shows the byte
order in memory. On big-endian systems, the most significant byte will be
placed at the lowest memory address. On little-endian systems, the least
significant byte will be placed at the lowest memory address.
Power, PowerPC, most RISC-based computers, IBM 370 computers, and
Internet protocol (IP) are some examples of things that use the big-endian
data layout. Intel processors, Compaq Alpha processors, and some
networking hardware are examples of things that use the little-endian data
layout.
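A short program makes the difference visible. It is not part of the course materials, but it compiles unchanged on both systems and reports which byte the hardware places at the lowest address.

#include <stdio.h>

int main(void)
{
    unsigned int word = 0x04030201;
    unsigned char *byte = (unsigned char *)&word;

    /* Big-endian (PowerPC): the most significant byte, 0x04, sits at
     * the lowest address.  Little-endian (IA-64): the least significant
     * byte, 0x01, sits there instead. */
    if (byte[0] == 0x01)
        printf("little-endian\n");
    else
        printf("big-endian\n");
    return 0;
}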
Multi Processor dependent designs
Kernel lock
The kernel lock is not supported under the 64-bit kernel. This lock was originally provided to allow subsystems to deal with the pre-emptive nature of the AIX kernel on uniprocessor hardware, and was later used as a means of ensuring correctness for non-MP-safe subsystems on MP hardware. At a minimum, all 64-bit kernel subsystems and kernel extensions must be MP-safe, with most required to be MP-efficient to meet performance requirements. As a result, the kernel lock is no longer required.
Device funneling
Under the 64-bit kernel, no support will be provided for device funneling. This means that all device drivers must be MP-safe and identify themselves as such when registering devices and interrupt handlers.
Device funneling was originally provided under the 32-bit kernel so that non-MP-safe device drivers could run correctly on multi-processor hardware with no change. However, all device drivers must change to some extent under the 64-bit kernel, and this provides the opportunity to simplify the 64-bit kernel by not providing device funneling support and requiring additional changes for the set of device drivers that are not MP-safe.
Of the existing IBM Austin-owned device drivers, only the X.25 and graphics device drivers are not MP-safe. However, this is of no concern, since X.25 will not be provided under the 64-bit kernel and the (new) graphics drivers that will be provided in the time frame of the 64-bit kernel will be MP-safe.
Command and Utility compatibility for 32-bit and 64-bit kernels
Commands and utilities
A number of AIX-supplied commands and utilities deal directly with kernel details and require different implementations under the different kernels. Commands based upon /dev/kmem or /dev/mem serve as an example. While two different implementations may be required, the AIX-supplied commands and utilities must present a common binary. This is required to support a common system base and means that a single binary front-end must be used, but it does not dictate that only a single binary be used. In fact, two binaries make sense in cases where kernel data structures are used (as in vmstat) and these data structures have different sizes or formats under 32-bit and 64-bit compilations. Rather than duplicating data structures for a single binary, both a 32-bit and a 64-bit binary version are provided; one of these serves as a front-end and executes the other when the bit-ness of the kernel does not match its own. This implementation ensures that there is one common command interface for both the 32-bit and 64-bit kernels.
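A hedged sketch of the front-end idea follows. It is not the actual AIX implementation: the vmstat_32/vmstat_64 path names are hypothetical, though `getconf KERNEL_BITMODE` is the usual way to query the kernel's bit-ness on AIX 5L.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Ask the running kernel whether it is 32-bit or 64-bit. */
static int kernel_bits(void)
{
    FILE *p = popen("/usr/bin/getconf KERNEL_BITMODE", "r");
    int bits = 32;

    if (p != NULL) {
        if (fscanf(p, "%d", &bits) != 1)
            bits = 32;
        pclose(p);
    }
    return bits;
}

int main(int argc, char *argv[])
{
    int mybits = (int)(sizeof(void *) * 8);   /* fixed at compile time */

    (void)argc;
    if (kernel_bits() != mybits) {
        /* Hand over to the sibling binary built for the other mode
         * (hypothetical path names). */
        execv(mybits == 32 ? "/usr/sbin/vmstat_64"
                           : "/usr/sbin/vmstat_32", argv);
        perror("execv");                      /* reached only on failure */
        return 1;
    }

    /* ... proceed, reading kernel structures that match our own mode ... */
    return 0;
}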
Exceptions
Exceptions and interrupts distinction
The distinction between the terms "exception" and "interrupt" is often blurred. The bulk of AIX documentation refers to both classes generically as "interrupts," while the hardware documentation (like the PowerPC 60x User's Manuals) makes the distinction. We will try to keep the terms separate.
Definition of exceptions
Exceptions are synchronous events that are normally caused by the process doing something illegal.
An exception is a condition caused by a process attempting to perform an action that is not allowed, such as writing to a memory location not owned by the process, or trying to execute illegal operations. For illegal operations, the kernel traps the offending action and delivers a signal to the process causing the exception (or crashes, if the process was in kernel mode). Exceptions can also be caused by a page fault. A page fault is a reference to a virtual memory location for which the associated real data is not in physical memory.
Determine the action taken on an exception
The result of an exception is either to send a signal to the process or to crash the machine. The decision is based upon what kind of exception occurred and whether the process was executing in user mode or kernel mode:
• Exceptions are caused within the context of a process.
• A process may NOT decide how to react to the exception.
• Exception handlers are kernel code and run without regard to the process, except to cleanly handle the exception generated by the process.
• Some exceptions result in the death of the process.
• Some exception types can be found in <sys/m_except.h>.
A process can decide how to respond to the signal generated by the exception in certain cases. For example, a process can decide to catch the signal for SIGILL, which is delivered when a process in user mode executes an illegal instruction.
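For example, the small program below (an illustration, not from the course materials) installs a handler so that an illegal-instruction signal is caught instead of killing the process with the default action.

#include <unistd.h>
#include <signal.h>

static void on_sigill(int sig)
{
    static const char msg[] = "caught SIGILL\n";

    (void)sig;
    /* write() is safe inside a signal handler.  Returning would retry
     * the offending instruction, so terminate instead. */
    write(STDERR_FILENO, msg, sizeof(msg) - 1);
    _exit(1);
}

int main(void)
{
    struct sigaction sa;

    sa.sa_handler = on_sigill;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGILL, &sa, NULL);

    raise(SIGILL);     /* stands in for executing an illegal instruction */
    return 0;
}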
An exception is also a mechanism to change to supervisor state as a result of:
• Program errors
• Unusual conditions
• Program requests
Branching to exception handlers
After an exception, the system switches to supervisor state and branches to an exception handler routine. The branch address is found from the contents of a specific memory location called a "vector."
Examples of exception vectors:
• System reset
• Machine check
• Data storage interrupt (DSI)
• Instruction storage interrupt (ISI)
• Alignment
• Program (invalid instruction or trap instruction)
• Floating-point unavailable
• Decrementer
• System call
System reset exception
The system reset exception is used when a system reset is initiated by the system administrator. This generally causes a "soft" reboot of the system.
Machine check exception
The machine check exception is generated when a hardware machine check occurs. This generally indicates either a hardware bus error or a bad real address access. If a machine check occurs with the ME bit off, then a machine checkstop occurs. Generally, a machine check exception causes a kernel crash dump to be generated. A machine checkstop causes no kernel crash dump to be generated, though a checkstop record is generated.
Data storage exception
Data storage interrupt (DSI) and instruction storage interrupt (ISI) exceptions are caused by the hardware not being able to find a translation for an instruction fetch or a load/store operation. These generally result in a page fault.
Alignment exception
Alignment exceptions are generated when an instruction generates an unaligned memory operation that cannot be completed by the hardware. Which unaligned operations cannot be handled by the hardware is processor-dependent. This exception generally results in AIX performing the unaligned operation with special-purpose code.
Invalid instruction exception
The program exception is generated when an illegal instruction or trap instruction is executed. This is generally caused by debugger breakpoints in a process being hit. This exception generally results in a call to an application or kernel debugger.
Floating point unavailable exception
The floating point unavailable exception is caused when a thread executes a floating point instruction while floating point operations are not allowed. This generally indicates that a thread has not executed any floating point instructions yet, or that another thread's floating point data is currently in the processor's floating point registers. AIX does not save a thread's floating point register values until it first uses the floating point registers. On UP systems, AIX does not save off the floating point registers for the currently running thread when another thread is dispatched. Often, no other thread will use the floating point registers before the thread is again dispatched. This saves AIX from having to save and restore the floating point registers on every thread dispatch.
Decrementer exception
The decrementer exception is caused when the decrementer register has reached the value zero. This indicates that a timer operation has completed.
System call exception
The system call exception occurs whenever a thread executes a system call.
Interrupts
Description of interrupts
Interrupts are asynchronous events that may be generated by the system or a device, and they "interrupt" the execution of the current process.
Interrupts usually occur when a process is running and some asynchronous event occurs, such as a disk I/O completion or a clock tick. The event usually has nothing to do with the currently running process. The kernel immediately preempts the currently running process to handle the interrupt. The state of the machine is saved on the stack and the interrupt is handled. The user process has no knowledge that the interrupt occurred.
Interrupts are one of the major reasons that AIX cannot be a hard real-time system. No guarantee can be made as to how long it may take for some action to occur, as it may get interrupted any number of times during the action.
Interrupts are caused outside the context of a process. In general, a process may NOT decide how to react to the interrupt. Interrupt handlers are kernel code and run without regard to the process, unless the nature of the interrupt is to update some process-related structure, statistics, and so on.
Interrupt levels
Each interrupt has a level and an associated priority; the level is a value that is used to differentiate between interrupts. The priority ranks the importance of each one.
Devices with interrupt facilities, such as adapter cards, have an associated interrupt level. When the system receives an interrupt with that level, AIX then knows that it was caused by the device at that level.
In AIX, devices may share interrupt levels, such that more than one adapter may use the same level.
Controlling Interrupts
A kernel process can disable some or all types of interrupts for short periods. The interrupted process will safely return to continue execution. Some interrupt types can be found in <sys/m_intr.h>.
Most interrupts are not concerned with which process is getting interrupted. The major counterexample is the clock interrupt, which is used to update the run-time statistics for the currently running process.
Critical sections
A critical section is a code section that must be executed without any break; for example, code in which data is examined and then changed based on the value. A process would disable interrupts across a critical section to ensure that the section is executed without breaks.
Out-of-order instruction execution and interrupts
On modern processors, such as Power and IA-64, many instructions are being executed at one time. When a hardware interrupt occurs, in-progress instructions are executed to completion and any following instructions are terminated with no effect on the processor registers or memory; results from out-of-order instructions are discarded. This is what is meant by "interrupts are guaranteed to occur between the execution of instructions." The processor makes sure that the effect of its operations is equivalent to an interrupt occurring between the execution of instructions.
Interrupt handling in AIX
Interrupt handling
When an interrupt is received, AIX performs several steps to handle the interrupt properly:
• Saves the current state of the machine.
• Determines the real handler for the interrupt.
• Calls that handler to "service" the interrupt.
• Restores the machine state if and when the handler completes.
Interrupt priorities
Interrupt priorities have no relationship to process and thread scheduling priorities.
AIX associates a priority with each type of interrupt. A lower priority number means a more favored interrupt. Interrupt processing can itself be interrupted, but only by a more favored (lower priority number) interrupt. Interrupt routines usually allow themselves to be interrupted by more favored interrupts, but refuse to take less favored interrupts; however, interrupt routines and other programs running in kernel mode can manually raise or lower their interrupt priority. This is called "disabling or enabling interrupts." The reason for this is that, for example, a high-priority disk handler must complete in time, before new data arrives, and it does not want to be interrupted by less favored interrupts.
Handling CPU state information at interrupt
Saving and restoring machine state
AIX maintains a set of machine state save (mstsave) areas. Each processor has a pointer to the mstsave area it should use when the next interrupt occurs. This pointer is called the current save area, or csa pointer.
When state needs to be saved, AIX:
• Saves almost all registers into the mstsave area pointed to by this processor's csa.
• Gets the next available mstsave area from this processor's pool.
• Links the just-saved mstsave area to the new mstsave area.
• Updates this processor's csa to point to the new area.
When an interrupt handler returns, AIX must restore the machine state that was in effect when the interrupt occurred. AIX does this by:
• Reloading registers from the processor's previous mstsave area.
• Setting the processor's csa pointer to the (now unused) previous mstsave area.
• If returning to base interrupt level, generally rerunning the dispatcher to determine which thread to resume, since the interrupt might have made another thread runnable.
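The following runnable user-space simulation (illustrative only; the real mstsave layout is machine-dependent and lives in the kernel) shows the csa bookkeeping that the two lists above describe.

#include <stdio.h>

struct mst {
    struct mst *prev;         /* next-older save area; chain ends at base */
    int state;                /* stands in for the saved register contents */
};

static struct mst pool[8];    /* this processor's pool of save areas */
static int next_free = 1;
static struct mst *csa = &pool[0];   /* always points at an unused area */

/* Interrupt arrives: save into *csa, then take a fresh area. */
static void interrupt_entry(int running_state)
{
    struct mst *saved = csa;

    saved->state = running_state;     /* "save almost all registers" */
    csa = &pool[next_free++];         /* next available area from the pool */
    csa->prev = saved;                /* link unused area to just-saved one */
}

/* Handler returns: restore from csa->prev, which becomes the (now
 * unused) current save area again. */
static int interrupt_return(void)
{
    struct mst *saved = csa->prev;

    next_free--;                      /* the popped area goes back to pool */
    csa = saved;
    return saved->state;              /* "reload registers" */
}

int main(void)
{
    interrupt_entry(100);   /* base level work is interrupted        */
    interrupt_entry(200);   /* the handler itself is interrupted     */
    printf("restored %d\n", interrupt_return());   /* prints 200 */
    printf("restored %d\n", interrupt_return());   /* prints 100 */
    return 0;
}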
mstsave area description
Because the mstsave (machine state) areas are linked together, the mstsave areas provide an interrupt history stack.

csa --> mstsave (unused; the next interrupt goes here)
          prev --> mstsave (high priority interrupt)
                     prev --> mstsave (low priority interrupt)
                                prev --> mstsave (base interrupt level)
Whenever AIX receives an interrupt that is of higher priority than what it is
currently doing, it must save the state of the machine into an mstsave
area. The csa (Current Save Area) pointer points to an unused mstsave
area that AIX can use if another, higher-priority interrupt comes in. This
area may contain stale data from being used for a previously-handled
interrupt, but its prev pointer always points to the previous mstsave area
(or is null if there aren’t any more in use at that time).
These areas are linked together from most-recently to least-recently used,
so this means that they go from higher to lower interrupt priority. At the end
of the mstsave chain is the mstsave area for the base interrupt level. This
mstsave area contains the state of the machine when it was last doing
something other than interrupt processing (that is, the machine state when
the oldest interrupt that we are currently processing came in).
Size limitation on mstsave area and interrupt stack
The stack used by an interrupt handler is kept in the same page as the mstsave area. This limits the stack to 4 KB on the 32-bit kernel and 8 KB on the 64-bit kernel, minus the size of the mstsave area. Using this area for the stack ensures that the stack is pinned, which is required for interrupt handlers.
Saving base level machine state
The thread's base level state save area is in the thread's uthread block. The initial thread's uthread block is in the process's ublock. In the 32-bit kernel, there is also the user64 area, which is used to save the 64-bit user registers for 64-bit processes.

[Figure: the process ublock, containing the user area, the initial thread's uthread block with the base level mstsave area, and the user64 area (32-bit kernel only).]
The user64 area is only used when the process is a 64-bit process running on a 32-bit kernel. If the user64 area is being used, it is initialized and pinned. The area is created when a process calls exec() for a 64-bit executable. It is destroyed when a 64-bit process exits or calls exec() for a 32-bit executable.
The portion of the base level state save area that contains the 32-bit registers is unused for 64-bit processes.
On a 32-bit kernel, only the base level state save (MST) area needs to have a 64-bit register state save area (user64) associated with it. Since all interrupt handlers run in 32-bit kernel mode, all state save areas other than the base level state save area need only save 32-bit state (even on 64-bit hardware). On a 64-bit kernel, all MST areas are 64-bit.
Unit 2. IA-64 Hardware Overview
This unit describes the IA-64 hardware architecture.
What You Should Be Able to Do
• List the registers available to programs
• Describe how EPIC improves performance
IA-64 Hardware Overview
Introduction to IA-64
IA-64 is Intel's 64-bit architecture, based on the Explicitly Parallel Instruction Computing (EPIC) design philosophy. These are the IA-64 goals:
• Overcome the limitations of today's architectures.
• Provide world class floating point performance.
• Support large memory needs with 64-bit addressability.
• Protect existing investments with IA-32 compatibility.
• Support growing high-end application workloads for e-business, enterprise, and technical computing.
Performance
IA-64 increases performance by using available compile-time information
to reduce current performance limiters, thereby moving some of the
performance burden from the microarchitecture to the compiler. This
enables designing simpler processors, which are more likely to achieve
higher frequencies.
To achieve improved performance, IA-64 code:
• Increases instruction level parallelism (ILP)
• Improves branch handling
• Hides memory latencies
• Supports modular code
IA-64 increases ILP by providing more architectural resources: large
register files, and a 3-instruction wide word.
The architecture also enables the compiler/assembly writer to explicitly
indicate parallelism.
Branch handling is improved by providing the means to minimize branches in the code, increasing the branch prediction rate for the remaining branches, and providing specific support for typical branches.
Memory latency is reduced by allowing the compiler to schedule loads
earlier in the code and enabling memory hierarchy cache management.
IA-64 supports the current compiler trend to produce modular code by
providing specific hardware support for function calls and returns.
IA-64 formats
Data types
The following data types are supported:
• Integer: 1, 2, 4, and 8 bytes
• Floating-point: single, double, and double-extended formats
• Pointers: 8 bytes

[Figure: integer data types of 1, 2, 4, and 8 bytes (bit positions 63, 31, 15, 7, 0) and floating-point data types (bit positions 79, 63, 31, 0).]
The basic IA-64 data type is 8 bytes. Apart from a few exceptions, all integer operations are on 64-bit data, and registers are always written as 64 bits. Therefore, 1, 2, and 4 byte operands loaded from memory are zero-extended to 64 bits.
Instruction format
A typical IA-64 instruction is a three-operand instruction, with the following syntax:

[(qp)] mnemonic[.comp1][.comp2] dests = srcs

(qp)             A qualifying predicate is a predicate register indicating whether or not the instruction is executed. When the value of the register is true (1), the instruction is executed. When the value of the register is false (0), the instruction is executed as a NOP. Instructions that are not explicitly preceded by a predicate assume the first predicate register, p0, which is always true. Some instructions cannot be predicated.
mnemonic         A unique name identifying the instruction.
[comp1][comp2]   Some instructions may include one or more completers. Completers indicate optional variations on the basic mnemonic.
dests, srcs      Most IA-64 instructions have at least two source operands and a destination operand. Source operands are used as input. Typically, the source operands are registers or immediates. The destination operand(s) is typically a register to which the result is written.
Some examples of different IA-64 instructions:

Simple instruction:           add r1 = r2, r3
Predicated instruction:       (p4) add r1 = r2, r3
Instruction with immediate:   add r1 = r2, r3, 1
Instruction with completer:   cmp.eq p3 = r2, r4
IA-64 memory
Memory organization
IA-64 defines a single, uniform, linear address space of 2^64 bytes which
is divided into 8 regions of size 2^61. A single space means that both data
and instructions share the same memory range. Uniform means that there
are no address regions with predefined functionality. Linear means that the
address space contains no segments; all 2^64 bytes are consecutive.
All code is stored in little-endian byte order in memory. Data is typically
stored in little-endian byte order. IA-64 also provides support for big-endian
code and operating systems.
Moving data between registers and memory is performed strictly through the load (ld) and store (st) instructions. IA-64 supports loads and stores of all data types. Because registers are written as 64 bits, loads are zero-extended. Stores always write the exact number of bytes for the required format.
The size of the memory location is specified in the opcode as a number:
• st1/ld1 = byte (8 bits)
• st2/ld2 = halfword (16 bits)
• st4/ld4 = word (32 bits)
• st8/ld8 = doubleword (64 bits)
Examples:

// Load 32 bits from address 4 + r30 into r31.
// The high 32 bits are cleared on a 64-bit processor.
add r31 = 4, r30
ld4 r31 = [r31]

// Store 64 bits from r3 to address r29 - 8.
add r24 = -8, r29
st8 [r24] = r3

// Load 8 bits from address 0x27 + r1 into r3.
add r2 = 0x27, r1
ld1 r3 = [r2]
Region Usage
On IA-64, the 64-bit linear address space consists of 8 regions of size 2^61, with the upper 3 bits of the address selecting a virtual region, a physical region register, and an associated region identifier. The region identifier (RID), much like the POWER segment identifier (SID), participates in the hardware address translation, such that in order to share the same address translation, the same RID must be used. The sharing semantic (private, globally shared, shared-by-some) is determined by whether or not multiple processes utilize the same RID.
For example, a process's private storage resides within a region whose RID is mapped only by that process. Therefore, address space usage is in large part determined by assigning the desired sharing semantics to each of the 8 virtual regions and mapping the appropriate objects into those regions that require those semantics.
There are two important properties associated with this region usage. First, the mapping of objects to regions is many-to-one. That is, multiple objects map into a single region. Second, mapping the same object to different regions results in aliases. This is a distinct difference from the POWER architecture, where an object (a.k.a. SID) is addressed the same regardless of the virtual address used. Aliases simply create additional address translations on IA-64, and thus a likelihood of decreased performance, so their use should be minimized.
Another significant departure from AIX is that the majority of the 64-bit address space is managed using Single Address Space (SAS) semantics. This is necessary to achieve the desired degree of sharing of address translations for shared objects: to achieve a single translation for an object, all accesses must be made through a common global address. Such a semantic is possible by virtue of the IA-64 protection keys, which provide additional access control beyond address translations. So, a process that maps a region only has accessibility to those objects within that region for which it has the appropriate protection key. Note that AIX manages some parts of the process address space as SAS -- for example, the shared library text segment contains mappings whose addresses are common across all processes. The AIX use of the SAS style of management is minimal, because the POWER architecture provides for sharing on a segment basis regardless of the virtual address used to map the segment. To achieve the same degree of sharing on IA-64, a shared object must be mapped at a global address.
In addition to the sharing semantics, there are additional properties that influence the location of objects within regions. First, to preserve the flat address space with a logical boundary between user and kernel space, it is useful to place user and kernel objects at opposite ends of the address space whenever feasible. Next, the IA-64 architecture provides for multiple page sizes and a preferred page size per region, so objects with similar page size requirements are most naturally colocated within the same region. Finally, certain object types, such as executable text, have properties and uses which mandate that they be isolated to a separate region.
Given these general guidelines, the following table shows the selected region usage, and subsequent sections describe each region use in greater detail. These selections provide for 4 regions dedicated to user space and 3 to the kernel for the initial release.
Virtual Region Usage:

VRN   Style     Name      Example Uses
0     MAS       Private   process data, stack, heap, mmap, ILP32 shared library
1     SAS/MAS   Text      private text, ILP32 main text, u-block, kernel thread stacks/msts
2     SAS                 LP64 shared library text, LP64 main text
3     SAS                 LP64 shmat; LP64 shmat w/ large superpage
4     n/a                 reserved
5     SAS       Temp      kernel temporary attach, global buffer pool
6     SAS       Kernel2   kernel global w/ large page size
7     SAS       Kernel    kernel global
IA-64 Instructions
Instruction level parallelism (ILP)
IA-64 enables improving instruction level parallelism (ILP) by:
• Enabling the compiler/assembly writer to explicitly indicate parallelism.
• Providing a three-instruction-wide word, called a bundle, that facilitates parallel processing of instructions.
• Providing a large number of registers, enabling the use of different registers for different variables and avoiding register contention.
Parallel instruction processing
IA-64 instructions are bound in instruction groups. An instruction group is a set of instructions which do not have read-after-write (RAW) or write-after-write (WAW) dependencies between them and may execute in parallel. In any given clock cycle, the processor executes as many instructions from one instruction group as it can, according to its resources.
An instruction group must contain at least one instruction; the number of instructions in an instruction group is not limited. Instruction groups are indicated in the code by cycle breaks (;;) placed in the code by the assembly writer or compiler. An instruction group may also be ended dynamically during run-time by a taken branch.
Instruction groups reduce the need to optimize the code for each new microarchitecture. Processors with additional resources will take advantage of the existing ILP in the instruction group.
Instruction groups and bundles
Instruction groups are composed of 41-bit instructions contained in bundles. Each bundle contains three instructions and a template field, which are set during code generation by a compiler or the assembler. The code generation process ensures instruction group assignment without RAW or WAW dependency violations within the instruction group.
The template field maps each instruction to an execution unit. This allows the processor to dispatch all three instructions in parallel.
Bundles are aligned at 16-byte boundaries.
Bundle structure (128 bits):

bits 127-87   instruction slot 2
bits  86-46   instruction slot 1
bits  45-5    instruction slot 0
bits   4-0    template
Template
The set of templates defines the combinations of functional units that can be invoked by executing a single bundle. This in turn lets the compiler schedule the functional units in an order that avoids contention. The template can also indicate a stop. The 24 available templates are listed below.
M is a memory function, I an integer function, F a floating point function, B a branch function, and L a function involving a long immediate; "s" indicates a stop.

MII    MIsI   MLX*   MMI    MsMI   MFI    MMF    MIB    MBB    BBB    MMB    MFB
MIIs   MIsIs  MLXs*  MMIs   MsMIs  MFIs   MMFs   MIBs   MBBs   BBBs   MMBs   MFBs
* L+X is an extended type that is dispatched to the I-unit.
The template field can end the instruction group either at the end of the bundle or in the middle of the bundle.
Instruction set
A basic IA-64 instruction has the following syntax:
[qp] mnemonic[.comp] dest=srcs
Where:
qp        Specifies a qualifying predicate register. The value of the qualifying predicate determines whether the results of the instruction are committed in hardware or discarded. When the value of the predicate register is true (1), the instruction executes, its results are committed, and any exceptions that occur are handled as usual. When the value is false (0), the results are not committed and no exceptions are raised. Most IA-64 instructions can be accompanied by a qualifying predicate.
mnemonic  Specifies a name that uniquely identifies an IA-64 instruction.
comp      Specifies one or more instruction completers. Completers indicate optional variations on a base instruction mnemonic. Completers follow the mnemonic and are separated by periods.
dest      Represents the destination operand(s), typically the result value(s) produced by the instruction.
srcs      Represents the source operands. Most IA-64 instructions have at least two input source operands.
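As an illustrative sketch, the (hypothetical) instruction below shows each syntax element in place:

    (p4) fma.s1 f6 = f7, f8, f9   // qp = (p4), mnemonic = fma, completer = .s1,
                                  //   dest = f6, srcs = f7, f8, f9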
Branch instructions
All instructions beginning with "br." are branches. The IA-64 architecture provides three branch types:
• Relative direct branches, using a 21-bit displacement that is added to the instruction pointer of the bundle containing the branch.
• Long branches, which go to an explicit address by using a 60-bit displacement from the current instruction pointer.
• Indirect branches, using 64-bit addresses held in the branch registers.
IA-64 allows multiple branches to be evaluated in parallel; the first branch whose predicate is true is taken.
Extended mnemonics are defined by the assembler to cover most combinations: br.cond, br.call, br.ia, br.ret, br.cloop, br.ctop, br.cexit.
Branch prediction hints can be provided either as hint completers on a branch instruction or with separate branch predict (brp) instructions.
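A brief sketch of these extended forms (the label L1 and the target func are placeholders):

    (p6) br.cond.dptk.many L1         // conditional branch, taken when p6 is true
         br.call.sptk.many b0 = func  // call: the return address is saved in b0
         br.ret.sptk.many b0          // return through branch register b0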
IA-64 Registers
Registers
IA-64 provides several register files that are visible to the programmer:
• 128 General registers
• 128 Floating-point registers
• 64 Predicate registers
• 8 Branch registers
• 128 Application registers
• Instruction Pointer (IP) register
Registers are referred to by a mnemonic denoting the register type and a
number. For example, general register 32 is named r32.
[Figure: the IA-64 register files. General registers gr0-gr127, application registers ar0-ar127, branch registers br0-br7, and the instruction pointer are 64 bits wide; floating-point registers fr0-fr127 are 82 bits wide; predicate registers pr0-pr63 are 1 bit wide.]
General registers

IA-64 provides 128 64-bit general-purpose registers (gr0-gr127) for all integer and multimedia computation.
• Register gr0 is a read-only register and is always zero (0).
• 32 registers (gr0-gr31) are static and global to the process.
• 96 registers (gr32-gr127) are stacked. These registers are used for argument passing and the local register stack frame. A portion of these registers can also be used for software pipelining.
Each register has an associated NaT bit, indicating whether the value stored in the register is valid.
Floating-point registers

IA-64 provides 128 82-bit floating-point registers (fr0-fr127) for floating-point computations. All floating-point registers are globally accessible within the process. There are:
• 32 static floating-point registers (fr0-fr31)
• 96 rotating floating-point registers (fr32-fr127), for software pipelining
The first two registers are read-only: fr0 is read as +0.0 and fr1 is read as +1.0.
Each register contains three fields:
• a 64-bit significand field
• a 17-bit exponent field
• a 1-bit sign field
Predicate registers
64 one-bit predicate registers control the execution of instructions. When the value of a predicate register is true (1), the instruction is executed. The predicate registers enable:
• validating or invalidating instructions
• eliminating branches in if/then/else logic blocks
There are:
• 16 static predicate registers (pr0-pr15)
• 48 rotating predicate registers (pr16-pr63), for controlling software pipelining
Instructions that are not explicitly preceded by a predicate default to the first predicate register, pr0, which is read-only and always true (1).
Whenever a program encounters a branch condition, such as an if-then-else construct, the outcome of the condition determines which path is executed. Branch prediction has been the traditional solution: the processor tries to predict which path will be taken and executes that path in advance. If the prediction is wrong, a performance penalty is paid, because the predicted path must be discarded and the other path executed instead.
IA-64 can execute both paths in parallel, using the predicate registers to discard the results of the path that is not taken. In this way the processor executes all paths without the misprediction penalty.
Branch registers

Eight 64-bit branch registers (br0-br7) are used to specify the branch target addresses for indirect branches. The branch registers streamline call/return branching.
IA-64 improves branch handling by:
• providing the means to minimize branches in the code through the use of qualifying predicates
• providing support for special branch instructions
A qualifying predicate is a predicate register indicating whether or not the instruction is executed. When the value of the register is true (1), the instruction is executed. When the value of the register is false (0), the instruction is executed as a NOP. Instructions that are not explicitly preceded by a predicate assume the first predicate register, p0, which is always true.
Predication enables you to convert a control dependency into a data dependency, thus eliminating branches in the code. An instruction is control dependent if it depends on a branch instruction to execute. Instructions are data dependent if the first produces a result that is used by the second, or if the second is data dependent on the first through a third instruction. Dependent instructions cannot be executed in parallel, and you cannot change their execution sequence.
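As a sketch (registers chosen arbitrarily), predication replaces the two arms of an if/then/else with predicated instructions and no branch:

    cmp.eq p6, p7 = r28, r29 ;;   // p6 = (r28 == r29), p7 = its complement
    (p6) add r8 = r1, r2          // "then" arm, committed only if p6 is true
    (p7) sub r8 = r1, r2          // "else" arm, committed only if p7 is true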
Application registers

128 special-purpose registers (ar0-ar127) are used for various functions. Some of the more commonly used application registers have assembler aliases. For example, ar66 is used as the Epilogue Counter (EC) and is called ar.ec.
Commonly used application registers:
    ar0-ar7    KR0-KR7
    ar16       RSC
    ar17       BSP
    ar18       BSPSTORE
    ar19       RNAT
    ar32       CCV
    ar36       UNAT
    ar40       FPSR
    ar44       ITC
    ar64       PFS
    ar65       LC
    ar66       EC
Instruction pointer (IP)

The 64-bit instruction pointer holds the address of the bundle containing the currently executing instruction. The IP cannot be directly read or written; it increments as instructions are executed, and branch instructions set it to a new value. The IP is always 16-byte aligned.
Register validity

Speculative memory access creates a need to delay exception handling. This is enabled by propagating exception conditions.
Each general register has a corresponding NaT (Not a Thing) bit. The NaT bits enable propagating the validity or invalidity of a speculative load result.
Floating-point registers use a special instance of pseudo-zero, called NaTVal. NaTVal is a floating-point register value used to propagate valid/invalid results of speculative loads of floating-point data.
When data has to travel from memory to the processor, there is always a delay; this is called memory latency. In an attempt to hide this time, the processor tries to read the memory in advance. If data has been read in advance and other data is then written to that exact location, the data already read becomes invalid.
IA-64 Operations

Software pipelining loops
Loop performance is traditionally improved through software techniques.
However, these techniques entail significant additional code:
• Loop unrolling requires multiple copies of the original loop in the
unrolled loop. The loop instructions are replicated and the end code
adjusted to eliminate the branch.
• Software pipelining requires adding prolog code to fill the execution pipe and epilog code to drain it. Software pipelining is a method that enables the processor to execute, at any given time, several instructions in various stages of the loop.
IA-64 provides hardware support for software pipelining loops, eliminating
the need for additional prolog and epilog code through the use of:
• special branch instructions
• Loop count (LC) and epilogue count (EC) application registers
• rotating registers
Rotating registers are registers that are rotated by one register position on each loop iteration. The logical names of the registers rotate in a wrap-around fashion, so that logical register X becomes logical register X+1 after one rotation. The predicate, floating-point, and general registers can be rotated.
IA-64 provides support for special branch instructions. One example is the br.cloop instruction, used for simple counted loops. The cloop branch uses the LC application register, not a qualifying predicate, to determine the branch condition: it checks whether the LC register is zero. If it is not, it decrements LC and the branch is taken. After the last iteration LC is zero and the branch is not taken, avoiding a branch misprediction.
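A minimal counted-loop sketch using br.cloop (the loop body is elided):

    mov ar.lc = 99              // LC = iteration count - 1 (100 iterations)
    L1:
        // ...loop body...
        br.cloop.sptk.few L1 ;; // if LC != 0: decrement LC and branch to L1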
Reduced memory access costs
As current processors increase in speed and parallelism, more scheduling opportunities are lost while memory is accessed.
IA-64 allows you to eliminate many memory accesses through the use of large register files to manage work in progress, and by allowing better control of the memory hierarchy.
Furthermore, the cost of the remaining memory accesses is dramatically reduced by moving load instructions earlier in the code, thereby hiding memory latency: the time between the issue of a load instruction and the moment the result of that instruction can be used. This enables the processor to bring the data in on time and avoid stalling. Memory latency is hidden through the use of:
• Data speculation - the execution of an operation before its data dependency is resolved.
• Control speculation - the execution of an instruction before its control dependency is resolved.
[Figure: hiding memory latency. The load (ld) is moved early, ahead of its dependency; a validity check remains at the original position in the code.]
The large number of registers in IA-64 enables multiple computations to be performed without having to store temporary data in memory. This reduces the number of memory accesses.
Memory access is supported through the load (ld) and store (st) instructions. All other integer, floating-point, and branch instructions use registers as operands.
IA-64 enables you to hide the memory latency of the remaining load instructions by placing speculative loads before code barriers, minimizing the stall caused by memory latency and creating more opportunities for parallelism. When you use speculative loads, error/exception detection is deferred until the final result is actually required:
• If no error/exception is detected, the latency is hidden.
• If an error/exception is detected, memory accesses and dependent instructions must be redone by an exception handler.
IA-64 provides an advanced load instruction (ld.a) that allows you to move potentially data-dependent loads earlier in the code. To verify the data speculation, a check load instruction (ld.c) must be placed at the location of the original load instruction. If the contents of the memory address have not changed since the advanced load, the speculation succeeded and the memory latency is hidden. If the contents of the memory address have been changed by a store instruction, the ld.c instruction repeats the load. Data speculation does not defer exceptions; page faults, for example, are taken immediately.
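A data-speculation sketch, assuming the addresses in r8 and r9 may alias:

    ld8.a r6 = [r8]        // advanced load, hoisted above the store
    st8   [r9] = r7        // store that may alias [r8]
    ld8.c.clr r6 = [r8]    // check load: repeats the load if the store hit [r8]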
IA-64 also provides a control-speculative load instruction (ld.s), which executes the load while speculating on the outcome of the governing branch. Control-speculative loads are also referred to simply as speculative loads. To verify the load, a check instruction (chk.s) is placed at the location of the original load. IA-64 uses the NaT bit (or NaTVal) to track the success of the load. If the NaT bit/NaTVal indicates a deferred exception, the chk.s instruction jumps to correction code that repeats all dependent instructions; the correction code is generated by the compiler or assembly writer. If the load is successful, the speculation succeeded and the memory latency is hidden.
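A control-speculation sketch; recover is a hypothetical recovery label:

         ld8.s r5 = [r6]           // speculative load, hoisted above the branch
         // ...
    (p6) br.cond.dptk.many L2 ;;   // governing branch: r5 is not needed if taken
         chk.s r5, recover         // at the original load site: jump to the
                                   //   recovery code if the NaT bit is set
    L2: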
There is also a combined speculative load (ld.sa), which enables placing a load before both a control and a data barrier. Use this type of speculative load to advance a load around a procedure call. To verify the speculation, a special check instruction (chk.a) is placed at the location of the original load instruction. If the load is successful, the speculation succeeded and the memory latency is hidden. If an exception was generated, or the data was invalidated, the chk.a instruction jumps to correction code that repeats all dependent instructions; the correction code is generated by the compiler or assembly writer.
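A combined-speculation sketch, advancing a load around a hypothetical procedure call:

    ld8.sa r6 = [r7]               // speculative advanced load
    br.call.sptk.many b0 = helper  // call that may store to [r7] or fault
    chk.a.clr r6, recover          // jump to the recovery code on failure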
Procedure calls

The traditional use of a procedure stack in memory for procedure call management incurs a large overhead. IA-64 uses the general register stack for procedure call management, thus eliminating the frequent memory accesses. The general register stack consists of the 96 general registers starting at r32, used to pass parameters to the called procedure and to store local variables for the currently executing procedure. The register stack structure allows:
• the caller procedure to pass parameters through registers to the called procedure
• dynamic allocation of local registers for the currently executing procedure
• allocation of a maximum of 96 logical registers for each function
[Figure: procedure calls on IA-32 versus IA-64.
IA-32: Procedure A calls B; Procedure B saves the current register state, executes, restores the previous register state, and returns.
IA-64: Procedure A calls B; Procedure B issues alloc (no save!), executes, and returns (no restore!).]
The general register stack is divided into two subsets:
• Static: the first 32 physical registers (r0-r31) are permanent registers, visible to all procedures, in which global variables are placed.
• Stacked: the other 96 physical registers (r32-r127) behave like a stack. The procedure code allocates up to 96 input and output registers for a procedure frame. An integral mechanism ensures that a stack overflow or underflow never occurs.
As each procedure frame is allocated, the previous frame is hidden, and the first register in the new frame is renamed logical register r32. Using small register frames eliminates or reduces the need to save and restore registers to and from memory when allocating a new register stack frame.
When a procedure call is executed, the called procedure receives a procedure frame which contains the output registers of the caller as its input. The called procedure can resize the frame to include its own input, local, and output areas, using the alloc instruction. For each subsequent call this sequence is repeated, and a new procedure frame is created. When the procedure returns, the processor unwinds the register stack: the current frame is released and the previous procedure's frame is restored.
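A sketch of a procedure prologue and return using alloc; the frame sizes (2 input, 3 local, 1 output, 0 rotating) are arbitrary:

    func:
        alloc r34 = ar.pfs, 2, 3, 1, 0  // inputs r32-r33, locals r34-r36,
                                        //   output r37; caller's state in r34
        // ...procedure body...
        mov ar.pfs = r34                // restore the caller's frame marker
        br.ret.sptk.many b0             // return to the caller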
Register stack engine

IA-64 uses a hardware mechanism called the Register Stack Engine (RSE), which operates transparently in the background to ensure that an overflow does not occur and that the contents of the registers are always available. The RSE is not visible to the software.
Using a register stack reduces the need to perform memory saves. However, when a procedure tries to use more physical registers than remain on the stack, a register stack overflow could occur. When the stack fills up, the RSE saves logical registers to memory, thus freeing them. The stored registers are restored in the same way when necessary.
Through this mechanism, the RSE offers an unlimited number of physical registers for allocation.
[Figure: the RSE spills stacked registers (gr32-gr127) to memory and restores them as needed; the global registers (gr0-gr31) are not affected.]
Floating point and multimedia
IA-64 provides high floating-point performance, with full IEEE floating-point support for single, double, and double-extended formats.
Special support is also provided for multimedia, or data-parallel, applications:
• integer data and SIMD computations, similar to the MMX(tm) technology
• floating-point data and SIMD-FP computations, similar to the IA-32 Streaming SIMD Extensions
These floating-point features help improve IA-64 floating-point performance:
• 128 floating-point registers.
• A multiply-and-accumulate instruction (fma), with four different floating-point registers for operands (f = a * b + c); see the sketch after this list. This instruction performs a multiply and an add in the same number of cycles as a single add or multiply instruction.
• Load and store to and from memory. You can also load from memory into two floating-point registers.
• Data transfer between floating-point and general registers.
• Multiple status fields in the status register, enabling speculation on floating-point operations.
• Quick conversion from integer to floating-point and vice versa.
• Rotating floating-point registers.
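A one-line sketch of the multiply-and-accumulate instruction:

    fma f6 = f7, f8, f9    // f6 = f7 * f8 + f9 in a single instruction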
Integer multimedia support is provided by defining a set of instructions that treat the general registers as 8x8-, 4x16-, or 2x32-bit elements, and by providing specific instructions for operating on these data elements. IA-64 multimedia support is semantically compatible with the MMX(tm) technology. Three major types of instructions are provided:
• Addition and subtraction (including 3 forms of saturating arithmetic)
• Multiplication
• Left shift, signed and unsigned right shift
• Pack and unpack, to convert between different element sizes
[Figure: a parallel add treats two general registers as four elements a3..a0 and b3..b0 and produces the element-wise sums a3+b3, a2+b2, a1+b1, a0+b0.]
Floating-point multimedia support is provided through a set of instructions that treat the floating-point registers as 2x32-bit elements.
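Referring back to the parallel-add figure above, a sketch treating each general register as four 16-bit elements:

    padd2 r3 = r4, r5    // four independent 16-bit adds:
                         //   r3[i] = r4[i] + r5[i], for i = 0..3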
IA-64 provides 128 82-bit floating-point registers; the floating-point data type, however, is 80 bits. Intermediate computation values can contain 82 bits. This enables software divide and square-root computations comparable to hardware functions, while taking advantage of wide machines. These fast software divides and square roots produce valid 80-bit IEEE values.
Floating-point register layout: bit 81 is the sign, bits 80-64 hold the exponent, and bits 63-0 hold the significand.
For floating-point multimedia operations, the floating-point register is divided as follows: the significand holds two single-precision FP values, one in bits 63-32 and one in bits 31-0, alongside the exponent field.
IA-64 provides four separate status fields (sf0-sf3), enabling four different computational environments. Each status field contains dynamic control and status information for floating-point operations.
The FPSR contains the four status fields and a traps field that enables masking the IEEE exception events and denormal operand exceptions. This register also includes 6 reserved bits, which must be 0.
Floating-point status register layout (64 bits): a 6-bit traps field, four 13-bit status fields (sf0-sf3), and 6 reserved bits.
Multimedia instructions
Multimedia instructions treat the general registers as concatenations of
eight 8-bit, four 16-bit, or two 32-bit elements. They operate on each
element independently and in parallel. The elements are always aligned on
their natural boundaries within a general register. Most multimedia
instructions are defined to operate on multiple element sizes. Three
classes of multimedia instructions are defined: arithmetic, shift and data
arrangement.
Processor Abstraction Layer (PAL)
IA-64 firmware consists of three major components:
• Processor Abstraction Layer (PAL)
• System Abstraction Layer (SAL)
• Extensible Firmware Interface (EFI) layer
PAL provides a consistent firmware interface that abstracts processor implementation-specific features. The System Abstraction Layer (SAL) is the firmware layer that isolates the operating system and other higher-level software from implementation differences in the platform, while PAL is the firmware layer that abstracts the processor implementation.
[Figure: the IA-64 firmware model. The operating system software sits above the EFI, SAL, and PAL layers. EFI handles OS boot selection and boot handoff and offers EFI procedure calls; SAL provides access to platform resources, SAL procedure calls, transfers to OS entry points for hardware events, and handling of non-performance-critical hardware events (e.g., reset, machine checks); PAL sits directly above the processor hardware, offers PAL procedure calls, and receives transfers to SAL entry points. Performance-critical hardware events (e.g., interrupts) are delivered to the operating system.]
Interrupts
Interrupts are events that occur during IA-32 or IA-64 instruction processing, causing control flow to pass to an interrupt handling routine. In the process, certain processor state is saved automatically by the processor. Upon completion of interrupt processing, a return from interrupt (rfi) is executed, which restores the saved processor state. Execution then proceeds with the interrupted IA-32 or IA-64 instruction.
From the viewpoint of response to interrupts, the processor behaves as if it
were not pipelined. That is, it behaves as if a single IA-64 instruction (along
with its template) is fetched and then executed; or as if a single IA-32
instruction is fetched and then executed. Any interrupt conditions raised by
the execution of an instruction are handled at execution time, in sequential
instruction order. If there are no interrupts, the next IA-64 instruction and its
template, or the next IA-32 instruction, are fetched.
Interrupt definitions

Depending on how an interrupt is serviced, interrupts are divided into IVA-based interrupts and PAL-based interrupts.
• IVA-based interrupts are serviced by the operating system. They are vectored through the Interrupt Vector Table (IVT) pointed to by CR2, the IVA control register.
• PAL-based interrupts are serviced by PAL firmware, system firmware, and possibly the operating system. They are vectored through a set of hardware entry points directly into PAL firmware.
Interrupts are divided into four types: Aborts, Interrupts, Faults, and Traps.
Aborts

The processor has detected a machine check (an internal malfunction) or a processor reset. Aborts can be either synchronous or asynchronous with respect to the instruction stream. An abort may cause the processor to suspend the instruction stream at an unpredictable location, with partially updated register or memory state. Aborts are PAL-based interrupts.
Machine Checks (MCA)
A processor has detected a hardware error which requires immediate action. Based on the type and severity of the error, the processor may be able to recover and continue execution. The PALE_CHECK entry point is entered to attempt to correct the error.
Processor Reset (RESET)
A processor has been powered-on or a reset request has been sent to it.
The PALE_RESET entry point is entered to perform processor and system
self-test and initialization.
External device interrupts
An external or independent entity (e.g. an I/O device, a timer event, or
another processor) requires attention. Interrupts are asynchronous with
respect to the instruction stream. All previous IA-32 and IA-64 instructions
appear to have completed. The current and subsequent instructions have
no effect on machine state. Interrupts are divided into Initialization
interrupts, Platform Management interrupts, and External interrupts.
Initialization and Platform Management interrupts are PAL-based
interrupts; external interrupts are IVA-based interrupts.
Initialization Interrupts (INIT)
A processor has received an initialization request. The PALE_INIT entry
point is entered and the processor is placed in a known state.
Platform Management Interrupts (PMI)
A platform management request to perform functions such as platform
error handling, memory scrubbing, or power management has been
received by a processor. The PALE_PMI entry point is entered to service
the request. Program execution may be resumed at the point of interrupt.
PMIs are distinguished by unique vector numbers. Vectors 0 through 3 are
available for platform firmware use and are present on every processor
model. Vectors 4 and above are reserved for processor firmware use. The
size of the vector space is model specific.
External Interrupts (INT)
A processor has received a request to perform a service on behalf of the
operating system. Typically these requests come from I/O devices,
although the requests could come from any processor in the system
including itself. The External Interrupt vector is entered to handle the
request. External Interrupts are distinguished by unique vector numbers in
the range 0, 2, and 16 through 255. These vector numbers are used to
prioritize external interrupts. Two special cases of External Interrupts are
Non-Maskable Interrupts and External Controller Interrupts.
Non-Maskable Interrupts (NMI)
Non-Maskable Interrupts are used to request critical operating system
services. NMIs are assigned external interrupt vector number 2.
External Controller Interrupts (ExtINT)
External Controller Interrupts are used to service Intel 8259A-compatible
external interrupt controllers. ExtINTs are assigned locally within the
processor to external interrupt vector number 0.
Faults

The current IA-64 or IA-32 instruction requests an action that cannot or should not be carried out, or system intervention is required before the instruction is executed. Faults are synchronous with respect to the instruction stream. The processor completes state changes that have occurred in instructions prior to the faulting instruction. The faulting and subsequent instructions have no effect on machine state. Faults are IVA-based interrupts.
Traps
The IA-32 or IA-64 instruction just executed requires system intervention.
Traps are synchronous with respect to the instruction stream. The trapping
instruction and all previous instructions are completed. Subsequent
instructions have no effect on machine state. Traps are IVA-based
interrupts.
[Figure: interrupt classification. PAL-based interrupts: Aborts (RESET, MCA), Initialization interrupts (INIT), and Platform Management interrupts (PMI). IVA-based interrupts: External interrupts (INT: NMI, ExtINT, ...), Faults, and Traps.]
Interrupt programming model
When an interrupt event occurs, hardware saves the minimum processor
state required to enable software to resolve the event and continue. The
state saved by hardware is held in a set of interrupt resources, and
together with the interrupt vector gives software enough information to
either resolve the cause of the interrupt, or surface the event to a higher
level of the operating system. Software has complete control over the
structure of the information communicated, and the conventions between
the low-level handlers and the high-level code. Such a scheme allows
software rather than hardware to dictate how to best optimize performance
for each of the interrupts in its environment. The same basic mechanisms
are used in all interrupts to support efficient IA-64 low-level fault handlers
for events such as a TLB fault, speculation fault, or a key miss fault.
On an interrupt, the state of the processor is saved to allow an IA-64
software handler to resolve the interrupt with minimal bookkeeping or
overhead. The banked general registers provide an immediate set of
scratch registers to begin work. For low-level handlers (e.g. TLB miss)
software need not open up register space by spilling registers to either
memory or control registers.
Upon an interrupt, delivery of asynchronous events such as external interrupts is disabled automatically by hardware, to allow IA-64 software either to handle the interrupt immediately or to safely unload the interrupt resources and save them to memory. Software will either deal with the cause of the interrupt and rfi back to the point of the interrupt, or establish a new environment and spill processor state to memory to prepare for a call to higher-level code. Once enough state has been saved (such as the IIP, IPSR, and the interrupt resources needed to resolve the fault), the low-level code can re-enable interrupts by restoring the PSR.ic bit and then the PSR.i bit. Since there is only one set of interrupt resources, software must save any interrupt resource state the operating system may require before unmasking interrupts or performing an operation that may raise a synchronous interrupt (such as a memory reference that may cause a TLB miss).
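A minimal sketch of the re-enabling sequence just described, assuming the interrupt resources have already been saved (serialization requirements are simplified here):

    ssm psr.ic ;;   // restore interruption state collection
    srlz.d          // serialize so the new PSR.ic value is honored
    ssm psr.i       // then unmask external interrupts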
PSR.ic: interrupt state collection bit
The PSR.ic (interrupt state collection) bit supports an efficient nested
interrupt model. Under normal circumstances the PSR.ic bit is enabled.
When an interrupt event occurs, the various interrupt resources are
overwritten with information pertaining to the current event. Prior to saving
the current set of interrupt resources, it is often advantageous in a miss
handler to perform a virtual reference to an area which may not have a
translation. To prevent the current set of resources from being overwritten
on a nested fault, the PSR.ic bit is cleared on any interrupt. This will
suppress the writing of critical interrupt resources if another interrupt
occurs while the PSR.ic bit is cleared. If a data TLB miss occurs while the
PSR.ic bit is zero, then hardware will vector to the Data Nested TLB fault
handler.
Unit 3. Power Hardware Overview
Objectives

The objectives for this lesson are:
• Provide an overview of the e-server p-series systems and their processors.
• List the registers available to the program and describe their internal use.
Power Hardware Overview

e-server p-series or RS/6000 introduction
This section introduces the RS/6000, giving a brief history of the products, an overview of the RS/6000 design, and a description of key RS/6000 technologies.
The RS/6000 family combines the benefits of UNIX computing with IBM's leading-edge RISC technology in a broad product line: from powerful desktop workstations ideal for mechanical design, to workgroup servers for departments and small businesses, to enterprise servers for medium to large companies running ERP and server consolidation applications, up to massively parallel RS/6000 SP systems that can handle demanding scientific and technical computing, business intelligence, and Web serving tasks. Along with AIX, IBM's award-winning UNIX operating system, and HACMP, the leading high-availability clustering solution, the RS/6000 platform provides the power to create change and the flexibility to manage it, with a wide variety of applications that provide real value.
RS/6000 history
The first RS/6000 was announced February 1990 and shipped June
1990. Since then, over 1,100,000 systems have shipped to over 132,000
customers.
The next figure summarizes the history of the RS/6000 product line,
classified by machine type. For each machine type, the I/O bus
architecture and range of processor clock speeds are indicated. The
figure shows the following:
• In the past, RS/6000 I/O buses were based on the Micro Channel
Architecture (MCA). Today, RS/6000 I/O buses are based on the
industry-standard Peripheral Component Interface (PCI) Architecture.
• Processor speed, one key element of RS/6000 system performance,
has increased dramatically over time.
• There have been many machine types over the entire RS/6000
history. In recent years, there has been considerable effort to reduce
the complexity of the model offerings without creating gaps in the
market coverage.
Continued on next page
RS/6000 machine types, 1990-2000:
    7011 (33 to 80 MHz): Micro Channel Workstations
    7248 (100 to 133 MHz): PCI Workstations
    7006 (80 to 120 MHz): Micro Channel Entry Desktops
    7009 (80 to 120 MHz): Micro Channel Compact Servers
    7013 (20 to 200 MHz): Micro Channel Deskside Systems
    7012 (20 to 200 MHz): Micro Channel Desktop Systems
    7015 (25 to 200 MHz): Micro Channel Rack Systems
    7024 (100 to 233 MHz): PCI Deskside Systems
    7025 (166 to 500 MHz): PCI Workgroup Servers - Deskside Systems
    7043 (166 to 375 MHz): PCI Workstations and Workgroup Servers
    7044 (333 to 400 MHz): PCI Workstations and Workgroup Servers
    7046 (375 MHz): PCI Workgroup Servers - Rack Systems
    7026 (166 to 500 MHz): PCI Workgroup Servers - Rack Systems
    7017 (125 to 450 MHz): PCI Enterprise Servers
    SP1, SP2, SP: all node types
RISC CPU (320, 520)

The RISC CPU was the first CPU for the RS/6000 series of systems. The CPU consists of four chips and runs at 33 MHz; it had outstanding floating-point performance at the time. The CPU was used in the 7012 and 7013 systems, models 320-380 and 520-580.
RISC II CPU (390, 590)

The RISC II has enhanced features over the first RISC design and runs at up to 200 MHz. The CPU was used in the 7012 and 7013 systems, models 390 and 590.
PowerPC and POWER2 CPU family

PowerPC CPUs started as a joint effort between Motorola, Apple, and IBM. The family consists of the PPC601, PPC604, and PPC604e. These CPUs are very close to those produced by Motorola and used in Apple systems; currently the PPC604e CPU is used in the F50, B50, and 43P models.
POWER3 and POWER3-II CPUs
The POWER3 microprocessor introduces a new generation of 64-bit processors especially designed for high-performance and visual computing applications. POWER3 processors replace the POWER2 and the POWER2 Super Chips (P2SC) in high-end RS/6000 workstations and SP nodes. The RS/6000 44P 7044 Model 270 workstation features the POWER3-II microprocessor, as do the POWER3-II based SP nodes.
The POWER3 implementation of the PowerPC architecture provides significant enhancements compared to the POWER2 architecture. The SMP-capable POWER3 design allows for concurrent operation of fixed-point instructions, load/store instructions, branch instructions, and floating-point instructions. Compared to the P2SC, which reaches its design limits at a clock frequency of 160 MHz, POWER3 is targeting up to 600 MHz by exploiting more advanced chip manufacturing processes, such as copper technology. The first POWER3-based system, the RS/6000 43P 7043 Model 260, runs at 200 MHz, as do the POWER3 wide and thin nodes for the SP.
Features of the POWER3 exceeding its predecessor (P2SC) include:
• A second load/store unit
• Improved memory access speed
• Speculative execution
[Figure: POWER3 block diagram. A branch/dispatch unit feeds two floating-point units (FPU1, FPU2), three fixed-point units (FXU1-FXU3), and two load/store units (LS1, LS2). Register buffers for register renaming: 24 FP, 16 integer. Branch history table: 2048 entries; branch target cache: 256 entries. 32 KB, 128-way instruction cache and 64 KB, 128-way data cache, each with its own memory management unit. CPU registers: 32 x 64-bit integer (fixed point) and 32 x 64-bit FP (floating point). A bus interface unit (BIU) with L2 control and clock connects a direct-mapped L2 cache of 1-16 MB (32-byte path) and the 6XX bus (16 bytes @ 100 MHz = 1.6 GB/s; 32 bytes @ 200 MHz = 6.4 GB/s).]
RS64 and RS64 II CPUs
The RS64 microprocessor, based on the PowerPC Architecture, was
designed for leading-edge performance in OLTP, e-business, BI, server
consolidation, SAP, Notesbench, and Web serving for the commercial
and server markets. It is the basis for at least four generations of
RS/6000 and AS/400 enterprise server offerings.
The RS64 processor focuses on commercial performance, with emphasis on conditional branches (zero or one cycle incorrect-branch-predict penalty). It contains 64 KB L1 instruction and data caches, one-cycle load support, four superscalar fixed-point pipelines, and one floating-point pipeline. An on-board bus interface unit (BIU) controls both the 32 MB L2 bus interface and the memory bus interface.
RS64 and RS64 II are defined by the following specifications:
• 125 MHz RS64/262 MHz RS64 II on the RS/6000 Model S70
• 262 MHz RS64 II on the RS/6000 Model S70 Advanced
• 340 MHz RS64 II on the RS/6000 Model H70
• 64 KB on-chip, L1 instruction cache
• 64 KB on-chip four-way set associative data cache
• 32 MB L2 cache
• Superscalar design with integrated integer, floating-point, and branch
units
• Support for up to 64-way SMP configurations (currently 12-way)
• 128-bit data bus
• 64-bit real memory addressing
• Real memory support for up to one terabyte (2^40 bytes)
• CMOS 6S2 using a 162 mm^2 die, 12.5 million transistors
[Figure: RS64 block diagram. A branch/dispatch unit feeds a simple fixed-point unit, a simple/complex fixed-point unit, a floating-point unit, and a load/store unit. The instruction cache and data cache each have their own memory management unit. A bus interface unit (BIU) with L2 control and clock connects an L2 cache of 1-32 MB (32-byte path) and the 6XX bus (16 bytes).]
RS64 III
The RS64 III processor is designed for applications that place heavy demands on system memory. The RS64 III architecture addresses both the need for very large working sets and the need for low latency. Latency is measured by the number of CPU cycles that elapse before requested data or instructions can be used by the processor.
The RS64 III processors combine IBM advanced copper chip technology
with a redesign of critical timing paths on the chip to achieve greater
throughput. The L1 instruction and data caches have been doubled to
128 KB each. New circuit design techniques were used to maintain the
one cycle load-to-use latency for the L1 data cache.
L2 cache performance on the RS64 III processor has been significantly
improved. Each processor has an on-chip L2 cache controller and an
on-chip directory of L2 cache contents. The cache is four-way set
associative. This means that directory information for all four sets is
accessed in parallel. Greater associativity results in more cache hits and
lower latency, which improves commercial performance.
Using a technique called Double Data Rate (DDR), the new 8 MB Static
SRAM used for L2 is capable of transferring data twice during each clock
cycle. The L2 interface is 32 bytes wide and runs at 225 MHz (half
processor speed), but, because of the use of DDR, it provides 14.4 GBps
of throughput.
In summary, the RS64 III features include:
• 128 KB on-chip L1 instruction cache
• 128 KB on-chip L1 data cache with one cycle load-to-use latency
• On-chip L2 cache directory that supports up to 8 MB of off-chip L2
SRAM memory
• 14.4 GBps L2 cache bandwidth
• 32 byte on-chip data buses
• 4-way superscalar design
• Five stage deep pipeline
• The Model S80 uses the 450 MHz RS64 III 64-bit copper-chip
technology
• The Model M80 uses the 500 MHz RS64 III 64-bit copper-chip
technology
• The Model F80 and the H80 use 450 or 500 MHz RS64 III 64-bit copper-chip technology
POWER4 or Gigaprocessor (Copper SOI) CPU
POWER4 is a new processor initiative from IBM. It comprises two 64-bit, 1 GHz, five-issue superscalar cores with a three-level cache hierarchy. It has a 10 GBps main memory interface and a 45 GBps multiprocessor interface. IBM is using its 0.18-micron copper silicon-on-insulator technology in its manufacture. The targeted market is the enterprise server, or servers in e-business. It is currently in the design stage.
System bus information
All current systems in the RS/6000 family are equipped with PCI buses. The PCI architecture provides an industry-standard specification and protocol that allows multiple adapters access to system resources through a set of adapter slots.
Each PCI bus has a limit on the number of slots (adapters) it can support, typically from two to six. To overcome this limit, the system design can implement multiple PCI buses. Two different methods can be used to add PCI buses in a system:
• Secondary PCI bus: the simplest method to add PCI slots when designing a system is to add a secondary PCI bus, bridged onto a primary bus using a PCI-to-PCI bridge chip.
• Multiple primary PCI buses: another method of providing more PCI slots is to design the system with two or more primary PCI buses. This design requires a more sophisticated I/O interface with the system memory.
Power CPU Overview

32-bit hardware characteristics
32-bit Power and PowerPC processors all have the following features in
common:
User registers
• 32 general-purpose integer registers, each 32 bits wide (GPRs)
• 32 floating-point registers, each 64 bits wide (FPRs)
• A 32-bit Condition Register (CR)
• A 32-bit Link Register (LR)
• A 32-bit Count Register (CTR)
System Registers
• 16 Segment Registers (SRs)
• A Machine State Register (MSR)
• A Data Address Register (DAR)
• Two Save and Restore Registers (SRRs)
• 4 special purpose (SPRG) registers (PowerPC only)
All instructions are 32 bits long. The Data Address Register contains the
memory address that caused the last memory-related exception.
SRRs are used to save information when an interrupt occurs:
• SRR0 points to the instruction that was running when the interrupt occurred
• SRR1 contains the contents of the MSR when the interrupt occurred
SPRGs are used for general operating system purposes requiring per-processor temporary storage. They provide fast state saves and support for multiprocessing environments.
General purpose registers

The General Purpose Registers (GPRs, often just called Rs) are used for loads, stores, and integer calculations. No memory-to-memory operations are provided; data must always go through registers.
Condition register

The condition register (CR) contains bits set by the results of compare instructions. It is treated as eight 4-bit registers. The bits are used to test for less-than, greater-than, equal, and overflow conditions.
Link register

The link register (LR) is set by some branch instructions. Its content points to the instruction to be executed immediately after the branch. It is typically used in subroutine calls to record where to return to.
Count register

The Count Register (CTR) has two uses:
• It can be decremented, tested, and used to decide whether to take a branch, all from one branch instruction
• It can contain the target address for a branch instruction
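A short sketch of branching through the two branch-target registers (r12 chosen arbitrarily):

    mtctr r12    # copy a target address from r12 into the CTR
    bctrl        # branch to the CTR target, saving the return address in LR
    # ...the called code eventually returns with:
    blr          # branch to the address in the LR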
Machine state register

The MSR controls many of the current operating characteristics of the processor, among them:
• Privilege level (Supervisor vs. Problem, or Kernel vs. User)
• Addressing mode (virtual vs. real)
• Interrupt enabling
• Little-endian vs. big-endian mode
Instruction set

A single instruction generally modifies only one register or one memory location. Exceptions to this are "multiple" and "update" operations.
The format of an instruction is:
• An opcode mnemonic
• An optional set of option bits
• 0, 1, 2, or 3 registers
• 0 or 1 memory locations, expressed as an offset added to or subtracted from a register
The first two may be combined into an "extended mnemonic". For example, the operand 24(r3) means the address in r3 + 24. General Purpose Registers are named "r0" through "r31".
Although most instructions are the same, the mnemonics for POWER and PowerPC are often different. POWER mnemonics are generally simpler and shorter, while PowerPC mnemonics are longer but more explicit. These differences exist because PowerPC was developed with 64-bit in mind.
Note: the actual opcodes generated by the assembler for these instructions are identical.
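A sketch of an "update" operation using the operand format just described:

    lwzu r5,24(r3)   # load the word at address r3+24 into r5, then
                     #   update r3 to r3+24 (the "u" option)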
Register to register operations
These operations always list at least two registers: the first is the target for the result of the instruction, and the others provide the input to the operation.
Immediate operations are shown as a register with an offset. "Immediate" means that a constant value is involved; the value is built right into the instruction.
Examples:
• or r3,r4,r5      # Logically ORs r4 and r5, result into r3
• addi r1,r1,0x48  # Adds 0x48 to r1, result into r1
Register to memory operations
Register-memory operations always have one register and one memory location; the register is always listed first.
The size of the memory location is specified in the opcode:
• b = byte (8 bits)
• h = halfword (16 bits)
• w = word (32 bits)
• d = doubleword (64 bits)
All opcodes beginning with "l" are loads and all opcodes beginning with "st" are stores.
Register to memory operation examples
Examples:
• lwz r31,4(r30)   # Loads 32 bits from address 4+r30 into r31. The high 32 bits are cleared on a 64-bit processor
• std r3,-8(r29)   # Stores 64 bits from r3 to address r29 - 8. Invalid operation on a 32-bit processor
• lbz r0,27(r1)    # Loads 8 bits from address 27+r1 into r0. The top 24/56 bits are cleared
• sth r3,0x56(r1)  # Stores the low 16 bits from r3 to address 0x56+r1
Notice that the load instructions also have a “z” in their mnemonics. The
“z” stands for “zero,” and is intended to make clear that these instructions
clear any bits in the target register that were not actually copied from
memory.
In case you were wondering, there are load instructions without the "z". lwa and lha are "algebraic" loads: the value being loaded is sign-extended to fill out the rest of the register. This is used when loading a signed value; if a halfword held a negative value, lhz would make it positive, but lha would preserve its sign.
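A sketch contrasting the two forms on a halfword holding 0xFFFE (-2):

    lhz r5,0(r4)   # r5 = 0x0000FFFE (zero-extended: 65534)
    lha r6,0(r4)   # r6 = 0xFFFFFFFE (sign-extended: -2)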
Compare instructions

There are four variations of compare instructions, all beginning with "cmp". They compare two values:
• Register and register, or
• Register and immediate value (i.e. a constant)
The result of the comparison is placed in the Condition Register (CR), where the various bits that can be set are:
• LT = less than
• GT = greater than
• EQ = equal
• OV = overflow (a.k.a. carry bit)
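A sketch of the two compare variations:

    cmpw  cr0,r3,r4    # compares registers r3 and r4, result into CR0
    cmpwi cr3,r5,0x12  # compares r5 with the immediate 0x12, result into CR3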
Branch instructions
All instructions beginning with a "b" are branches. They change the address of the next instruction to be run.
They have three addressing modes:
• Absolute - goes to an explicit address
• Relative - the target address is an offset from the current instruction address
• Register - only two registers can contain a branch target: the Count Register (CTR) and the Link Register (LR)
Branches can be conditional, depending on whether an option bit matches the specified bit in the CR. A branch instruction can specify which CR field to use; CR0 is assumed unless otherwise specified.
Extended mnemonics are defined by the assembler to cover most combinations.
The conditional branch instruction is central to any computer architecture. However, most architectures (including POWER and PowerPC) avoid putting comparisons directly into their branch instructions, to keep things simple. They provide compare instructions that set "condition bits"; these bits are what branch instructions use to make the actual decision.
The assembler (and crash’s disassembler) provides extended
mnemonics that combine a type of branch and the condition register bit
that determines whether the branch is taken. Another bit in the branch
opcode determines whether the CR bit must be on or off for the branch to
take place. This bit is also incorporated into the extended mnemonics
(the “not” versions of the branches). For maximum flexibility, the
assembler usually also allows you to specify the “not” cases as the
logically-opposite case. For example, bnl (branch not less than) can also
be written as bge (branch greater than or equal to). Either case is still
saying, “branch if the LT bit is turned off.”
Examples
• blt 0x38c00      # Branches to address 0x38c00 if LT bit is on in CR0
• bge cr3,<addr>   # Branches if LT bit is off in CR3
• bnelr cr7        # Branches to address in LR if EQ bit is off in CR7
• blea cr2,0x3600  # Branches to absolute address 0x3600 if GT bit is off in CR2
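To tie the compare and branch pieces together, the C fragment below (an illustration only, not from the course materials) shows the kind of source construct a compiler turns into a compare followed by a conditional branch; the instruction sequence in the comments is a plausible sketch, not actual compiler output.

/* How a compare/branch pair typically arises from C source. */
long clamp_negative(long a)
{
    if (a < 0)      /* cmpwi cr0,r3,0  - compare r3 against immediate 0     */
        return 0;   /* bge   cr0,done  - inverted test: taken when LT is off */
    return a;       /* "done": fall-through target when the branch is taken */
}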
Trap instructions
Most mnemonics beginning with a “t” are traps, and generate a program exception if the specified condition is met. There are two variations of the trap instruction:
• t or tw - compares two registers, traps if the specified comparison is true
• ti or twi - compares a register to an immediate value instead
“w” mnemonics are the PowerPC indication that these trap instructions work on 32-bit values. As with branches, there are extended mnemonics defined to provide the various traps; in this context “lt”, “gt”, “eq”, and so on have the same meaning as in branch mnemonics.
Examples
• tweq r3,r4    # Traps if r3 equals r4
• twnei r31,0   # Traps if r31 is not equal to 0
Trap instructions are the only instructions in this architecture that perform
a comparison and take some action, all in one instruction. They do not
set or use condition register bits.
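This single-instruction compare-and-act behavior suits assertion-style checks. A hedged C analogue (illustration only; in C the test and the action must be written out separately):

/* C analogue of a trap-style check. A single 'twnei r31,0' performs the
   comparison and the trap together; in C we need an explicit test. */
#include <stdlib.h>

static void assert_is_zero(long v)
{
    if (v != 0)     /* what 'twnei v,0' tests in hardware                       */
        abort();    /* program exception - compare vector 0x700 in the vector   */
}                   /* list later in this unit                                  */

int main(void)
{
    assert_is_zero(0);   /* passes silently */
    return 0;
}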
Special register operations
The Special Purpose Registers (SPRs) can only be copied to or from GPRs:
• mfspr r3,8   # Copies SPR 8 into r3
• mtspr 9,r3   # Copies r3 into SPR 9
Extended mnemonics are defined to cover common SPRs:
• mflr r3    # Copies the LR (SPR 8) into r3
• mtctr r3   # Copies r3 into the CTR (SPR 9)
Interrupt vectors
Interrupt vectors are addresses of short sections of code which save the state of the processor and then branch to a handler routine.
Some examples are:
• system reset - vector 0x100
• machine check - vector 0x200
• data storage interrupt (DSI) - vector 0x300
• instruction storage interrupt (ISI) - vector 0x400
• external interrupt - vector 0x500
• alignment - vector 0x600
• program (invalid instruction or trap instruction) - vector 0x700
• floating-point unavailable - vector 0x800
• decrementer - vector 0x900
• system call - vector 0xc00
There are some exceptions unique to each type of processor.
64-bit CPU Overview
64-bit hardware characteristics
With full hardware 32-bit binary compatibility as the baseline, the features that characterize a PowerPC processor as 64-bit include:
• 64-bit general registers
• 64-bit instructions for loading and storing 64-bit data operands, and for performing 64-bit arithmetic and logical operations
• two execution modes: 32-bit and 64-bit. Whereas 32-bit processors implicitly have only one mode of operation, 32-bit execution mode on a 64-bit processor causes instructions and addressing to behave the same as on a 32-bit processor. As a separate mode, 64-bit execution mode creates a true 64-bit environment, with 64-bit addressing and instruction behavior.
• 64-bit physical memory addressing facilities
• additional supervisor instructions, as needed to set up and control the execution mode. A key feature the PowerPC 64-bit architecture provides is execution mode at a per-process level, helping AIX to create, at the system level, a mixed environment of concurrent 32-bit and 64-bit processes.
A bit in the Machine State Register (MSR) controls 32-bit or 64-bit execution mode:
• Allows support for 32-bit processes on 64-bit hardware
• The kernel itself runs in 32-bit mode
• Portions of the VMM run in 64-bit mode on 64-bit hardware (to address the large tables needed to represent large virtual memory)
• 32-bit mode on 64-bit hardware looks exactly like 32-bit hardware (ensures binary compatibility for 32-bit applications)
• 32-bit instructions use only the bottom 32 bits of registers for data or addresses
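A minimal C model of the last point (an illustration only, not the architected definition): with the MSR 64-bit-mode bit off, addressing behaves as if only the low 32 bits of the computed address participate.

/* Model of effective-address calculation under the MSR execution-mode bit.
   Illustrative only; 'msr_64bit_mode' stands in for the architected MSR bit. */
#include <stdint.h>
#include <stdio.h>

static uint64_t effective_address(uint64_t base, int64_t disp, int msr_64bit_mode)
{
    uint64_t ea = base + disp;
    if (!msr_64bit_mode)
        ea &= 0xFFFFFFFFull;   /* 32-bit mode: only the bottom 32 bits are used */
    return ea;
}

int main(void)
{
    uint64_t base = 0x00000001FFFFFFF0ull;
    printf("64-bit mode EA: 0x%llx\n",
           (unsigned long long)effective_address(base, 0x20, 1));
    printf("32-bit mode EA: 0x%llx\n",
           (unsigned long long)effective_address(base, 0x20, 0));
    return 0;
}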
Segment table
The 64-bit virtual address space is represented with a segment table, which acts as an in-memory set-associative cache of the 256 most recently used segment number to segment ID mappings. The current segment table is pointed to by the 64-bit Address Space Register (ASR). The ASR has a valid bit to indicate whether a segment table is in use; in 32-bit mode on 64-bit processors this bit indicates that the segment table is not being used.
IBM “bridge extensions” to the PowerPC 64-bit architecture allow segment register operations to work in 32-bit mode, letting the kernel continue to manipulate segment registers. The “bridge extensions” are used to load and store “segment registers” instead.
A Segment Lookaside Buffer (SLB) is used to cache recently used segment number to segment ID mappings. This is similar to the Translation Lookaside Buffer (TLB) for page-to-frame translations. The SLB is similar to the segment table but smaller and faster (on chip, not in memory).
Unit 4. SMP Hardware Overview
Objectives
The objectives for this lesson are:
• list the three types of multiprocessor design
• describe what is meant by MP safe
SMP Hardware Overview
Symmetric multiprocessing
On uniprocessor systems, bottlenecks exist in the form of the address and data bus, which restricts transfers to one at a time, and the single program counter, which forces instructions to be executed in strict sequence. Some performance improvement was achieved by constantly improving the speeds of these uniprocessor machines.
With symmetric multiprocessing, more than one CPU works together. MP systems fall into several categories, depending on whether the CPUs share resources or have their own (memory, operating system, I/O channels, control units, files, and devices), how they are connected (in a single machine sharing a single bus, or in different machines), and whether all processors are functionally equal or some are specialized.
Types of Multiprocessors:
• Loosely-coupled MP
• Tightly-coupled MP
• Symmetric MP
Loosely coupled MP
Has different systems on a communication link, with the systems functioning independently and communicating when necessary. The separate systems can access each other’s files and may even offload tasks to a lightly loaded CPU to achieve some load balancing.
Tightly coupled MP
Uses a single storage shared by the various processors and a single operating system that controls all the processors and system hardware.
Symmetric MP
All of the processors are functionally equivalent and can perform I/O and
computation.
Multiprocessor organization
In order to have all CPUs work together, there must be some sort of organization. There are three ways to do that:
• Master/slave multiprocessing organization
• Separate executives organization
• Symmetric multiprocessing organization
Master/slave organization
One processor is designated as the master and the others are the slaves.
The master is a general purpose processor and performs input/output as
well as computation. The slave processors perform only computation.
The processors are considered asymmetric (not equivalent) since only the
master can do I/O as well as computation. Utilization of a slave may be
poor if the master does not service slave requests efficiently enough.
Another disadvantage is that I/O-bound jobs may not run efficiently, since only the master does I/O.
Separate executives organization
With this organization each processor has its own operating system and
responds to interrupts from users running on that processor. A process is
assigned to run on a particular processor and runs to completion.
It is possible for some of the processors to remain idle while other
processors execute lengthy processes. Some tables are global to the
entire system and access to these tables must be carefully controlled.
Each processor controls its own dedicated resources, such as files and I/O
devices.
Symmetric multiprocessing organization
All of the processors are functionally equivalent and can perform I/O and
computation. The operating system manages a pool of identical
processors, any one of which may be used to control any I/O devices or
reference any storage unit. Conflicts between processors attempting to
access the same storage at the same time are ordinarily resolved by
hardware. Multiple tables in the kernel can be accessed by different
processes simultaneously. Conflicts in access to systemwide tables are
ordinarily resolved by software. A process may be run at different times by any of the processors, and at any given time several processors may execute operating system functions in kernel mode.
Multiprocessor definitions
There are two ways of identifying separate processors. You can identify them by:
• the physical CPU number
• the logical CPU number
The lowest number starts from 0 on Power systems, but from 1 on IA-64. The physical numbers identify all processors on the system, regardless of their state, while the logical numbers identify enabled processors only. The Object Data Manager (ODM) names for processors are based on physical numbers with the prefix /proc. The table below illustrates this naming scheme for a three-processor Power system.
ODM name   Physical number   Logical number   Processor state
/proc0     0                 0                Enabled
/proc1     1                 -                Disabled
/proc2     2                 1                Enabled
Funneling
Some uniprocessor device drivers are not “thread-safe” or “MP safe”. In order to run them unchanged, their execution has to be “funneled” through one specific processor, which is called the MP master. Funneled code runs only on the master processor; therefore, the existing uniprocessor serialization is sufficient.
One processor is known as the default, or master, processor, and this concept is used for funneling. It is not a master processor in the sense of master/slave processing - the term is used only to designate which processor will be the default processor. It is defined by the value of MP_MASTER in the <sys/processor.h> file.
Note: funneling is NOT supported by the 64-bit kernel!
Funneling has the following characteristics:
• Interrupts for a funneled device driver are routed to the MP master CPU.
• Funneling is intended to support third-party device drivers and low-throughput device drivers.
• The base kernel provides binary compatibility for these device drivers.
• Funneling only works if all references to the device driver are through the device switch table.
MP safe
MP safe code will run on any processor. It is modified to prevent resource clashes by adding locking code in order to serialize its execution.
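The same idea in user-space terms: a hedged sketch using a POSIX mutex to serialize a shared update. Kernel code would use the kernel’s own locking services instead; this is illustration only.

/* Serializing access to shared data so the code path is MP safe. */
#include <pthread.h>

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_counter = 0;

void mp_safe_increment(void)
{
    pthread_mutex_lock(&counter_lock);    /* only one CPU at a time from here */
    shared_counter++;                     /* the resource being protected     */
    pthread_mutex_unlock(&counter_lock);
}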
MP efficient
MP efficient code is MP safe code that also has data locking mechanisms to serialize data access. This way it is easier to spread whatever the code does across the available CPUs.
MP efficient device drivers are intended for high-throughput device drivers.
Unit 5. Configuring System Dumps on AIX 5L
This lesson describes how to configure and take system dumps on a
node running the AIX5L operating system.
What You Should Be Able to Do
After completing this unit, you should be able to:
• Configure an AIX5L system to take a system dump
• Test the system dump configuration of an AIX5L system
• Verify the validity of a dump file
About This Lesson
Purpose
This lesson describes how to configure and take system dumps on a node
running the AIX5L operating system.
Objectives
At the completion of this lesson, you will be able to:
• Configure an AIX5L system to take a system dump
• Test the system dump configuration of an AIX5L system
• Verify the validity of a dump file
Table of contents
This lesson covers the following topics:

Topic                                  See Page
About This Lesson                             3
System Dump Facility in AIX5L                 5
Configuring for System Dumps                  7
Obtaining a Crash Dump                       16
Dump Status and completion codes             17
dumpcheck utility                            19
Verify the dump                              21
Packaging the dump                           26
Estimated length
This lesson takes approximately 1 hour to complete.
Accountability
You will be able to measure your progress with the following:
• Exercises using your lab system.
• Check-point activity
• Lesson review
Reference
Redbooks
Organization of this lesson
This lesson consists of information followed by exercises that allow you to
practice what you’ve just learned. Sometimes, as the information is being
presented, you are required to do something - pull down a menu, enter a
response, etc. This symbol, in the left hand side-head, is an indication that
an action is required.
System Dump Facility in AIX5L
Introduction
An AIX5L system can generate a system dump (or crash dump) when it encounters a severe system error, such as an exception in kernel mode that was unexpected or that the kernel cannot handle. A dump can also be initiated by the system administrator when the system has hung.
When an unexpected system halt occurs, the system dump facility automatically copies selected areas of kernel data to the primary dump device. These areas include kernel segment 0 as well as other areas registered in the Master Dump Table by kernel modules or kernel extensions. The system dump is a snapshot of the operating system state at the time of the crash or manually initiated dump.
The system dump facility provides a mechanism to capture sufficient information about the AIX5L kernel for later analysis. Once the preserved image is written to disk, the system is booted and returned to production. Analysis of the dump can then be done away from the production machine, at a convenient time and location, by a skilled kernel analyst.
Process
The process of taking a system dump is illustrated in the following chart. It involves two stages: in stage one, the contents of memory are copied to a temporary disk location. In stage two, AIX5L is booted and the memory image is moved to a permanent location in the /var/adm/ras directory.
Process, continued
[Chart: the dump process] System panics → memory dumper runs → stage 1: memory is copied to the disk location specified in the SWservAt ODM object class → system is booted → stage 2: copycore, started from rc.boot, copies the dump into /var/adm/ras → AIX5L back in production.
Configuring for System Dumps
Introduction
When the operating system is installed, parameters regarding the dump
device are configured with default settings. To ensure that a system dump
is taken successfully, the system dump parameters need to be configured
properly.
The system dump parameters are stored in system configuration objects
within the SWservAt ODM object class. Objects within the SWservAt
object class define where and how a system dump should be handled.
SWservAt object class
The SWservAt ODM object class is stored in the /etc/objrepos directory. Objects included within the object class are:

name            default             description
tprimary        /dev/hd6            Defines the temporary primary dump device. By default this is the primary paging space logical volume, hd6.
primary         /dev/hd6            Defines the permanent primary dump device. By default this is the primary paging space logical volume, hd6.
tsecondary      /dev/sysdumpnull    Defines the temporary secondary dump device. By default this is the device sysdumpnull.
secondary       /dev/sysdumpnull    Defines the permanent secondary dump device. By default this is the device sysdumpnull.
autocopydump    /var/adm/ras        Defines the directory the dump is copied to at system boot.
forcecopydump   TRUE                TRUE - If the copy to the copy directory fails, the system boot process will bring up a utility to copy the dump to removable media.
enable_dump     FALSE               FALSE - Disables the ability to force a sysdump using the dump key sequence or the reset button on systems without a key mode switch.
dump_compress   OFF                 OFF - Specifies that dumps will not be compressed.

Each object can be changed with the use of the sysdumpdev command.
sysdumpdev
The sysdumpdev command changes the settings of SWservAt objects. The command provides you with the ability to:
• Estimate the size of the system dump
• Select the primary and secondary dump devices
• Select the directory the dump will be copied to at boot
• Display information from the previous dump invocation
• Determine if a new system dump exists
• Display the current dump settings
Dump device selection rules
When selecting the primary or secondary dump device, the following rules must be observed:
• A mirrored paging space may be used as a dump device.
• Do not use a diskette drive as your dump device.
• If you use a paging device, only use hd6, the primary paging device.
Preparing for a system dump
To ensure that a system dump will be successfully captured, complete the following steps:

Step 1. Estimate the size of the dump. This can be done through SMIT by following the fast path:
# smit dump_estimate
Or, using the sysdumpdev command:
# sysdumpdev -e
(With compression turned on)
0453-041 Estimated dump size in bytes: 11744051
(With compression turned off)
0453-041 Estimated dump size in bytes: 58720256
Using the above example, the dump will require 12MB (with compression on) or 59MB (with compression off) of device storage. This value can change based on the activity of the system. It is best to run this command when the machine is under its heaviest workload. Size the dump device at four times the value reported by the sysdumpdev command in order to handle a system dump during peak system activity.
IA-64 systems - Compression must be turned off to gather a valid system dump. (Errata)
DUMPSPACE requirement for this system:
______MB * 4 = ______MB
Note: On AIX5L a new utility called dumpcheck has been created to monitor the system and verify that the resources are properly configured for a system dump. The utility is run as a cron job, and is located in the /usr/lib/ras directory. The time when the command is scheduled to run should be adjusted to when the peak system load is expected. Any warnings will be logged in the error log.
Preparing for a system dump, continued

Step 2. Create a primary dump device named dumplv. Calculate the required number of PPs for the dump device. Get the PP size of the volume group by using the lsvg command:
# lsvg rootvg
VOLUME GROUP:   rootvg                   VG IDENTIFIER:  db1010a
VG STATE:       active                   PP SIZE:        16 megabyte(s)
VG PERMISSION:  read/write               TOTAL PPs:      1626 (26016 megabytes)
MAX LVs:        256                      FREE PPs:       1464 (23424 megabytes)
LVs:            11                       USED PPs:       162 (2592 megabytes)
OPEN LVs:       8                        QUORUM:         2
TOTAL PVs:      3                        VG DESCRIPTORS: 3
STALE PVs:      0                        STALE PPs:      0
ACTIVE PVs:     3                        AUTO ON:        yes
MAX PPs per PV: 1016                     MAX PVs:        32
LTG size:       128 kilobyte(s)          AUTO SYNC:      no
HOT SPARE:      no
Determine the necessary number of PPs by dividing the estimated size of the dump by the PP size. For example:
236MB (59*4) / 16MB = 14.75 (required number is 15)
Create a logical volume of the required size, for example:
# mklv -y dumplv -t sysdump rootvg 15
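The round-up arithmetic above is easy to get wrong by hand. The small C helper below (an illustration only, using the example’s values) performs the same calculation:

/* Round-up calculation for the number of PPs a dump logical volume needs. */
#include <stdio.h>

int main(void)
{
    int dump_mb = 59;                            /* sysdumpdev -e estimate, in MB */
    int need_mb = dump_mb * 4;                   /* 4x peak-load factor (step 1)  */
    int pp_mb   = 16;                            /* PP SIZE reported by lsvg      */
    int pps     = (need_mb + pp_mb - 1) / pp_mb; /* 236/16 = 14.75, rounds to 15  */

    printf("mklv -y dumplv -t sysdump rootvg %d\n", pps);
    return 0;
}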
Preparing for a system dump, continued

Step 3. Verify the size of the device /dev/dumplv. Enter the following command:
# lslv dumplv
LOGICAL VOLUME: dumplv                 VOLUME GROUP:   rootvg
LV IDENTIFIER:  e59bd8                 PERMISSION:     read/write
VG STATE:       active/complete        LV STATE:       opened/syncd
TYPE:           dump                   WRITE VERIFY:   off
MAX LPs:        512                    PP SIZE:        16 megabyte(s)
COPIES:         1                      SCHED POLICY:   parallel
LPs:            15                     PPs:            15
STALE PPs:      0                      BB POLICY:      relocatable
INTER-POLICY:   minimum                RELOCATABLE:    no
INTRA-POLICY:   middle                 UPPER BOUND:    32
MOUNT POINT:    N/A                    LABEL:          None
MIRROR WRITE CONSISTENCY: off
EACH LP COPY ON A SEPARATE PV ?: yes
In this example, the dumplv logical volume contains 15 16MB partitions, giving a total size of 240MB.

Step 4. Assign the primary dump device by using the sysdumpdev command:
# sysdumpdev -p /dev/dumplv -P
primary              /dev/dumplv
secondary            /dev/sysdumpnull
copy directory       /var/adm/ras
forced copy flag     FALSE
always allow dump    FALSE
dump compression     OFF
Preparing for a system dump, continued

Step 5. Create a secondary dump device. The secondary dump device is used to back up the primary dump device. If an error occurs during a system dump to the primary dump device, the system attempts to dump to the secondary device (if it is defined). Create a logical volume of the required size, for example:
# mklv -y hd7 -t sysdump rootvg 15

Step 6. Assign the secondary dump device by using the sysdumpdev command:
# sysdumpdev -s /dev/hd7 -P
primary              /dev/dumplv
secondary            /dev/hd7
copy directory       /var/adm/ras
forced copy flag     FALSE
always allow dump    FALSE
dump compression     OFF
Preparing for a system dump, continued

Step 7. Verify that the filesystem containing the copy directory is large enough to handle a crash dump. Check the size of the copy directory filesystem with the following command:
# df -k /var
Filesystem    1024-blocks  Free   %Used  Iused  %Iused  Mounted on
/dev/hd9var   32768        31268  5%     143    64%     /var
In this example the /var filesystem is 32MB. To increase the size of the /var filesystem to 240MB, use the following command:
# chfs -a size=+240000 /var
Note: The default copy directory is /var/adm/ras. The rc.boot script is coded to check and mount the /var filesystem to support the copy of the system dump out of the dump device. If an alternate location is selected, modification of /sbin/rc.boot may be necessary. You will also be required to update the RAM filesystem with the bosboot command.
Portion of /sbin/rc.boot:
# Mount /var for copycore
echo "rc.boot: executing \"fsck -fp var\"" \
>>/../tmp/boot_log
fsck -fp /var
echo "rc.boot: executing \"mount /var\"" \
>>/../tmp/boot_log
mount /var
[ $? -ne 0 ] && loopled 0x518
# retrieve dump
echo "rc.boot: executing \"copycore\"" \
>>/../tmp/boot_log
copycore
umount /var
Preparing for a system dump, continued

Step 8. Configure the force copy flag. If paging space is being used as a dump device, the force copy flag must be set to TRUE. This forces the system boot sequence into menus that allow a copy of the dump to external media if the copy to the copy directory fails, giving you the opportunity to save the crash dump to removable media if the default copy directory is full or unavailable. To set the flag to TRUE, use the following command:
# sysdumpdev -PD /var/adm/ras

Step 9. Configure the allow system dump flag. To enable the reset button or dump key sequence to force a dump with the key in the normal position, or on a machine without a key mode switch, the allow system dump flag must be set to TRUE. To set the flag to TRUE, use the following command:
# sysdumpdev -KP

Step 10. Configure the compression flag. To enable compression of the system dump prior to being written to the dump device, the compression flag must be set to ON. To set the flag to ON, use the following command:
# sysdumpdev -CP
IA-64 systems - Compression must be turned off to gather a valid system dump (Errata):
# sysdumpdev -cP
Note: Turning the compression flag on causes the dump to be saved in a compressed form on the primary dump device. Also, the copycore utility will generate a compressed vmcore file, vmcore.x.Z.
Preparing for a system dump, continued

Step 11. Configure the system for autorestart. A useful system attribute is autorestart. If autorestart is TRUE, the system automatically reboots after a crash. This is useful if the machine is physically distant or often unattended. To list the system attributes, use the following command:
# lsattr -El sys0
To set autorestart to TRUE, use SMIT by following the fast path:
# smit chgsys
Or use the command:
# chdev -l sys0 -a autorestart='true'
Obtaining a Crash Dump
Introduction
AIX5L has been designed to automatically collect a system crash dump following a system panic. This section discusses the operator controls and procedure used to obtain a system dump.
User initiated dumps
Under unattended hang conditions, or for other debugging purposes, the system administrator may use different techniques to force a dump:
• Using the sysdumpstart -p command (primary dump device) or the sysdumpstart -s command (secondary dump device).
• Starting a system dump with the Reset button by doing the following (this procedure works for all system configurations and will work in circumstances where other methods for starting a dump will not):

Step 1. Turn the machine's mode switch to the Service position, or set Always Allow System Dump to TRUE.
Step 2. Press the Reset button. The system writes the dump information to the primary dump device.

Power PC - Press the Ctrl-Alt-1 key sequence to write the dump information to the primary dump device, or press the Ctrl-Alt-2 key sequence to write the dump information to the secondary dump device.
IA-64 - Press the Ctrl-Alt-NUMPAD1 key sequence to write the dump information to the primary dump device, or the Ctrl-Alt-NUMPAD2 key sequence to write the dump information to the secondary dump device.
Dump Status and completion codes
Progression status codes
A system crash causes a number of status codes to be displayed. When a system has crashed, the LEDs display a flashing 888. The system may display the code 0c9 for a short period of time, indicating a system dump is in progress. When the dump is complete, the dump status code changes to 0c0 if the system was able to dump successfully.
If the Low-Level Debugger (LLDB) is enabled, a c20 will appear in the LEDs, and an ASCII terminal connected to the s1 or s2 serial port will show an LLDB screen. Typing quit dump will initiate a dump.
During the dump process, the following progression status codes may be seen on the LED or LCD displays:

LED code       sysdumpdev status   Description
0c0            0                   Dump successful.
0c1            -4                  I/O error during dump.
0c4            -2                  Dump device is too small. Partial dump taken.
0c5            -3                  Internal dump error. It shows only when the dump facility itself fails; this does not include the failure of dump component routines.
0c8            -1                  No dump device defined.
0c2            N/A                 User-initiated dump in progress.
0c6            N/A                 User-initiated dump in progress to secondary dump device.
0c9            N/A                 System-initiated dump in progress.
0cc            N/A                 Dump process switched to secondary dump device.
Flashing 888   N/A                 System has crashed.
102            N/A                 This value indicates an unexpected system halt.
nnn            N/A                 This value is the cause of the system halt (reason code).
000            N/A                 Unexpected system interrupt (hardware related).
2xx            N/A                 Machine check.
Error log
If the dump was lost or was not saved during system boot, the error log can help determine the nature of the problem that caused the dump. To check the error log, use the errpt command.
Create a user initiated dump
Create a test dump by completing the following steps:

Step 1. Enter the following command:
# sysdumpstart -p
IA-64 systems - For a dump that is approximately 120MB in size, wait approximately 15 minutes before shutting down the machine.

Step 2. Reboot the system.
dumpcheck utility
Description
The /usr/lib/ras/dumpcheck utility is used to check the disk resources used
by the system dump facility. The command logs an error if either the
largest dump device is too small to receive the dump or there is insufficient
space in the copy directory when the dump device is a paging space.
Requirements
In order to be effective, the dumpcheck utility must be enabled:
• To verify that dumpcheck has been enabled, use the following command:
# crontab -l | grep dumpcheck
0 15 * * * /usr/lib/ras/dumpcheck >/dev/null 2>&1
• Enable the dumpcheck utility by using the -t flag. This creates an entry in the root crontab if none exists. For example, to set the dumpcheck utility to run at 2 PM:
# /usr/lib/ras/dumpcheck -t "0 14 * * *"
• dumpcheck should be run at the time the system is most heavily loaded, in order to find the maximum size the dump will take. The default time is set for 3 PM.
dumpcheck overview
The dumpcheck utility does the following when enabled:
• Estimates the dump or compressed dump size using sysdumpdev -e
• Finds the dump logical volumes and copy directory using sysdumpdev -l
• Estimates the primary and secondary dump device sizes
• Estimates the copy directory free space
• If the dump device is a paging space, verifies that the free space in the copy directory is large enough to hold the dump
• If the dump device is a logical volume, verifies that it is large enough to contain a dump
• If the dump device is a tape, exits without a message
Any time a problem is found, dumpcheck logs an entry in the error log and, if the -p flag is present, displays a message to stdout; when run from crontab, this means cron will mail the stdout to the root user.
Error log entry sample
The following is an example of an error log entry created by the dumpcheck utility because of lack of space in the primary and secondary dump devices:
---------------------------------------------------
LABEL:           DMPCHK_TOOSMALL
IDENTIFIER:      E87EF1BE

Date/Time:       Tue Aug 15 09:49:41 CDT
Sequence Number: 45
Machine Id:      000714834C00
Node Id:         wcs2
Class:           O
Type:            PEND
Resource Name:   dumpcheck

Description
The largest dump device is too small.

Probable Causes
Neither dump device is large enough to accommodate a
system dump at this time.

Recommended Actions
Increase the size of one or both dump devices.

Detail Data
Largest dump device
testdump
Largest dump device size in kb
8192
Current estimated dump size in kb
65536
----------------------------------------------------
Verify the dump
Description
Before submitting a dump to IBM for analysis, it is important to verify that the dump is valid and readable.
Locating the dump
To locate the dump, issue the following command:
# sysdumpdev -L
The following output shows a good dump:
0453-039
Device name:         /dev/dumplv
Major device number: 10
Minor device number: 2
Size:                8837632 bytes
Uncompressed Size:   32900935 bytes
Date/Time:           Fri Sep 22 13:01:41 PDT 2000
Dump status:         0
dump completed successfully
Dump copy filename:  /var/adm/ras/vmcore.0.Z
In this case a valid dump was safely saved by the system in the /var/adm/ras directory.
The following case shows the command output when the copy failed. Presumably the dump is available on the external media device, for example, tape.
0453-039
Device name:         /dev/dumplv
Major device number: 10
Minor device number: 2
Size:                8837632 bytes
Uncompressed Size:   32900935 bytes
Date/Time:           Fri Sep 22 13:01:41 PDT 2000
Dump status:         0
dump completed successfully
0481-195 Failed to copy the dump from /dev/dumplv to /var/adm/ras.
0481-198 Allowed the customer to copy the dump to external media.
Note: A dump saved on Initial Program Load (IPL) to external media is not sufficient for analysis. Additional files are required.
Dump analysis tools
To verify that the dump is valid, it must be examined by a kernel debugger. The kernel debugger used to validate the dump depends on the system architecture: if the system is running on Power PC, the debugger is kdb; the kernel debugger for IA-64 platforms is iadb.
Verifying the dump
The following procedure should be used to verify the dump:

Step 1. Locate the crash dump:
# sysdumpdev -L
0453-039
Device name:         /dev/dumplv
Major device number: 10
Minor device number: 2
Size:                8837632 bytes
Uncompressed Size:   32900935 bytes
Date/Time:           Fri Sep 22 13:01:41 PDT 2000
Dump status:         0
dump completed successfully
Dump copy filename:  /var/adm/ras/vmcore.0.Z

Step 2. Change directory to the dump location. In the above example:
# cd /var/adm/ras

Step 3. Decompress the vmcore file if necessary:
# uncompress vmcore.0.Z
Verifying the dump, continued

Step 4. Start the kernel debugger.
Power PC:
# kdb /var/adm/ras/vmcore.0
The specified kernel file is a UP kernel
vmcore.1 mapped from @ 70000000 to @ 71fdba81
Preserving 880793 bytes of symbol table
First symbol __mulh
KERNEXT FUNCTION NAME CACHE (90112 bytes) allocated
KERNEXT COMMANDS SPACE (4096 bytes) allocated
Component Names:
1) dmp_minimal [5 entries]
....
Dump analysis on CHRP_UP_PCI POWER_PC POWER_604 machine with 1 cpu(s) (32-bit registers)
Processing symbol table...
.......................done
(0)>
IA-64:
# iadb /var/adm/ras/vmcore.0
symbol capture using file: /unix
iadb: Probing a live system, with memfd as :4
Current Context:
cpu:0x1, thread slot: 77, process Slot: 51, ad space: 0x8e44
thrd ptr: 0xe00000972a13b000, proc ptr: e00000972a12e000
mst at:3ff002ff3b400
(1)>
Verifying the dump, continued

Step 5. Issue the stat subcommand to verify the details of the dump. Ensure the values are consistent with the dump that was taken.
Power PC:
(0)> stat
SYSTEM_CONFIGURATION:
CHRP_UP_PCI POWER_PC POWER_604 machine with 1 cpu(s) (32-bit registers)
SYSTEM STATUS:
sysname... AIX
nodename.. kca41
release... 0
version... 5
machine... 000930134C00
nid....... 0930134C
time of crash: Thu Oct 5 10:37:57 2000
age of system: 3 min., 11 sec.
xmalloc debug: disabled
IA-64:
(1)> stat
SYSTEM_CONFIGURATION:
IA64 machine with 2 cpu(s) (64-bit registers)
SYSTEM STATUS:
sysname... AIX
nodename.. kca40
hostname.. kca40.hil.sequent.com
release... 0
version... 5
machine... 000000004C00
nid....... 0000004c
current time: Fri Oct 6 12:20:56 2000
age of system: 1 day, 1 hr., 1 min., 43 sec.
xmalloc debug: disabled
Verifying the dump, continued

Step 6. Exit the kernel debugger.
Power PC:
(0)> q
IA-64:
(1)> q
Packaging the dump
Overview
Once a valid dump has been identified, the next step is to package the dump to be sent in for analysis.
Packaging the dump
The following procedure automatically collects the required files pertaining to the system dump:

Step 1. Compress the vmcore file:
# compress /var/adm/ras/vmcore.0

Step 2. Gather all of the files and information regarding the dump using the following command:
# snap -Dkg
Checking space requirement for general information............. done.
Checking space requirement for kernel information.......... done.
Checking space requirement for dump information..... done.
Checking for enough free space in filesystem... done.
********Checking and initializing directory structure
Creating /tmp/ibmsupt directory tree... done.
Creating /tmp/ibmsupt/dump directory tree... done.
Creating /tmp/ibmsupt/kernel directory tree... done.
Creating /tmp/ibmsupt/general directory tree... done.
Creating /tmp/ibmsupt/general/diagnostics directory tree... done.
Creating /tmp/ibmsupt/testcase directory tree... done.
Creating /tmp/ibmsupt/other directory tree... done.
********Finished setting up directory /tmp/ibmsupt
Gathering general system information........................ done.
Gathering kernel system information........... done.
Gathering dump system information...... done.
Packaging the dump, continued

Step 3. Copy the dump to external media. To copy the gathered files to the /dev/rmt0 tape device, issue the following command:
# snap -o /dev/rmt0
Once this command completes, the tape can be removed and sent in for analysis. Write protect the tape and label it appropriately.

Packaging a dump stored on external media
A dump saved to external media needs to be gathered with other files to provide a dump which is readable. To gather and pack the files, follow these steps:

Step 1. Create a skeleton directory to contain the dump information:
# snap -D
This will fail, stating that the dump device is no longer valid. Overcome this by restoring the dump from the media used on IPL to save the dump.

Step 2. Restore the dump from external media. For example, a dump saved to the /dev/rmt0 device is restored by the commands:
# cd /tmp/ibmsupt/dump
# tar -xvf /dev/rmt0
# mv dump_file dump

Step 3. Copy the dump to external media. To copy the gathered files to the /dev/rmt0 tape device, issue the following command:
# snap -o /dev/rmt0
Once this command completes, the tape can be removed and sent in for analysis. Write protect the tape and label it appropriately.
Unit 6. Introduction to Dump Analysis Tools
This lesson describes the different tools that are available to debug a system dump taken from an AIX5L system.
What You Should Be Able to Do
After completing this unit, you should be able to:
• Describe available tools for system dump analysis
• Invoke the IADB/iadb and KDB/kdb kernel debuggers
About This Lesson
Purpose
This lesson describes the different tools that are available to debug a
system dump taken from an AIX5L system.
Prerequisites
You should have completed the following lesson:
• Configuring System Dumps on AIX5L
Objectives
At the completion of this lesson, you will be able to:
• Describe available tools for system dump analysis
• Invoke the IADB/iadb and KDB/kdb kernel debuggers
Table of contents
This lesson covers the following topics:

Topic                                        See Page
About This Lesson                                   3
System Dump Analysis Tools                          7
dump components                                     8
Dump creation process                               9
Component dump routines                            10
bosdebug command                                   11
Memory Overlay Detection System                    12
System Hang Detection                              15
truss command                                      21
KDB kernel debugger                                24
kdb command                                        26
KDB miscellaneous sub commands                     27
KDB dump/display/decode sub commands               30
KDB modify memory sub commands                     34
KDB trace sub commands                             37
KDB break point and step sub commands              39
KDB name list/symbol sub commands                  43
Table of contents, continued

Topic                                                      See Page
KDB watch break point sub commands                               44
KDB machine status sub commands                                  46
KDB kernel extension loader sub commands                         48
KDB address translation sub commands                             50
KDB process/thread sub commands                                  51
KDB Kernel stack sub commands                                    59
KDB LVM sub commands                                             61
KDB SCSI sub commands                                            63
KDB memory allocator sub commands                                66
KDB file system sub commands                                     70
KDB system table sub commands                                    73
KDB network sub commands                                         78
KDB VMM sub commands                                             81
KDB SMP sub commands                                             87
KDB data and instruction block address translation sub commands  88
KDB bat/brat sub commands                                        90
IADB kernel debugger                                             91
iadb command                                                     93
Table of contents, continued

Topic                                        See Page
IADB break point and step sub commands             94
IADB dump/display/decode sub commands              97
IADB modify memory sub commands                   101
IADB name list/symbol sub commands                106
IADB watch break point sub commands               107
IADB machine status sub commands                  109
IADB kernel extension loader sub commands         111
IADB address translation sub commands             112
IADB process/thread sub commands                  113
IADB LVM sub commands                             115
IADB SCSI sub commands                            116
IADB memory allocator sub commands                117
IADB file system sub commands                     118
IADB system table sub commands                    119
IADB network sub commands                         120
IADB VMM sub commands                             121
IADB SMP sub commands                             123
IADB block address translation sub commands       124
IADB bat/brat sub commands                        125
IADB miscellaneous sub commands                   126
Exercise                                          128
Estimated length
This lesson takes approximately 1.5 hours to complete.
Accountability
You will be able to measure your progress with the following:
• Exercises using your lab system.
• Check-point activity
• Lesson review
Reference
• AIX5L docs
Organization of this lesson
This lesson consists of information followed by exercises that allow you to
practice what you’ve just learned. Sometimes, as the information is being
presented, you are required to do something - pull down a menu, enter a
response, etc. This symbol, in the left hand side-head, is an indication that
an action is required.
System Dump Analysis Tools
Introduction
AIX5L introduces new debugging tools. The main change from previous releases of AIX is that the crash command has been replaced by:
• the IADB and KDB kernel debuggers for live system debugging
• the iadb and kdb commands for system image analysis
In addition, the following tools/commands are available to assist you with debugging:
• bosdebug
• Memory Overlay Detection System (MODS)
• System Hang Detection
• truss
Typographic conventions
In the following sections, uppercase IADB and KDB refer to the live kernel debuggers, and lowercase iadb and kdb refer to the commands.
dump components
Introduction
In AIX5L, a dump image is not actually a full image of system memory, but a set of memory areas dumped by the dump process.
The Master Dump Table
A master dump table entry is a pointer to a function, provided by a kernel extension, that will be called by the kernel dump routine when a system dump occurs. These functions must return a pointer to a component dump table structure. The functions and the component dump table entries must both reside in pinned global memory. They are registered with the kernel using the dmp_add kernel service and unregistered using the dmp_del kernel service. Kernel-specific areas are preloaded by kernel initialization.
Component dump tables
Component dump tables are structures of type struct cdt. They are returned by the registered dump functions when the dump process starts. Each one is a structure made of:
• a CDT header
• an array of CDT entries
CDT header
The CDT header contains:
• a magic number, which can be one of:
  • DMP_MAGIC_32 for a 32-bit CDT
  • DMP_MAGIC_VR for a 32-bit CDT that may contain virtual or real addresses
  • DMP_MAGIC_64 for a 64-bit CDT
• the component dump name
• the length of the component dump table
CDT entries
CDT entries in the component dump tables will be one of cdt_entry64, cdt_entry_vr, or cdt_entry32, according to the DMP_MAGIC number, as defined in /usr/include/sys/dump.h.
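The sketch below is illustrative only: the field names and sizes are assumptions, and the authoritative definitions live in /usr/include/sys/dump.h.

/* Hedged sketch of a CDT header and a 64-bit CDT entry. Field names and
   sizes are illustrative; see /usr/include/sys/dump.h for the real ones. */
#include <stdint.h>

struct cdt_header_sketch {
    uint32_t magic;      /* DMP_MAGIC_32, DMP_MAGIC_VR, or DMP_MAGIC_64 */
    char     name[8];    /* component dump name                         */
    uint32_t len;        /* length of the component dump table          */
};

struct cdt_entry64_sketch {  /* one memory area to be dumped */
    char     name[8];    /* name of this data area                      */
    uint64_t len;        /* length of the area, in bytes                */
    uint64_t addr;       /* starting address of the area                */
};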
Dump creation process
Introduction
This section describes the dump process.
Process overview
The following steps are used to write a dump to the dump device (a C sketch of the table walk follows the list):

Step 1. Interrupts are disabled.
Step 2. 0c9 or 0c2 is written to the LED display, if present.
Step 3. Header information about the dump is written to the dump device.
Step 4. The kernel steps through each entry in the master dump table, calling each component dump routine twice:
• once to indicate that the kernel is starting to dump this component (1 is passed as a parameter)
• again to say that the dump process is complete (2 is passed)
Step 5. After the first call to a component dump routine, the kernel processes the CDT that was returned. For each CDT entry, the kernel:
• checks every page in the identified data area to see if it is in memory or paged out
• builds a bitmap indicating each page's status
• writes a header, the bitmap, and those pages which are in memory to the dump device
Once all dump routines have been called, the kernel enters an infinite loop, displaying 0c0 or flashing 888.
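Restating steps 4 and 5 as a hedged C sketch: every name below is invented for illustration, and dump_region() stands in for the bitmap-and-pages write described above.

/* Illustrative restatement of steps 4-5. All names are invented; the real
   kernel code is more involved (page-residency bitmaps, I/O, error paths). */
#include <stddef.h>

struct cdt_entry { int placeholder; };         /* one memory area (see above) */
struct cdt { size_t nentries; struct cdt_entry entry[]; };
struct mdt_entry { struct cdt *(*routine)(int phase); };

static void dump_region(struct cdt_entry *e)
{
    (void)e;   /* write header, page bitmap, and resident pages here */
}

void walk_master_dump_table(struct mdt_entry *mdt, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        struct cdt *cdt = mdt[i].routine(1);   /* phase 1: starting component */
        for (size_t e = 0; e < cdt->nentries; e++)
            dump_region(&cdt->entry[e]);       /* dump each registered area   */
        mdt[i].routine(2);                     /* phase 2: component finished */
    }
}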
Component dump routines
Description
Component dump routines:
• When called with a 1:
  • Make any necessary preparations for dumping. For example, they may read device-specific information from an adapter; the FDDI device driver does this.
  • Fill in the component dump table. Most device drivers do this during their initialization.
  • Return the address of the component dump table.
• When called with a 2:
  • Clean up after themselves. In reality, most routines either return immediately, do some debug printfs and then return, or else ignore the parameter entirely and return the same thing every time.
Note
A component dump routine may or may not do a lot of work when called with a 1. Many simply return the address of some previously-initialized CDT, but some (for example, the thread table and process table dump routines) actually build the CDT from scratch.
The original rationale for the second call to each dump routine was to provide notification that the dump process had finished with that component's dump data. In practice, however, no one really cares. The routines that just return an address don't even bother to look at the parameter they were passed. The routines that build the data on the fly look for a 2 and return immediately. The most that any routine today does with this second call is to issue some debug printf call. This is generally used to debug the component dump routine itself, by verifying that the system dump facility was able to successfully process its CDT.
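A minimal hedged skeleton of such a routine, following the two-phase protocol above; the cdt type and the registration comment are simplified stand-ins for the real declarations in <sys/dump.h>.

/* Skeleton of a component dump routine. Types and the registration calls
   are simplified for illustration; see <sys/dump.h> for the real ones. */

struct cdt { int placeholder; };   /* stand-in for the real struct cdt */

static struct cdt my_cdt;          /* pinned, filled in at init time   */

struct cdt *my_component_dump(int phase)
{
    if (phase == 1)
        return &my_cdt;            /* dump starting: hand back our CDT */
    return &my_cdt;                /* phase 2: nothing to clean up     */
}

/* Registration, typically at extension load time (sketch):
 *     dmp_add(my_component_dump);    and later    dmp_del(my_component_dump);
 */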
bosdebug command
Introduction
The bosdebug command can be used to enable or disable the MODS feature as
well as other kernel debugging parameters.
Any changes made with the bosdebug command will not take effect until the
system is rebooted.
bosdebug
parameters
The bosdebug command accept the following parameters :
• -I: Causes the kernel debug program to be loaded and invoked on each subsequent
  reboot.
• -D: Causes the kernel debug program to be loaded on each subsequent reboot.
• -M: Causes the memory overlay detection system to be enabled. Memory overlays in
  kernel extensions and device drivers will cause a system crash.
• -s sizelist: Causes the memory overlay detection system to promote each of the
  specified allocation sizes to a full page, and to allocate and hide the next
  subsequent page after each allocation, so that references beyond the end of the
  allocated memory cause a system crash. sizelist is a list of memory sizes
  separated by commas. Each size must be in the range from 16 to 2048, and must be
  a power of 2.
• -S: Causes the memory overlay detection system to promote all allocation sizes to
  the next higher multiple of the page size (4096), but does not hide subsequent
  pages. This improves the chances that references to freed memory will result in a
  crash, but it does not detect reads or writes beyond the end of allocated memory
  until that memory is freed.
• -n sizelist: Has the same effect as the -s option, but works instead for network
  memory. Each size must be in the range from 32 to 2048, and must be a power of 2.
  This causes the net_malloc_frag_mask variable of the 'no' command to be turned on
  during boot.
• -o: Turns off all debugging features of the system.
• -L: Displays the current settings for the kernel debug program and the memory
  overlay detection system.
• -R on | off: Sets the real-time extensions for multiprocessor systems only.
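For example, a session that enables MODS with promoted 32- and 64-byte allocations
might look like the following (illustrative; the reported settings are omitted):

# bosdebug -M -s 32,64   <== enable MODS; promote 32- and 64-byte allocations
# bosdebug -L            <== display the current settings
# shutdown -Fr           <== the changes take effect at the next reboot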
Memory Overlay Detection System
Introduction
The Memory Overlay Detection System (MODS) helps detect memory overlay
problems in the kernel, kernel extensions, and device drivers. The MODS can be
enabled using the bosdebug command.
Problems detected
Some of the most difficult types of problems to debug are what are generally
called "memory overlays." Memory overlays include the following:
• Writing to memory that is owned by another program or routine
• Writing past the end (or before the beginning) of declared variables or arrays
• Writing past the end (or before the beginning) of dynamically-allocated
memory
• Writing to or reading from freed memory
• Freeing memory twice
• Calling memory allocation routines with incorrect parameters or under
incorrect conditions.
In the kernel environment (including the kernel, kernel extensions, and
device drivers), memory overlay problems have been especially difficult to
debug because tools for finding them have not been available. Starting
with Version 4.2.1, however, the Memory Overlay Detection System
(MODS) helps detect memory overlay problems in the kernel, kernel
extensions, and device drivers.
Note: This feature does not detect problems in application code; it only
watches kernel and kernel extension code.
When to use MODS
This feature is useful in the following circumstances:
• When you are developing your own kernel extensions or device drivers and want to
  test them thoroughly.
• When IBM technical support asks you to turn this feature on to help in further
  diagnosing a problem that you are experiencing.
How MODS works
The primary goal of the MODS feature is to produce a dump file that
accurately identifies the problem.
MODS works by turning on additional checking to help detect the
conditions listed above. When any of these conditions is detected, your
system crashes immediately and produces a dump file that points directly
at the offending code. (Previously, a system dump might point to unrelated
code that happened to be running later when the invalid situation was
finally detected.)
If your system crashes while the MODS is turned on, then MODS has most
likely done its job.
To make it easier to detect that this situation has occurred, the IADB/
iadb and KDB/kdb commands have been extensively modified. The
stat subcommand now displays both:
• Whether the MODS (also called "xmalloc debug") has been turned on
• Whether this crash was the result of the MODS detecting an incorrect
situation.
The xmalloc subcommand provides details on exactly what memory
address (if any) was involved in the situation, and displays mini-tracebacks
for the allocation and/or free of this memory.
Similarly, the netm command displays allocation and free records for
memory allocated using the net_malloc kernel service (for example,
mbufs, mclusters, etc.).
You can use these commands, as well as standard crash techniques, to
determine exactly what went wrong.
MODS limitations
There are limitations to the Memory Overlay Detection System. Although it
significantly improves your chances, MODS cannot detect all memory
overlays. Also, turning MODS on has a small negative impact on overall
system performance and causes somewhat more memory to be used in
the kernel and the network memory heaps. If your system is running at full
CPU utilization, or if you are already near the maximums for kernel
memory usage, turning on the MODS may cause performance
degradation and/or system hangs.
Our practical experience with the MODS, however, is that the great
majority of customers will be able to use it with minimal impact to their
systems.
MODS and kdb
If a system crash occurs due to a MODS-detected problem, the kdb xm subcommand can
display status and traces for memory overlay problems.
System Hang Detection
Introduction
System hang management allows users to run mission-critical applications continually
while improving application availability. System hang detection alerts the system
administrator to possible problems and then allows the administrator to log in as
root or to reboot the system to resolve the problem.
System Hang Detection
All processes (also known as threads) run at a priority. The priority is numerically
inverted in the range 40-126: 40 is the highest priority and 126 is the lowest. The
default priority for all threads is 60. The priority of a process can be lowered by
any user with the nice command. Anyone with root authority can also raise a
process's priority.
The kernel scheduler always picks the highest priority runnable thread to
put on a CPU. It is therefore possible for a sufficient number of high priority
threads to completely tie up the machine such that low priority threads can
never run. If the running threads are at a priority higher than the default of
60, this can lock out all normal shells and logins to the point where the
system appears hung.
The System Hang Detection (SHD) feature provides a mechanism to
detect this situation and allow the system administrator a means to
recover. This feature is implemented as a daemon (shdaemon) that runs at
the highest process priority. This daemon queries the kernel for the lowest
priority thread run over a specified interval. If the priority is above a
configured threshold, the daemon can take one of several actions. Each of
these actions can be independently enabled, and each can be configured
to trigger at any priority and over any time interval. The actions and their
defaults are:
Action                     Default    Default    Default        Default Timeout
                           Enabled    Priority   Device         (Seconds)
Log an error in errlog     disabled   60         -              120
Display a warning message  disabled   60         /dev/console   120
Give a recovery getty      enabled    60         /dev/tty0      120
Launch a command           disabled   60         -              120
Reboot the system          disabled   39         -              300
shconf Script
The shconf command is invoked when System Hang Detection is enabled. shconf
configures which events are surveyed and what actions are to be taken if such events
occur.
The user can specify the five actions described below, the priority level to check,
the time-out during which no process or thread executes at a lower or equal
priority, and the terminal device for the warning action and the getty action:
• Log an error in the error log file
• Display a warning message on the system console (alphanumeric
console) or on a specified TTY
• Reboot the system
• Give a special getty to allow the user to log in as root and launch
commands
• Launch a command
For the Launch a command and Give a special getty options,
SHD will launch the special getty or the specified command at the highest
priority. The special getty will print a warning message specifying that it is a
recovering getty running at priority 0. The following table lists the default
values when the SHD is enabled. Only one action is enabled per type of
detection.
Note: When Launch a recovering getty on a console is
enabled, the shconf script adds the -u flag to the getty line in the inittab
that is associated with the console login.
Process
The shdaemon is in charge of detecting a system hang. It retrieves configuration
information, initializes working structures, and starts the detection timers set by
the user.
The shdaemon is started by init with a priority of zero.
The shdaemon entry is set to off or respawn in the inittab each time the shconf
command disables or enables the sh_pp option.
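As an illustration, the inittab entry managed by shconf might look roughly like the
following; the exact fields shown here are hypothetical:

# grep shdaemon /etc/inittab
shdaemon:2:respawn:/usr/sbin/shdaemon   <== respawn when sh_pp is enabled, off otherwise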
SMIT Interface
You can manage the SHD configuration from the SMIT System
Environments menu. From the System Environments menu, select
Manage System Hang Detection. The options in this menu allow
system administrators to enable or disable the detection mechanism.
Configuration of the SHD Options
The shconf command can be used to configure System Hang Detection.
The following parameters may be used with shconf:
• -d: Displays the System Hang Detection status.
• -R -l prio: Resets effective values to the defaults.
• -D[O] -l prio: Displays the default values (the optional O outputs values
  separated by colons).
• -E[O] -l prio: Displays the effective values (the optional O outputs values
  separated by colons).
• -l prio [-a Attribute=Value]: Changes the Attribute to the new Value.
The following options can be used to customize the System Hang Detection:
name        default        description
sh_pp       enable         Enable Process Priority Problem
pp_errlog   disable        Log Error in the Error Logging
pp_eto      2              Detection Time-out
pp_eprio    60             Process Priority
pp_warning  disable        Display a warning message on a console
pp_wto      2              Detection Time-out
pp_wprio    60             Process Priority
pp_wterm    /dev/console   Terminal Device
pp_login    enable         Launch a recovering login on a console
pp_lto      2              Detection Time-out
pp_lprio    56             Process Priority
pp_lterm    /dev/tty0      Terminal Device
pp_cmd      disable        Launch a command
pp_cto      2              Detection Time-out
pp_cprio    60             Process Priority
pp_cpath    /              Script
pp_reboot   disable        Automatically REBOOT system
pp_rto      5              Detection Time-out
pp_rprio    39             Process Priority
example
The following output represents various uses of the shconf command:

# shconf -R -l prio                          <== restore default values
shconf: Default Problem Conf is restored.
shconf: Priority Problem Conf has changed.
# shconf -D -l prio                          <== display default values
sh_pp      disable       Enable Process Priority Problem
pp_errlog  disable       Log Error in the Error Logging
pp_eto     2             Detection Time-out
pp_eprio   60            Process Priority
pp_warning disable       Display a warning message on a console
pp_wto     2             Detection Time-out
pp_wprio   60            Process Priority
pp_wterm   /dev/console  Terminal Device
pp_login   enable        Launch a recovering login on a console
pp_lto     2             Detection Time-out
pp_lprio   56            Process Priority
pp_lterm   /dev/tty0     Terminal Device
pp_cmd     disable       Launch a command
pp_cto     2             Detection Time-out
pp_cprio   60            Process Priority
pp_cpath   /             Script
pp_reboot  disable       Automatically REBOOT system
pp_rto     5             Detection Time-out
pp_rprio   39            Process Priority
# shconf -l prio -a pp_lterm=/dev/console    <== change terminal device to /dev/console
shconf: Priority Problem Conf has changed.
# shconf -l prio -a sh_pp=enable             <== enable priority problem detection
shconf: Priority Problem Conf has changed.
# ps -ef | grep shd                          <== verify the shdaemon has been started
    root  4982     1   0 17:08:17      -  0:00 /usr/sbin/shdaemon
    root  9558  9812   1 17:08:22      0  0:00 grep shd
truss command
Description
The truss command executes a specified command, or attaches to listed process
IDs, and produces a trace of the system calls, received signals, and machine faults
a process incurs. Each line of the trace output reports either the Fault or Signal
name, or the Syscall name with parameters and return values. The subroutines
defined in system libraries are not necessarily the exact system calls made to the
kernel. The truss command does not report these subroutines, but rather, the
underlying system calls they make. When possible, system call parameters are
displayed symbolically using definitions from relevant system header files. For
path name pointer parameters, truss displays the string being pointed to. By
default, undefined system calls are displayed with their name, all eight possible
arguments and the return value in hexadecimal format.
Options
The following options can be used on the truss command line:
Option                Description
-a                    Displays the parameter strings passed in each system call.
-c                    Counts traced system calls, faults, and signals rather than
                      displaying trace results line by line. A summary report is
                      produced.
-e                    Displays the environment strings which are passed in each
                      executed system call.
-f                    Follows all children created by the fork system call.
-i                    Keeps interruptible sleeping system calls from being
                      displayed. Causes system calls to be reported only once, upon
                      completion.
-m [!]Fault           Machine faults to trace/exclude. Faults may be specified by
                      name or number (see the sys/fault.h header file). The default
                      is -mall.
-o Outfile            Designates the file to be used for the trace output.
-p                    Interprets the parameters to truss as a list of process IDs
                      for existing processes rather than as a command to be
                      executed. truss takes control of each process and begins
                      tracing it.
-r [!]FileDescriptor  Displays the full contents of the I/O buffer for each read on
                      any of the specified file descriptors. The output is formatted
                      32 bytes per line and shows each byte either as an ASCII
                      character (preceded by one blank) or as a two-character C
                      language escape sequence for control characters. If ASCII
                      interpretation is not possible, the byte is shown in
                      two-character hexadecimal. The default is -r!all.
-s [!]Signal          Permits listing signals to trace/exclude. The trace output
                      reports the receipt of each specified signal even if the
                      signal is being ignored, but not blocked, by the process.
                      Blocked signals are not received until the process releases
                      them. Signals may be specified by name or number (see
                      sys/signal.h). The default is -s all.
-t [!]Syscall         Includes/excludes system calls from the trace. The default is
                      -tall.
-w [!]FileDescriptor  Displays the contents of the I/O buffer for each write on any
                      of the listed file descriptors (see -r). The default is
                      -w!all.
-x [!]Syscall         Displays data from the specified parameters of traced system
                      calls in raw format, usually hexadecimal, rather than
                      symbolically. The default is -x!all.
Each option requiring a list must contain a list separated by commas. You can use
“all”/”!all” to include/exclude all possible values of the list.
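For instance, the following invocations illustrate common option combinations
(mycmd and the process id shown are hypothetical):

# truss -f -o /tmp/truss.out mycmd   <== trace mycmd and its children into a file
# truss -c mycmd                     <== count system calls, faults, and signals only
# truss -t open,close,read -p 4982   <== trace selected calls of a running process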
truss output example
The following output represents an example of the use of the truss command:
# truss -a -e -i ... -r all -w all -o ls.out ls
# more ls.out
execve("/usr/bin/ls", 0x2FF..., 0x2FF...)  argc: ...
 argv: ls
 envp: _=/usr/bin/truss LANG=C LOGIN=root
  NLSPATH=/usr/lib/nls/msg/%L/%N:/usr/lib/nls/msg/%L/%N.cat
  PATH=/usr/bin:/etc:/usr/sbin:/usr/ucb:/usr/bin/X11:/sbin
  LC__FASTMSG=true LOGNAME=root MAIL=/usr/spool/mail/root
  LOCPATH=/usr/lib/nls/loc USER=root AUTHSTATE=compat
  SHELL=/usr/bin/ksh ODMDIR=/etc/objrepos HOME=/ TERM=aixterm
  MAILMSG=[YOU HAVE NEW MAIL] PWD=/home/alex TZ=PST8PDT
__get_kernel_tod_ptr(0x..., 0x...)       = ...
getuidx(...)                             = 0
kioctl(...)                              = ...
sbrk(0x...)                              = ...
brk(0x...)                               = ...
stat(0x..., 0x2FF...)                    = ...
open(..., O_RDONLY)                      = ...
getdirent(...)                           = ...
lseek(...)                               = ...
kfcntl(..., F_GETFD, ...)                = ...
kfcntl(..., F_SETFD, ...)                = ...
getdirent(...)                           = ...
close(...)                               = ...
kwrite(..., "ls.out\n", ...)             = ...
kfcntl(..., F_GETFL, ...)                = ...
close(...)                               = ...
_exit(...)
KDB kernel debugger
Introduction
KDB is the kernel debugger used on AIX 5L running on Power systems.
Availability
The kernel debugger must be enabled in order to be used on AIX 5L.
The following command should return 00000001 if the kernel debugger is enabled:
#kdb
(0)> dw kdb_avail
kdb_avail+000000: 00000001 00000000 00000000 00000000
Overview
The major functions of the KDB are:
• Setting breakpoints within the kernel or kernel extensions
• Execution control through various forms of step commands
• Formatted display of selected kernel data structures
• Display and modification of kernel data
• Display and modification of kernel instructions
• Modification of the state of the machine through alteration of system registers
Loading KDB
In AIX 5L, the KDB is included in all unix kernels found in /usr/lib/boot. In order
to use it, the KDB must be loaded at boot time. To allow KDB to load, use one of the
following commands:
• bosboot -a -D -d /dev/ipldevice, or bosdebug -D: will load KDB at boot time.
• bosboot -a -I -d /dev/ipldevice, or bosdebug -I: will load and invoke the KDB at
  boot time.
• bosboot -ad /dev/ipldevice, or bosdebug -o: will not load or invoke the KDB at
  boot time.
You must reboot the system for these changes to take effect.
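For example, the following illustrative session builds a boot image that loads KDB
and verifies the result after the reboot:

# bosboot -a -D -d /dev/ipldevice   <== build a boot image that loads KDB
# shutdown -Fr                      <== reboot to take the change into account
...
# kdb
(0)> dw kdb_avail                   <== should now display 00000001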
Starting KDB
The KDB interface may be started, if loaded, under the following circumstances:
• If bosboot or bosdebug was run with -I, the tty attached to a native serial port
  shows the KDB prompt just after the kernel is loaded.
• You may invoke the KDB manually from a tty attached to a native serial port using
  Ctrl-4 or Ctrl-\, or from a native keyboard using Ctrl-Alt-Numpad4.
• An application makes a call to the breakpoint() kernel service or to the
  breakpoint system call.
• A breakpoint previously set using the KDB has been reached.
• A fatal system error occurs. A dump might be generated on exit from the KDB.
KDB Concept
When the KDB Kernel Debugger is invoked, it is the only running program until you
exit the KDB or use the start sub command to start another CPU. All processes are
stopped and interrupts are disabled. The KDB Kernel Debugger runs with its own
Machine State Save Area (mst) and a special stack. In addition, the KDB Kernel
Debugger does not run operating system routines. Though this requires that kernel
code be duplicated within KDB, it is possible to break anywhere within the kernel
code. When exiting the KDB Kernel Debugger, all processes continue to run unless the
debugger was entered via a system halt.
kdb command
Introduction
The kdb command, unlike the KDB kernel debugger, allows examination of an operating
system image on Power systems.
The kdb command may be used on a running system but does not provide all the
functions available with the KDB kernel debugger.
Parameters
The kdb command may be used with the following parameters:
• no parameter: kdb uses /dev/mem as the system image file and /usr/lib/boot/unix as
  the kernel file. In this case root permissions are required.
• -m system_image_file: kdb uses the image file provided.
• -u kernel_file: kdb uses the specified kernel file. This is required to analyze a
  system dump from a system with a different level of unix.
• -k kernel_modules: a comma-separated list of kernext symbols to add.
• -w: view an XCOFF object.
• -v: print CDT entries.
• -h: print help.
• -l: disable the inline 'more'; useful for running non-interactive sessions.
Loading errors
If the system image file provided doesn’t contain a valid dump or the kernel file
doesn’t match the system image file, the following message may be issued by the
kdb command:
# kdb -m dump_file -u /usr/lib/boot/unix
The specified kernel file is a 64-bit kernel
core mapped from @ 700000000000000 to @ 7000000000120a7
Preserving 884137 bytes of symbol table
First symbol __mulh
KERNEXT FUNCTION NAME CACHE (90112 bytes) allocated
KERNEXT COMMANDS SPACE (8192 bytes) allocated
Component Dump Table not found.
Kernel not included in this dump.
dump core corrupted
make sure /usr/lib/boot/unix refers to the running kernel
KDB miscellaneous sub commands
Introduction
The following table shows the miscellaneous sub commands and their matching
crash/lldb sub commands, when available:

machdep function                             crash/lldb   KDB        kdb
reboot the machine                           reboot       reboot     N/A
display help                                 ?/help       ?          ?
run an aix command                           !            !          !
exit                                         q/go         q          go
set debugger parameters                      -            set        -
display elapsed time                         -            time       -
enable/disable debug                         -            debug      N/A
calculate/convert a hexadecimal expression   calc/conv    hcal/cal   hcal/cal
calculate/convert a decimal expression       calc/conv    dcal       dcal

reboot sub command
The reboot subcommand can be used to reboot the machine. This subcommand
issues a prompt for confirmation that a reboot is desired before executing the
reboot. If the reboot request is confirmed, the soft reboot interface is called
(sr_slih(1)).
! sub command
The ! sub command allows the user to run an aix command without leaving the kdb or
KDB kernel debugger.
? sub command
The help or ? sub command can be used to display a long sub command listing or to
display help by subject.
Help for a particular sub command can be displayed by entering the sub command
followed by ?.
q sub command
For the KDB Kernel Debugger, this subcommand exits the debugger with all
breakpoints installed in memory. To exit the KDB Kernel Debugger without
breakpoints, the ca subcommand should be invoked to clear all breakpoints prior
to leaving the debugger.
The optional dump argument can be specified to force an operating system dump.
The method used to force a dump depends on how the debugger was invoked.
set sub command
The set sub command can be used to toggle the kdb parameters. set accepts the
following parameters:
• none: will display the current parameters
• 1: no_symbol
• 2: mst_wanted
• 3: screen_size
• 4: power_pc_syntax
• 5: origin
• 6: Unix symbols start from
• 7: hexadecimal_wanted
• 8: screen_previous
• 9: display_stack_frames
• 10: display_stacked_regs
• 11: 64_bit
• 12: ldr_segs_wanted
• 13: emacs_window
• 14: Thread attached local breakpoint
• 15: KDB stops all processors
• 17: kext_IF_active
• 18: trace_back_lookup
• 19: IPI_enable
time sub command
The time command can be used to determine the elapsed time from the last time
the KDB Kernel Debugger was left to the time it was entered.
debug sub command
The debug subcommand may be used to print additional information during KDB
execution; the primary use of this subcommand is to aid in ensuring that the
debugger is functioning properly. The debug sub command can be used with the
following arguments:
• no argument: will display the current debug flags
• dbg1++/dbg1--: set/unset FM HW lookup debug
• dbg2++/dbg2--: set/unset vmm tr/tv cmd debug
• dbg3++/dbg3--: set/unset vmm SW lookup debug
• dbg4++/dbg4--: set/unset symbol lookup debug
• dbg5++/dbg5--: set/unset stack trace debug
• dbg61++/dbg61--: set/unset BRKPT debug (list)
• dbg62++/dbg62--: set/unset BRKPT debug (instr)
• dbg63++/dbg63--: set/unset BRKPT debug (suspend)
• dbg64++/dbg64--: set/unset BRKPT debug (phantom)
• dbg65++/dbg65--: set/unset BRKPT debug (context)
• dbg71++/dbg71--: set/unset DABR debug (address)
• dbg72++/dbg72--: set/unset DABR debug (register)
• dbg73++/dbg73--: set/unset DABR debug (status)
• dbg81++/dbg81--: set/unset BRAT debug (address)
• dbg82++/dbg82--: set/unset BRAT debug (register)
• dbg83++/dbg83--: set/unset BRAT debug (status)
hcal/dcal sub commands
The hcal subcommand evaluates hexadecimal expressions and displays the
result in both hex and decimal.
The dcal subcommand evaluates decimal expressions and displays the result in
both hex and decimal.
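For example (illustrative; the exact output format may differ):

(0)> hcal 50+50      <== add two hexadecimal values
Value hexa: 000000A0      Value decimal: 160
(0)> dcal 160        <== convert a decimal value
Value hexa: 000000A0      Value decimal: 160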
KDB dump/display/decode sub commands
Introduction
The following table shows the dump/display/decode sub commands and their matching
crash/lldb sub commands, when available:

dump/display/decode function   crash/lldb                   KDB          kdb
display byte data              display                      d            d
display word data              od (2 units)/display         dw           dw
display double word data       od (4 units)/display         dd           dd
display code                   id/decode/od (format I)/un   dc/dpc/dis   dc/dpc/dis
display registers              float/sregs                  dr           dr
display device byte            -                            ddvb/ddpb    N/A
display device half word       -                            ddvh/ddph    N/A
display device word            -                            ddvw/ddpw    N/A
display device double word     -                            ddvd/ddpd    N/A
display physical memory        display                      dp/dpw/dpd   dp/dpw/dpd
find pattern                   find                         find/findp   find/findp
extract pattern                link                         ext/extp     ext/extp

d/dw/dd/dp/dpw/dpd sub commands
d/dw/dd/dp/dpw/dpd sub commands are used to display memory with the following sizes:
• d, dp: display bytes
• dw, dpw: display words
• dd, dpd: display double words
Addresses are specified by:
• virtual addresses for d, dw and dd
• physical addresses for dp, dpw and dpd
These sub commands accept the following arguments:
• Address - starting address of the area to be dumped. Hexadecimal values, or
  hexadecimal expressions, can be used in specification of the address.
• count - number of bytes (d, dp), words (dw, dpw), or double words (dd, dpd) to be
  displayed. The count argument is a hexadecimal value.
dc/dpc/dis sub commands
The display code subcommands dc, dis and dpc may be used to decode instructions. The
address argument for the dc subcommand is an effective address. The address argument
for the dpc subcommand is a physical address. They accept the following arguments:
• Address - address of the code to disassemble. This can either be a
virtual (effective) or physical address, depending on the subcommand
used. Symbols, hexadecimal values, or hexadecimal expressions can be
used in specification of the address.
• count - indicates the number of instructions to be disassembled. The
value specified must be a decimal value or decimal expression.
ddvb/ddvh/ddvw/ddvd/ddpb/ddph/ddpw/ddpd sub commands
IO space memory (Direct Store Segment (T=1)) cannot be accessed when translation is
disabled. BAT-mapped areas must also be accessed with translation enabled, else
cache controls are ignored.
The subcommands ddvb, ddvh, ddvw and ddvd can be used to access these areas in
translated mode, using an effective address already mapped.
The subcommands ddpb, ddph, ddpw and ddpd can be used to access these areas in
translated mode, using a physical address that will be mapped.
On 64-bit machines, correctly aligned double words are accessed (ddpd and ddvd) in a
single load (ld) instruction.
The DBAT interface is used to translate the address in cache-inhibited mode
(PowerPC only).
These sub commands use the following parameters:
• Address - address of the starting memory area to display. This can either be an
  effective or real address, depending on the subcommand used. Symbols, hexadecimal
  values, or hexadecimal expressions can be used in specification of the address.
• count - number of bytes (ddvb, ddpb), half words (ddvh, ddph), words (ddvw, ddpw),
  or double words (ddvd, ddpd) to display. The count argument is a hexadecimal
  value.
find/findp sub commands
The find and findp subcommands can be used to search for a specific pattern in
memory. The find subcommand requires an effective address for the address
argument, whereas the findp subcommand requires a real address. find and findp
accept the following parameters:
• -s - flag indicating that the pattern to be searched for is an ASCII string
• Address: address where the search is to begin. This can either be a virtual
(effective) or physical address, depending on the subcommand used. Symbols,
hexadecimal values, or hexadecimal expressions can be used in specification of
the address.
• string: ASCII string to search for if the -s option is specified.
• pattern: hexadecimal value specifying the pattern to search for. The pattern is
limited to one word in length.
• mask: if a pattern is specified, a mask can be specified to eliminate bits from
consideration for matching purposes. This argument is a one word hexadecimal
value.
• delta: increment to move forward after an unsuccessful match. This argument
is a one word hexadecimal value.
ext/extp sub commands
The ext and extp subcommands can be used to display a specific area from a
structure. If an array exists, it can be traversed displaying the specified area for
each entry of the array. These subcommands can also be used to traverse a linked
list displaying the specified area for each entry.
For the ext subcommand the address argument specifies an effective address. For
the extp subcommand the address argument specifies a physical address.
ext and extp accept the following arguments:
• -p: flag to indicate that the delta argument is the offset to a pointer to the next
area.
• Address: address at which to begin display of values. This can either be a
virtual (effective) or physical address depending on the subcommand used.
Symbols, hexadecimal values, or hexadecimal expressions can be used in
specification of the address.
• delta: offset to the next area to be displayed or offset from the beginning of the
current area to a pointer to the next area. This argument is a hexadecimal value.
• size: hexadecimal value specifying the number of words to display.
• count: hexadecimal value specifying the number of entries to traverse
dr sub command
The display registers sub command can be used to display:
• gp: general purpose registers
• sr: segment registers
• sp: special purpose registers
• fp: floating point registers
• register name: individual registers
The current context is used to locate the values to display. The switch sub command
can be used to change context to other threads.
examples
The following shows examples of the use of the display sub commands:
# hostname <== get the hostname
oc3b42
# ctrl-\ <== enter the kdb
KDB(0)> find -s 0 oc3b42
utsname+000020: 6F63 3362 3432 0033 3443 3030 0000 0000 oc3b42.34C00....
KDB(0)> dd utsname 3 <== display 3 double words of utsname
utsname+000000: 4149580000000000 0000000000000000 AIX.............
utsname+000010: 0000000000000000 0000000000000000................
utsname+000020: 6F63336234320033 3443303000000000 oc3b42.34C00....
(0)> dr sp <== display the special purposes registers in current context
iar : 000000000000B65C msr : A0000000000090B2 cr : 44284448
lr : 000000000001C950 ctr : 0000000000000020 xer : 0EB8C400
mq : DEADBEEF asr : 000000000EB8E001
dsisr: 00000000 dar : 0000000000000000 dec : 00000000
sdr1: 0000000000000000 srr0: 0000000000000000 srr1: 0000000000000000
dabr: 0000000000000000 tbu : 00000000 tbl : 00000000
sprg0: 0000000000000000 sprg1: 0000000000000000
sprg2: 0000000000000000 sprg3: 0000000000000000
pir : 00000000 pvr : 00000000 ear : 00000000
hid0: 00000000 iabr: 0000000000000000
buscsr: 0000000000000000 l2cr: 0000000000000000 l2sr: 0000000000000000
via : 0000000000000000 sda : 0000000000000000
mmcr0: 00000000 mmcr1: 00000000
pmc1: 00000000 pmc2: 00000000 pmc3: 00000000 pmc4: 00000000
pmc5: 00000000 pmc6: 00000000 pmc7: 00000000 pmc8: 00000000
KDB modify memory sub commands
Introduction
The following table shows the modify memory sub commands and their matching
crash/lldb sub commands, when available:

modify memory function          crash/lldb     KDB          kdb
modify sequential bytes         alter -c/stc   m            N/A
modify sequential word          alter -w/st    mw           N/A
modify sequential double word   alter -l       md           N/A
modify sequential half word     sth            sth          N/A
modify registers                set            mr           N/A
modify device byte              -              mdvb/mdpb    N/A
modify device half word         -              mdvh/mdph    N/A
modify device double word       -              mdvd/mdpd    N/A
modify physical memory          -              mp/mpw/mpd   N/A

m/mp/mw/mpw/md/mpd sub commands
m/mp/mw/mpw/md/mpd sub commands are used to modify memory with the following sizes:
• m, mp: modify bytes
• mw, mpw: modify words
• md, mpd: modify double words
Addresses are specified by:
• virtual addresses for m, mw and md
• physical addresses for mp, mpw and mpd
These sub commands accept the following arguments:
• Address - starting address of the area to be modified. Hexadecimal values, or
  hexadecimal expressions, can be used in specification of the address.
The sub commands will prompt for new values until a “.” value is entered.
mr sub command
The mr sub command can be used to modify general purpose, segment, special, or
floating point registers. Individual registers can also be selected for modification
by register name. The current thread context is used to locate the register values to
be modified. The switch sub command can be used to change context to other
threads. When the register being modified is in the mst context, KDB alters the
mst. When the register being modified is a special one, the register is altered
immediately. Symbolic expressions are allowed as input.
The following arguments can be used:
• gp - modify general purpose registers.
• sr - modify segment registers.
• sp - modify special purpose registers.
• fp - modify floating point registers.
• reg_name - modify a specific register, by name.
mr will prompt for input if a register name was specified, or will prompt for input
until a “.” is entered.
mdvb/mdpb/mdvh/mdph/mdvd/mdpd sub commands
These subcommands are available to write to IO space memory. To avoid bad side
effects, memory is not read first; only the specified write is performed, with
translation enabled.
Access can be in bytes, half words, words or double words.
Address can be an effective address or a real address.
The subcommands mdvb, mdvh, mdvw and mdvd can be used to access these areas in
translated mode, using an effective address already mapped. The subcommands mdpb,
mdph, mdpw and mdpd can be used to access these areas in translated mode, using a
physical address that will be mapped. On 64-bit machines, correctly aligned double
words are accessed (mdpd and mdvd) in a single store instruction. The DBAT interface
is used to translate the address in cache-inhibited mode (PowerPC only).
These subcommands accept the following parameters:
• Address - address of the memory to modify. This can either be a virtual
(effective) or physical address, dependent on the subcommand used. Symbols,
hexadecimal values, or hexadecimal expressions can be used in specification of
the address.
These sub commands will prompt for input until a “.” is entered.
examples
# uname -a <== get utsname structure
oc3b42
# ctrl-\ <== enter the kdb
KDB(0)> dd utsname 6 <== display 6 double word of utsname
utsname+000000: 4149580000000000 0000000000000000 AIX.............
utsname+000010: 0000000000000000 0000000000000000................
utsname+000020: 6F63336234320033 3443303000000000 oc3b42.34C00....
KDB(0)> mw utsname+000020
utsname+000020: 6F633362 = 616c6578
utsname+000024: 34320033 =.
KDB(0)> dw utsname 12 <== display 12 words of utsname
utsname+000000: 41495800 00000000 00000000 00000000 AIX.............
utsname+000010: 00000000 00000000 00000000 00000000................
utsname+000020: 616C6578 34320033 34433030 00000000 alex42.34C00....
utsname+000030: 00000000 00000000 00000000 00000000................
utsname+000040: 30000000 00000000                   0.......
KDB(0)>q
# uname -a <== now let see what we did
AIX alex42 0 5 000714834C00
KDB trace sub commands
Introduction
The following table shows the trace sub commands and their matching crash/lldb sub
commands, when available:

trace function           crash/lldb   KDB   kdb
set/list trace point     loop         bt    N/A
clear trace point        -            ct    N/A
clear all trace points   -            cat   N/A

bt sub command
The trace point subcommand bt can be used to trace each execution of a specified
address. Each time a trace point is encountered during execution, a message is
displayed indicating that the trace point has been encountered. The displayed
message indicates the first entry from the stack.
The bt sub command accepts the following parameters:
• -p - flag to indicate that the trace address is a real address.
• -v - flag to indicate that the trace address is a virtual address.
• Address - address of the trace point. This may either be a virtual (effective) or
physical address. Symbols, hexadecimal values, or hexadecimal expressions
may be used in specifying an address.
• script - a list of subcommands to be executed each time the indicated trace
point is executed. The script is delimited by quote (") characters and commands
within the script are delimited by semicolons (;).
The bt sub command can also use a test parameter to break at the specified address
only if the test condition is true.
The conditional test requires two operands and a single operator. Values that can
be used as operands in a test subcommand include symbols, hexadecimal values,
and hexadecimal expressions. Comparison operators that are supported include:
==, !=, >=, <=, >, and <.
Additionally, the bitwise operators ^ (exclusive OR), & (AND), and | (OR) are
supported.
When bitwise operators are used, any non-zero result is considered to be true.
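For example, a conditional trace point might be set like this (illustrative; the
register and the value shown are hypothetical):

KDB(0)> bt open "r3==20008B88"   <== trace open only when r3 holds this value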
ct/cat sub command
The cat and ct sub commands erase all and individual trace points, respectively.
The trace point cleared by the ct subcommand may be specified either by a slot
number or an address. These sub commands accept the following arguments:
• -p: flag to indicate that the trace address is a real address.
• -v: flag to indicate that the trace address is a virtual address.
• slot: slot number for a trace point. This argument must be a decimal value.
• Address: address of the trace point. This may either be a virtual (effective) or
  physical address. Symbols, hexadecimal values, or hexadecimal expressions may be
  used in specifying an address.
examples
The following example shows the use of the trace sub commands:
# <== ctrl-\ to enter the KDB from a native serial port
Debugger entered via keyboard.
.waitproc_find_run_queue+00006C srwi r29,r31,3
<00000000> r29=F1000097140A011C,r31=0
KDB(0)> bt open <== add a trace point at open() address
.open+000000 (sid:00000000) trace {hit: 0}
KDB(0)> q <== exit the debugger
# ls <== run some command to call open
[0][00387D04]open+000000 (0000000020008B88, 0000000000000000,
00000000000001B6 [??])
[0][00387D04]open+000000 (0000000020000CA4, 0000000000000000,
00000000F00A0810 [??])
.bash_history  dev         lpp          sbin         u
.bashrc        etc         lpp_name     scratch      unix
.sh_history    home        mnt          smit.log     usr
.xerrors       j2          opt          smit.script  var
audit          lib         proc         tftpboot
bin            lost+found  qd0          tmp
# <== ctrl-\ to enter the KDB from a native serial port
KDB(0)> bt open "dr" <== will run dr when open is entered
.open+000000 (sid:00000000) trace {hit: 0}
KDB(0)> q <== exit the debugger
# ls <== run some command to call open
r0 : 00000000000090B2 r1 : F00000002FF3B390 r2 : 000000000046AC80
r3 : 0000000020008B88 r4 : 0000000000000000 r5 : 00000000000001B6
r6 : 0000000000000000 r7 : 0000000000000000 r8 : 000000001E821C00
r9 : 0000000000000000 r10 : 0000000011D3E8F0 r11 : F00000002FF3B400
r12 : F10000971E821C00 r13 : F10000971F1FF200 r14 : 0000000000000001
r15 : 000000002000D2A8 r16 : 000000002FF22D6C r17 : 00000000FFFFFFCB
r18 : 0000000000000001 r19 : 0000000000000000 r20 : 0000000020007680
r21 : 0000000000000000 r22 : 0000000000002CB6 r23 : 0000000000000000
r24 : 000000002FF229F0 r25 : 0000000000000014 r26 : 000000002000D2DC
r27 : 0000000000000000 r28 : 00000000F0061768 r29 : 00000000FFFFFFFF
r30 : 00000000D0054FAC r31 : 0000000000000000
KDB break point and step sub commands
Introduction
The following table shows the breakpoint and step sub commands and their matching
crash/lldb sub commands, when available:

breakpoint and step function   crash/lldb     KDB              kdb
set/list break point           break/breaks   b                N/A
set/list local break point     break/breaks   lb               N/A
clear local break point        clear          lc               N/A
clear break points             clear          c                N/A
clear all break points         clear          ca               N/A
go to end of function          -              r                N/A
go until address               -              gt               N/A
next instruction               step           n/nextis/stepi   N/A
step on bl/blr                 -              S                N/A
step on branch                 -              B                N/A

b/lb sub command
The b subcommand sets a permanent global breakpoint in the code. KDB checks
that a valid instruction will be trapped. If an invalid instruction is detected a
warning message is displayed. If the warning message is displayed the breakpoint
should be removed; otherwise, memory can be corrupted (the breakpoint has been
installed).
The lb sub command acts the same way as the b sub command except that the break
point is local to the thread or CPU, depending on set option 14.
The following arguments may be used with the b/lb sub commands:
• -p - flag to indicate that the breakpoint address is a real address.
• -v - flag to indicate that the breakpoint address is a virtual address.
• Address - address of the breakpoint. This may either be a virtual (effective) or
  physical address. Symbols, hexadecimal values, or hexadecimal expressions may be
  used in specification of the address.
c/lc/ca sub commands
c/lc and ca can be used to clear break points. The differences are:
• c will clear general break points
• lc will clear local break points
• ca will clear all break points
The c and lc sub commands use the following parameters:
• -p - flag to indicate that the breakpoint address is a real address.
• -v - flag to indicate that the breakpoint address is a virtual address.
• slot - slot number of the breakpoint. This argument must be a decimal value.
• Address - address of the breakpoint. This may either be a virtual (effective) or
  physical address. Symbols, hexadecimal values, or hexadecimal expressions may be
  used in specification of the address.
The lc may use this additional parameter:
• ctx - context to be cleared for a local break. The context may either be a CPU or
thread specification.
r/gt sub command
A non-permanent breakpoint can be set using the subcommands r and gt. These
subcommands set local breakpoints which are cleared after they have been hit.
The r subcommand sets a breakpoint on the address found in the lr register. In an
SMP environment, it is possible to hit this breakpoint on another processor, so it
is important to have a thread/process-local break point.
The gt subcommand performs the same as the r subcommand except that the breakpoint
address must be specified.
r and gt sub commands accept the following parameters:
• -p - flag to indicate that the breakpoint address is a real address.
• -v - flag to indicate that the breakpoint address is a virtual address.
• Address - address of the breakpoint. This may either be a virtual (effective) or
  physical address. Symbols, hexadecimal values, or hexadecimal expressions may be
  used in specification of the address.
n/s sub command
The two subcommands n and s provide step functions. The s subcommand allows
the processor to single step to the next instruction. The n subcommand also single
steps, but it steps over subroutine calls as though they were a single instruction.
The n/s sub commands accept the following parameter:
• count: specify how many steps are executed before returning to the KDB
prompt.
S/B sub commands
The S subcommand single steps but stops only on bl and blr instructions. With that,
you can see every call and return of routines. A count can also be used to specify
how many times KDB continues before stopping.
The B subcommand steps, stopping at each branch instruction.
The S/B sub commands accept the following parameter:
• count: specify how many steps are executed before returning to the KDB
prompt.
Example
The following example shows the use of break points:
# Debugger entered via keyboard.
.waitproc_find_run_queue+00006C srwi r29,r31,3
<00000000> r29=0,r31=0
KDB(0)> br open <== we set a break point on open.
.open+000000 (sid:00000000) permanent & global
KDB(0)> q <== we exit the kdb
# ls <== do some command that will certainly call open
Breakpoint <== open was called so we enter the KDB
.open+000000
li r6,0
<0000000000000000> r6=0
KDB(0)> s <== do one step
.open+000004 stdu stkp,FFFFFF80(stkp)
stkp=F00000002FF3B390,FFFFFF80(stkp)=F00000002FF3B310
KDB(0)> n <== an other one
.open+000008 mflr r0
<.sys_call_ret+000000>
KDB(0)> dis .open+000008 32 <== now let's find the following branch
.open+000008 mflr r0
.open+00000C extsw r4,r4
.open+000010 addi r7,stkp,70
.open+000014 std r0,90(stkp)
.open+000018 clrlwi r5,r5,0
.open+00001C bl <.copen> <== here it is
.open+000020 ori r0,r3,0
.open+000024 clrlwi r4,r3,0
KDB(0)> B <== this will break at the next branch, which should be open+1c
.open+00001C bl <.copen>
r3=0000000020008B88
KDB(0)> s <== we step that branch
.copen+000000 std r31,FFFFFFF8(stkp) r31=0,FFFFFFF8(stkp)=F00000002FF3B308
KDB(0)> dr lr <== let's see what is in the link register
lr : 0000000000387D24
.open+000020 ori r0,r3,0
<0000000020008B88> r0=000000000000377C,r3=0000000020008B88
KDB(0)> r <== break on the lr (we will return to the calling function)
.open+000020 ori r0,r3,0
<0000000000000000> r0=0000000000000030,r3=0
KDB(0)> ca <== clear all break points before leaving
KDB(0)> q <== exit the KDB
KDB name list/symbol sub commands
Introduction
The following table shows the name list/symbol sub commands and their matching
crash/lldb sub commands, when available:

name list symbol function    crash/lldb   KDB   kdb
translate symbol to eaddr    nm           nm    nm
no symbol mode (toggle)      hide         ns    ns
translate eaddr to symbol    ts/ds        ts    ts

nm sub command
The nm subcommand translates symbols to addresses. nm uses the following argument:
• symbol - symbol name.
ns sub command
The ns subcommand toggles symbolic name translation on and off.
This is equivalent to the set sub command option 1.
ts sub command
The ts subcommand translates addresses to symbolic representations. ts uses the
following argument:
• Address - effective address to be translated. This argument may be a
hexadecimal value or expression.
examples
(0)> nm kdb_avail <== display addresses for the kdb_avail symbol
Symbol Address: 0046AE70
TOC Address: 0046AC80
(0)> set 1 <== turn address translation off
Symbolic name translation off
(0)> ts 046AE70 <== get symbol for 046AE70
0046AE70 <== didn’t get it because address translation is turned off
(0)> ns <== turn address translation back on
Symbolic name translation on
(0)> ts 046AE70 <== now we should get the symbol
kdb_avail+000000
KDB watch break point sub commands
Introduction
The following table shows the watch break point sub commands and their matching
crash/lldb sub commands, when available:

watch break point function   crash/lldb   KDB    kdb
stop on read data            watch        wr     N/A
stop on write data           watch        ww     N/A
stop on r/w data             watch        wrw    N/A
local stop on read data      watch        lwr    N/A
local stop on write data     watch        lww    N/A
local stop on r/w data       watch        lwrw   N/A
clear watch                  cw           cw     N/A
local clear watch            lcw          lcw    N/A

wr, ww, wrw, lwr, lww, lwrw, cw and lcw sub commands
On the PowerPC architecture, a watch register (the DABR, Data Address Breakpoint
Register, or HID5 on the Power 601) can be used to enter KDB when a specified
effective address is accessed. The register holds a double-word effective address
and bits to specify load and/or store operation.
The watch break points can be used with the following rules:
• wr and lwr will break on read
• ww and lww will break on write
• wrw and lwrw will break on read or write
• wr, ww and wrw will break in any context
• lwr, lww and lwrw will break on a specific CPU
• cw and lcw will clear general or local watch break points
wr, ww, wrw, lwr, lww, lwrw, cw and lcw accept the following arguments:
• -p: flag indicating that the address argument is a physical address.
• -v: flag indicating that the address argument is a virtual address.
• -e: flag indicating that the address argument is an effective address.
• Address: address to be watched. Symbols, hexadecimal values, or hexadecimal
expressions can be used in specification of the address.
• size: indicates the number of bytes that are to be watched. This argument is a
decimal value.
examples
KDB(0)> wr utsname 3 <== set a break on read of utsname for 3 bytes
CPU 0: utsname+000000 eaddr=001CB9C8 size=3 hit=0 mode=R Xlate ON
CPU 1: utsname+000000 eaddr=001CB9C8 size=3 hit=0 mode=R Xlate ON
KDB(0)> q <== exit the debugger
# uname -a <== run some command that will read the utsname
Watch trap: 001CB9C8 <utsname+000000>
.umem_move+000030 lbzx r7,r6,r3 r7=000000000000B6B4, r6=0, r3=00000000001CB9C8
KDB(0)> wr <== verify the number of hits -------v
CPU 0: utsname+000000 eaddr=001CB9C8 size=3 hit=1 mode=R Xlate ON
CPU 1: utsname+000000 eaddr=001CB9C8 size=3 hit=1 mode=R Xlate ON
KDB(0)> cw <== clear watch break points
KDB(0)> lwr utsname <== now set a local watch break point (only cpu 0)
CPU 0: utsname+000000 eaddr=001CB9C8 size=8 hit=0 mode=R Xlate ON
KDB(0)> lcw <== clear local watch break points
KDB(0)> q <== exit kdb, will resume the current thread
AIX oc3b42 0 5 000714834C00
KDB machine status sub commands
Introduction
The following table lists the machine status sub commands and their matching crash/lldb sub commands, when available:

function                crash/lldb            KDB    kdb
system status message   stat/reason/sysinfo   stat   stat

stat sub command
The stat subcommand displays system statistics, including the last kernel printf() messages still in memory. The following information is displayed for a processor that has crashed:
• Processor logical number
• Current Save Area (CSA) address
• LED value
For the KDB Kernel Debugger this subcommand also displays the reason why the
debugger was entered. There is one reason per processor.
example
KDB(0)> stat
SYSTEM_CONFIGURATION:
POWER_PC POWER_630 machine with 2 cpu(s) (64-bit registers)
SYSTEM STATUS:
sysname... AIX
nodename.. oc3b42
release... 0
version... 5
machine... 000714834C00
nid....... 0714834C
Debugger entered via keyboard.
age of system: 18 hr., 8 min., 13 sec.
xmalloc debug: enabled
Debug kernel error message: No debug cause was specified.
SYSTEM MESSAGES:
AIX Version 5.0
Starting NODE#000 physical CPU#002 as logical CPU#001... done.
kmod_load failed for psekdb
All Rights Reserved (C) Copyright Platypus Technology International Holdings Limited
qik_alert: Unit is not ready!
init 0
.?.?.?.?.?.?.?.?.?.?
ERROR LOG: for mtn_get_adpt_info, location= 1 0, 1105A, DEAFDEAF
.?.?.?.?.?.?.?.?.?.?.qik_alert: Unit is not ready!
init 2
.?!.?.?.?.?.?.?.?.?.?.! J2 Bring Up: gfs:0x00000001
Number of CPUs: 2
L1 Data Cache Line Size: 128
System Memory Size: 512 MByte
VMM minPageReadAhead:2 maxPageReadAhead:8
nCacheClass:5
iCache: inodeSize:888(vode:88,inode:800(gnode:104,dinode:512))
iCache: nInode:52225 nCacheClass:5 nHashClass:8192
nCache: nName:65536 nHashClass:8192
jCache: nBuffer:5120 bufferHeaderSize(176:208)
jCache: nCacheClass:5 nBufferPerCacheClass:1024
vmPager: nBufferPerPagerDevice:512
txCache: nTxBlock:1024 txBlockSize:88
txCache: nTxLock:57400 txLockSize:72 lockShortage:53813
j2_debug: Error Log Table j2Error:0x003F5580
j2_debug: Event Trace Table j2Trace:0x003F9588
J2 Bring UP Complete.
j2_mount: Mount Failure: File System Dirty.
lockd: cannot contact statd(), continuing<- end_of_buffer
KDB(0)> q
KDB kernel extension loader sub commands
Introduction
The following table lists the kernel extension loader sub commands and their matching crash/lldb sub commands, when available:

function                    crash/lldb   KDB    kdb
list loaded extensions      le           lke    lke
list loaded symbol tables   map          stbl   stbl
remove symbol table                      rmst   rmst
list export tables                       exp    exp

lke and stbl sub commands
The subcommands lke and stbl can be used to display the current state of loaded kernel extensions, using the following parameters:
• -l: list the current entries in the name list cache.
• Address: effective address for the text or data area for a loader entry. The
specified entry is displayed and the name list cache is loaded with data for that
entry. The Address can be specified as a hexadecimal value, a symbol, or a
hexadecimal expression.
• -a addr: display and load the name list cache with the loader entry at the
specified address. The Address can be a hexadecimal value, a symbol, or a
hexadecimal expression.
• -p pslot: display the shared library loader entries for the process slot indicated.
The value for pslot must be a decimal process slot number.
• -l32: display loader entries for 32-bit shared libraries.
• -l64: display loader entries for 64-bit shared libraries.
• slot: slot number. The specified value must be a decimal number.
rmst sub command
A symbol table can be removed from KDB using the rmst subcommand. This
subcommand requires that either a slot number or the effective address for the
loader entry of the symbol table be specified.
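For instance, a minimal hypothetical sequence (the slot number is a placeholder, not taken from a real session):
(0)> stbl <== list the loaded symbol tables and their slot numbers
(0)> rmst 3 <== remove the symbol table loaded in slot 3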
exp sub command
The exp subcommand can be used to look for an exported symbol or to display the
entire export list. If no argument is specified the entire export list is printed. If a
symbol name is specified as an argument, then all symbols which begin with the
input string are displayed.
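A hypothetical invocation (the symbol prefix is illustrative; output omitted):
(0)> exp v_ <== display all exported symbols whose names begin with v_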
examples
(0)> nm kbdconfig <== get address for symbol kbdconfig
Symbol Not Found <== not found because it is in a kernext not yet in the cache
(0)> lke -l <== the cache is empty
KERNEXT FUNCTION NAME CACHE empty
(0)> lke <== list kernel extensions
.
.
21 01978B00 01AE9000 000063D0 00080262 msedd_chrp64/usr/lib/drivers/isa/msedd_chrp
22 01978900 01ACA000 00008F68 00080262 kbddd_chrp64/usr/lib/drivers/isa/kbddd_chrp
(0)> lke 22 <== load kernext into the cache
ADDRESS FILE FILESIZE FLAGS MODULE NAME
22 01978900 01ACA000 00008F68 00080262 kbddd_chrp64/usr/lib/drivers/isa/kbddd_chrp
le_flags....... TEXT DATAINTEXT DATA DATAEXISTS 64
le_next........ 01978A00 le_svc_sequence 66666666
le_fp.......... 00000000
le_filename.... 01978988 le_file........ 01ACA000
le_filesize.... 00008F68 le_data........ 01AD2100
le_tid......... 01AD2100 le_datasize.... 00000E68
le_usecount.... 00000002 le_loadcount... 00000002
le_ndepend..... 00000001 le_maxdepend... 00000001
le_ule......... 00000000 le_deferred.... 00000000
le_exports..... 00000000 le_de.......... 6666666666666666
le_searchlist.. 00000000 le_dlusecount.. 00000000
le_dlindex..... FFFFFFFF le_lex......... 00000000
le_fh.......... 00000000 le_depend.... @ 01978980
TOC@........... 01AD2C50
<PROCESS TRACE BACKS>
.ureg_pm 01ACA1C0
.reg_pm 01ACA25C
.qvpd 01ACA3B4
.initadpt 01ACA520
.cleanup 01ACA754
.kbdconfig 01ACA924
.
.
(0)> lke -l <== see if it was loaded correctly
KERNEXT FUNCTION NAME CACHE
.ureg_pm 01ACA1C0
.reg_pm 01ACA25C
.qvpd 01ACA3B4
.initadpt 01ACA520
.cleanup 01ACA754
.kbdconfig 01ACA924
.
.
(0)> nm kbdconfig <== now see if we find the address for the symbol
Symbol Address : 01ACA924
TOC Address : 01AD2C50
KDB address translation sub commands
Introduction
The following table lists the address translation sub commands and their matching crash/lldb sub commands, when available:

function                    crash/lldb   KDB   kdb
translate to real address   xlate        tr    tr
display MMU translation     xlate        tv    tv

tr and tv sub commands
The tr and tv sub commands can be used to display address translation information. The tr sub command provides a short format; the tv sub command a detailed one.
For the tv subcommand, all double-hashed entries are dumped; when an entry matches the specified effective address, the corresponding physical address and protections are displayed. Page protection (K and PP bits) is displayed according to the current segment register and machine state register values.
tr and tv take the following argument :
• Address - effective address for which translation details are to be displayed. Symbols, hexadecimal values, or hexadecimal expressions can be used in specification of the address.
examples
(0)> tr @iar <== display the physical address of the current instruction
Physical Address = 000000000002CB58
(0)> tv @iar <== display the physical mapping of the current instruction
eaddr 000000000002CB58 sid 0000000000000000 vpage 000000000000002C hash1 0000002C
p64pte_cur_addr 0000000001001600 sid 0000000000000000 avpi 00 hsel 0 valid 1
rpn 000000000000002C refbit 1 modbit 0 wimg 2 key 3
____ 000000000002CB58 ____ K = 0 PP = 11 ==> read only
eaddr 000000000002CB58 sid 0000000000000000 vpage 000000000000002C hash2 0000FFD3
Physical Address = 000000000002CB58
(0)>
KDB process/thread sub commands
Introduction
The following table lists the process/thread sub commands and their matching crash/lldb sub commands, when available:

function                     crash/lldb   KDB         kdb
display per processor data   ppd          ppda        ppda
display interrupt handler                 intr        intr
display mst area             mst/tcb      mst         mst
display process table        proc         proc        proc
display thread table         th           th          th
display thread tid           th           ttid        ttid
display thread pid                        tpid        tpid
display user area            user         user        user
switch thread                cm           sw/switch   sw/switch

ppda sub command
The ppda sub command displays per processor data areas with the following conditions :
• no argument : display the current processor data area.
• * : display a summary for all CPUs.
• cpu : display the ppda data for the specified CPU. This argument must be a
decimal value.
• Address : effective address of a ppda structure to display. Symbols,
hexadecimal values, or hexadecimal expressions can be used in specification of
the address.
intr sub command
The intr sub command prints entries in the interrupt handler table with the
following conditions :
• no arguments : display a summary of all entries in the interrupt handler table.
• slot : slot number in the interrupt handler table. This value must be a decimal
value.
• Address : effective address of an interrupt handler. Symbols, hexadecimal
values, or hexadecimal expressions can be used in specification of the address.
Continued on next page
-50 of 128 AIX 5L Internals
Version 20001015
Course materials may not be reproduced in whole or in part
without the prior written permission of IBM.
© Copyright IBM Corp. 2000
Draft Version for review, Sunday, 15. October 2000, introSystemdump.fm
Guide
KDB process/thread sub commands
mst sub
command
-- continued
The mst sub command prints the Machine State Save Area for :
• the current context : if no argument is provided
• slot : thread slot number. This value must be a decimal value.
• Address : effective address of an mst to display. Symbols, hexadecimal values,
or hexadecimal expressions can be used in specification of the address.
proc sub command
The proc subcommand displays process table entries using :
• * : display a summary for all processes.
• -s flag : display only processes with a process state matching that specified by
flag. The allowable values for flag are: SNONE, SIDLE, SZOMB, SSTOP,
SACTIVE, and SSWAP.
• slot : process slot number. This value must be a decimal value.
• Address : effective address of a process table entry. Symbols, hexadecimal
values, or hexadecimal expressions can be used in specification of the address.
th sub command
The th subcommand displays thread table entries using :
• no argument : the current thread is displayed.
• * : display a summary for all thread table entries.
• -w flag : display a summary of all thread table entries with a wtype matching
the one specified by the flag argument. Valid values for the flag argument
include: NOWAIT, WEVENT, WLOCK, WTIMER, WCPU, WPGIN,
WPGOUT, WPLOCK, WFREEF, WMEM, WLOCKREAD, WUEXCEPT, and
WZOMB.
• slot : thread slot number. This must be a decimal value.
• Address : effective address of a thread table entry. Symbols, hexadecimal values, or hexadecimal expressions can be used in specification of the address.
ttid and tpid sub commands
ttid and tpid respectively display :
• the thread table entry for a given thread id.
• the thread table entry for a given process id.
user sub command
The user subcommand displays u-block information for :
• no argument : the current process
• slot : slot number of a thread table entry. This argument must be a decimal
value.
• Address : effective address of a thread table entry. Symbols, hexadecimal
values, or hexadecimal expressions can be used in specification of the address.
The following parameters can be used to reduce the output from user :
• -ad : display adspace information only.
• -cr : display credential information only.
• -f : display file information only.
• -s : display signal information only.
• -ru : display profiling/resource/limit information only.
• -t : display timer information only.
• -ut : display thread information only.
• -64 : display 64-bit user information only.
• -mc : display miscellaneous user information only.
sw sub command
By default, KDB shows the virtual space for the current thread. The sw
subcommand allows selection of the thread to be considered the current thread.
Threads can be specified by slot number or address. The current thread can be
reset to its initial context by entering the sw subcommand with no arguments.
For the KDB Kernel Debugger, the initial context is also restored whenever
exiting the debugger.
sw will use the following arguments :
• u : flag to switch to user address space for the current thread.
• k : flag to switch to kernel address space for the current thread.
• th_slot : specifies a thread slot number. This argument must be a decimal value.
• th_Address : address of a thread slot. Symbols, hexadecimal values, or hexadecimal expressions can be used in specification of the address.
examples
(0)> ppda * <== display all ppda summary
SLT CSA CURTHREAD SRR0
ppda+000000 0 F00000002FF3B400 KERN_heap+40ABC00 D0059E18
ppda+001000 1 F00000002FF3B400 KERN_heap+E8FBC00 000000010000120C
(0)> ppda <== display ppda for current cpu (0)
Per Processor Data Area [0014ED80]
csa..............F00000002FF3B400 mstack...........0000000000838DF8
fpowner..........0000000000000000 curthread........F1000097140ABC00
syscall..........000000000008202E intr.............0000000000000000
i_softis.....................0000 i_softpri....................0000
prilvl...........F1000097140C1600 worst_run_pri................00FF
run_pri........................FF
v_pnda...........00000000001FC570
cpunidx......................0000
ppda_pal[0]..............00000000 ppda_pal[1]..............00000000
ppda_pal[2]..............00000000 ppda_pal[3]..............00000000
phy_cpuid....................0000 sradid.......................0000
slb_reload_index.............0000 ppda_fp_cr...............00000000
flih save[0].....0000000020000000 flih save[1].....000000000001E10C
flih save[2].....A000000000009032 flih save[3].....0000000000000000
flih save[4].....0FFFFFFFF3FFFE80 flih save[5].....000000000046AC80
flih save[6].....0000000000000000 flih save[7].....0000000000000000
flih save[8].....0000000000000000 flih save[9].....0000000000000000
flih save[10].....0000000000000000
usegp............0000000000000000 srflag...........7000000000000000
srsave[0]........000000000000736F srsave[1]........000000000000736F
srsave[2]........0000000000000000 srsave[3]........0000000000000000
srsave[4]........0000000000000000
gsegs[0].eaddr...0000000000000000 gsegs[0].vsid....0000000000000000
gsegs[1].eaddr...0000000000000000 gsegs[1].vsid....0000000000000000
gsegs[2].eaddr...0000000000000000 gsegs[2].vsid....0000000000000000
gsegs[3].eaddr...0000000000000000 gsegs[3].vsid....0000000000000000
Useracc addr.........0000000000000000
Useracc size.........0000000000000000
dsisr....................42000000 dsi_flag.................00000003
dar..............0000000020010920
dssave[0]........0000000000000020 dssave[1]........000000002FF226F0
dssave[2]........00000000F009E9BC dssave[3]........000000002000F8E0
dssave[4]........00000000F0046E28 dssave[5]........0000000000000000
dssave[6]........0000000000000000 dssave[7]........00000000200454E0
dssrr0...........00000000D0052904 dssrr1...........200000000000D0B2
dssprg1..........000000002FF22D54 dsctr............0000000002155980
dslr.............000000000038F248 dsxer....................20000008
dsmq.....................00000000 pmapstk..........00000000001CF8D0
pmapsave64.......0000000000000000 pmapcsa..........0000000000000000
schedtail[0].....0000000000000000 schedtail[1].....0000000000000000
schedtail[2].....0000000000000000 schedtail[3].....0000000000000000
cpuid........................0000 stackfix.......................00
lru............................00 vmflags..................00000000
sio............................00 reservation....................00
hint...........................00 no_vwait.......................00
lock.....................00000000
scoreboard[0]....0000000000000000 scoreboard[1]....0000000000000000
scoreboard[2]....0000000000000000 scoreboard[3]....0000000000000000
scoreboard[4]....0000000000000000 scoreboard[5]....0000000000000000
scoreboard[6]....0000000000000000 scoreboard[7]....0000000000000000
intr_res1................00000000 intr_res2................00000000
mpc_pend.................00000000 iodonelist.......0000000000000000
run_queue........F1000097140A1000 global_run_queue.F1000097140A0118
ppda_timer.... @ 000000000014F0B0 decompress.......0000000000000000
TB_ref_u.................01580CBC TB_ref_l.................40000000
sec_ref..................39B7F005 nsec_ref.................0C3B4A07
_ficd....................00000000 icndx........................07F7
ppda_qio.................00000000 cs_sync..................00000000
perfmon_sv[0]....0000000000000000 perfmon_sv[1]....0000000000000000
thread_private...........00000000 cpu_priv_seg.....0000000000000000
ri_flih_paddr....0000000000F28F00 ri_save6.........0000000000000000
util_start_time..........00000000 util_accumulator.........00000000
ppda_ha_event....0000000000000000 ppda_ha_fun......0000000000000000
ppda_ha_arg......0000000000000000
wp_available.............00000001
frs_id.......................0000 memp_id........................00
newprivseg...............00000000
trace vectors. @ 000000000014F1F0 ppda_trcbufp0....0000000000000000
wlm_cpulocal_data F100009716320000
WLM (Only non-null slots are shown)........
Slot time npages
ppda_dseg_count..0000000000000000 ppda_iseg_count..0000000000000000
ppda_emul_tptr...0000000000000000 ppda_align_iar...000000000000B658
ppda_align_tptr..F1000097165A2A00 ppda_align_ea....F1000082C01BC926
ppda_emul_iar....0000000000000000 ppda_emul_count..........00000000
ppda_align_count.........00451303 radindex...... @ 000000000014EE84
TIMER....................
t_free...........F10000971E87D200 t_active.........F100009713FF3100
t_freecnt................00000001 trb_called.......0000000000000000
trb_lock...... @ 000000000014F0D0 trb_lock.........0000000000000000
systimer.........F100009713FF3100 ticks_its................00000042
ref_time.tv_sec..0000000039B7F006 ref_time.tv_nsec.........0EA6319F
time_delta.......0000000000000000 time_adjusted....F100009713FF3100
wtimer.next......F100009716458180 wtimer.prev......F10000971ECD42D0
wtimer.func......0000000000203F80 wtimer.count.....0000000000000000
wtimer.restart...0000000000000000 w_called.........0000000000000000
watchdog_lock. @ 000000000014F138 watchdog_lock....0000000000000000
KDB......................
kdb_ppda_r0......0000000000000001 kdb_ppda_r1......000000002FF228B0
kdb_ppda_r2......00000000F01951F4 kdb_ppda_r15.....000000002FF22D54
kdb_ppda_srr0....00000000D043AB18 kdb_ppda_srr1....200000000004D0B2
flih_save................22282229 proc_state...............0000000B
csa..............0000000000CD8A88
ri_flih_paddr....0000000000F28F00 ri_r6............0000000000000000
(0)> intr <== display the interrupt handler table
SLT INTRADDR HANDLER TYPE LEVEL PRIO BID FLAGS
i_data+0000E8 5 F1000097140B0FC0 00000000 0004 00000004 0003 900000C0 0050
i_data+0000E8 5 F10000971ECD4000 019EA5C0 0004 0000000D 0003 900000C0 0050
.
.
(0)> mst <== display the current mst
Machine State Save Area
iar : 000000000002CB58 msr : A0000000000010B2 cr : 28442224
lr : 0000000000000000 ctr : 00000000003E2150 xer : 20000000
mq : FFFFFFFF asr : 0000000005622001
r0 : 0000000044484244 r1 : F00000002FF3B200 r2 : 000000000046AC80
r3 : 00000000003356E4 r4 : A0000000000090B2 r5 : F1000097163BF301
r6 : F00000002FF3AF40 r7 : 0000000000000105 r8 : 000000000014FD80
r9 : 0000000000000001 r10 : 00000000000021B6 r11 : 0000000000000105
r12 : 000000000020CDD0 r13 : F1000097140AB600 r14 : 0000000000000004
r15 : 0000000011000081 r16 : 0000000070000080 r17 : 0000000000000001
r18 : 0000000000000003 r19 : 0000000000000000 r20 : 00000000FFFEFBFF
r21 : F1000097140AB778 r22 : 0000000048242224 r23 : 0000000000000000
r24 : 0000000000000000 r25 : 000000000000000B r26 : 0000000000000000
r27 : F100008080000280 r28 : F100008090000080 r29 : F1000097140C1A00
r30 : F1000097140AB600 r31 : 0000000000000004
s0 : 0000000000000000 s1 : 000000000FFFFFFF s2 : 000000000FFFFFFF
s3 : 000000000FFFFFFF s4 : 000000000FFFFFFF s5 : 000000000FFFFFFF
s6 : 000000000FFFFFFF s7 : 000000000FFFFFFF s8 : 000000000FFFFFFF
s9 : 000000000FFFFFFF s10 : 000000000FFFFFFF s11 : 000000000FFFFFFF
s12 : 000000000FFFFFFF s13 : 000000000FFFFFFF s14 : 000000000FFFFFFF
s15 : 000000000FFFFFFF
prev 0000000000000000 stackfix F00000002FF3B200
kjmpbuf 0000000000000000 excbranch 0000000000000000
intpri 00 backt 00 flags 00
fpscr 0000000000000000 fpscrx 00000000 fpowner 00
fpeu 00 fpinfo 00 alloc F000 ptaseg F100000050000000
o_iar 0000000000000000 o_toc 0000000000000000
o_arg1 0000000000000000 o_vaddr 0000000000000000
Except :
csr 0000000000000000 dsisr 0000000042000000 bit set: DSISR_PFT DSISR_ST
esid 000000002000796E dar F10000971F15700C dsirr 0000000000000106
(0)> p * -s SACTIVE <== display all active processes
SLOT NAME STATE PID PPID ADSPACE CL #THS
pvproc+000000 0 swapper ACTIVE 0000000 0000000 0000000000000B00 0 0001
pvproc+000280 1 init ACTIVE 0000001 0000000 000000000000E2FD 0 0001
pvproc+000500 2 wait ACTIVE 0000204 0000000 0000000000001B02 0 0001
pvproc+000780 3 wait ACTIVE 0000306 0000000 0000000000002B04 0 0001
pvproc+000A00 4 lrud ACTIVE 0000408 0000000 0000000000003B06 65 0001
pvproc+000C80 5 xmgc ACTIVE 000050A 0000000 000000000000BB16 65 0001
pvproc+000F00 6 netm ACTIVE 000060C 0000000 000000000000CB18 65 0001
pvproc+001180 7 gil ACTIVE 000070E 0000000 000000000000DB1A 65 0005
.
.
(0)> th -w NOWAIT <== display all threads that wait for nothing
SLOT NAME STATE TID PRI RQ CPUID CL WCHAN
pvthread+000180 3!wait RUN 000307 0FF 1 00001 0
(0)> th 3 <== now display details on thread 3
SLOT NAME STATE TID PRI RQ CPUID CL WCHAN
pvthread+000180 3>wait RUN 000307 0FF 1 00001 0
NAME................ wait
FLAGS............... KTHREAD
WTYPE............... WCPU
.................tid :0000000000000307 ......tsleep :FFFFFFFFFFFFFFFF
...............flags :00001000 ..............flags2 :00000000
DATA.........pvprocp :F100008080000780 <pvproc+000780>
LINKS.....prevthread :F100008090000180 <pvthread+000180>
..........nextthread :F100008090000180 <pvthread+000180>
DISPATCH.......synch :FFFFFFFFFFFFFFFF
SCHEDULER...affinity :00000001 .................pri :000000FF
.............boosted :00000000 ...............wchan :0000000000000000
...............state :00000002 ...............wtype :00000004
CHECKPOINT......vtid :00000000
LOCK........ lock_d @ F100008090000190 0000000000000000
PROCFS......procfsvn :0000000000000000
THREAD.......threadp :F1000097140AB000 ........size :00000080
FLAGS............... SIGAVAIL KTHREAD FUNNELLED SIGSLIH SIGINTR
.................tid :0000000000000307 ......stackp :0000000000000000
.................scp :0000000000000000 .......ulock :0000000000000000
...............uchan :0000000000000000 ....userdata :0000000000000000
..................cv :0000000000000000 .......flags :0000000000003004
..............atomic :0000000000000000 ......flags2 :0000000000000000
DATA...........procp :F1000097140ABE00 <KERN_heap+40ABE00>
...........pvthreadp :F100008090000180 <pvthread+000180>
...............userp :F00000002FF3B898 <__ublock+000498>
............uthreadp :F00000002FF3B400 <__ublock+000000>
SLEEP/LOCK......usid :0000000000000000 ......wchan1 :0000000000000000
..............wchan2 :0000000000000000 ......swchan :0000000000000000
...........eventlist :0000000000000000 ......result :00000000
.............polevel :00000000 ..............pevent :0000000000000000
..............wevent :0000000000000000 .......slist :0000000000000000
...........wchan1sid :0000000000000000 wchan1offset :00000000
...........lockcount :00000000 ..........adsp_flags :0000
DISPATCH.......ticks :0000BC2C ...............prior :F1000097140AB000
................next :F1000097140AB000 ......dispct :00000000008B4EF3
...............fpuct :0000000000000000
MISC........graphics :0000000000000000 ...pmcontext :0000000000000000
...........lockowner :0000000000000000 ..kthreadseg :0000000107FFFFFF
..........time_start :0000000000000000 ..........wlm_charge :0
SIGNAL........sigproc:00000000 ..............cursig :00000000
......(pending) sig :[3] 0000000000000000 .................[2] 0000000000000000
......................[1] 0000000000000000 .................[0] 0000000000000000
............sigmask :[3] 0000000000000000 .................[2] 0000000000000000
......................[1] 0000000000000000 .................[0] 0000000000000000
SCHEDULER......cpuid :00000001 ..............scpuid :00000001
.........affinity_ts :0006A57F ..............policy :00000001
.................cpu :00000078 .............lockpri :00000000
.............wakepri :000000FF ................time :00000000
.............sav_pri :000000FF ...........run_queue :F1000097140A2000
................cpu2 :00000078
.............suspend :00000001 .............fsflags :00000000
..........norun_secs :00000000
CHECKPOINT..chkerror :0000
............chkblock :00000000
PROCFS.......whystop :00000000 ............whatstop :00000000
..............weight :00000008 ........allowed_cpus :C0000000
.......prefunnel_cpu :00000000
......threadcontrolp :0000000000000000 ...........controlvm :0000000000000000
PVTHREAD...pvthreadp :F100008090000180 ........size :00000080
(0)> ttid 70e <== now display the threads for gil (70e), which should have 5 threads
SLOT NAME STATE TID PRI RQ CPUID CL WCHAN
pvthread+000380 7 gil SLEEP 00070F 025 1 65
pvthread+000580 11 gil SLEEP 000B17 025 1 65 netisr_servers
pvthread+000500 10 gil SLEEP 000A15 025 1 65 netisr_servers
pvthread+000480 9 gil SLEEP 000913 025 1 65 netisr_servers
pvthread+000400 8 gil SLEEP 000811 025 1 65 netisr_servers
(0)> user -ad 5 <== display address space for thread 5
User-mode address space mapping:
segs32_raddr.0000000000000000
uadspace node allocation......(U_unode) @ F00000002FF3E028
usr adspace 32bit process.(U_adspace32) @ F00000002FF3E048
segment node allocation.......(U_snode) @ F00000002FF3E008
segnode for 32bit process...(U_segnode) @ F00000002FF3E2A8
U_adspace_lock @ F00000002FF3E4E8
lock_word.....0000000000000000 vmm_lock_wait.0000000000000000
V_USERACC strtaddr:0x0000000000000000 Size:0x0000000000000000
vmmflags......00000000
(0)> sw 5 <== switch to the thread 5
Switch to thread: <pvthread+000280>
(0)> tpid <== display the current thread, which should be slot 5
SLOT NAME STATE TID PRI RQ CPUID CL WCHAN
pvthread+000280 5*xmgc SLEEP 00050B 03C 1 65 KERN_heap+ECD5730
(0)> sw <== switch back to initial thread
Switch to initial thread: <pvthread+001200>
(0)> tpid <== display the current thread, which should be the initial pvthread+001200
SLOT NAME STATE TID PRI RQ CPUID CL WCHAN
pvthread+001200 36*kdb_64 RUN 002467 03C 0 0
KDB Kernel stack sub commands
Introduction
The following table lists the kernel stack sub commands and their matching crash/lldb sub commands, when available:

function               crash/lldb   KDB   kdb
trace a kernel stack   fs           f     f

f sub command
The f sub command displays all the stack frames from the current instruction, as deep as possible. Interrupts and system calls are crossed, and the user stack is also displayed. In user space, the trace back allows display of symbolic names. The amount of data displayed may be controlled through the mst_wanted and display_stacked_frames options of the set sub command. You can also request to see the stacked registers using the display_stacked_regs set option.
The f sub command can be invoked using the following :
• no argument : the stack for the current thread is displayed.
• +x : flag to include hex addresses as well as symbolic names for calls on the
stack. This option remains set for future invocations of the stack subcommand,
until changed via the -x flag.
• -x : flag to suppress display of hex addresses for functions on the stack. This
option remains in effect for future invocations of the stack subcommand, until
changed via the +x flag.
• tslot : decimal value indicating the thread slot number
• Address : hex address, hex expression, or symbol indicating the effective
address for a thread slot
examples
(0)> f +x <== display the stack frame for the current thread
pvthread+000380 STACK:
[0002CB58]et_wait+00036C (0000000000212A0C, A0000000000010B2, 0000000000122A0C [??])
[000EF170]netthread_start+0000B8 ()
[00060F6C]procentry+000010 (??, ??, ??, ??)
(0)> f -x <== display the stack frame without addresses
pvthread+000380 STACK:
et_wait+00036C (.backt+000000, A0000000000010B2, .v_prepin+000000 [??])
netthread_start+0000B8 ()
procentry+000010 (??, ??, ??, ??)
(0)> set 10 <== want to see the stacked registers
display_stacked_regs is true
(0)> f <== show the stack frame with stacked registers
pvthread+000380 STACK:
et_wait+00036C (.backt+000000, A0000000000010B2, .v_prepin+000000 [??])
r31 : 0000000000000000 r30 : 0FFFFFFFF0100000 r29 : 0000000000205E38
r28 : 00000000DEADBEEF r27 : 00000000DEADBEEF r26 : 00000000DEADBEEF
r25 : 00000000DEADBEEF r24 : 00000000DEADBEEF r23 : 00000000DEADBEEF
r22 : 00000000DEADBEEF r21 : 00000000DEADBEEF r20 : 00000000DEADBEEF
r19 : 00000000DEADBEEF r18 : 00000000DEADBEEF r17 : 00000000DEADBEEF
r16 : 00000000DEADBEEF r15 : 00000000DEADBEEF r14 : 00000000DEADBEEF
netthread_start+0000B8 ()
r31 : 00000000DEADBEEF r30 : 00000000DEADBEEF r29 : 00000000DEADBEEF
procentry+000010 (??, ??, ??, ??)
KDB LVM sub commands
Introduction
The following table lists the LVM sub commands and their matching crash/lldb sub commands, when available:

function                  crash/lldb   KDB      kdb
display physical buffer                pbuf     pbuf
display volume group                   volgrp   volgrp
display physical volume                pvol     pvol
display logical volume                 lvol     lvol

volgrp, pvol, lvol and pbuf sub commands
volgrp, pvol, lvol and pbuf will respectively display :
• volume group information (including lvol structures). volgrp addresses are registered in the devsw table, in the DSDPTR field.
• physical volume information. pvol addresses are registered within the volgrp structure.
• logical volume information. lvol addresses are registered within the volgrp and lvol structures.
• physical buffer information. pbuf addresses are registered within the volgrp and pvol structures.
All LVM sub commands take addresses as parameters.
examples
(0)> dev 0xa <== get the device switch table entry for a volume group
Slot address F1000097140C3500
MAJOR: 00A
.
.
dump: 010E3D00
mpx: .nodev (0009E378)
revoke: .nodev (0009E378)
dsdptr: F10000971660D000 <== the pointer to the volgrp structure
selptr: 00000000
opts: 0000002A DEV_DEFINED DEV_MPSAFE
(0)> volgrp F10000971660D000
VOLGRP............. F10000971660D000
vg_lock............... FFFFFFFFFFFFFFFF partshift............. 0000000E
open_count............ 0000000A flags................. 00000000
lvols............... @ F10000971660D010 <== pointer to the lvol struct
pvols............... @ F10000971660E010 <== pointer to the pvol struct
major_num............. 0000000A
vg_id................. 0007148300004C00000000E12335DF7D
nextvg................ 00000000 opn_pin............. @ F10000971660E428
von_pid............... 00000A32 nxtactvg.............. 00000000
ca_freepvw............ 00000000 ca_pvwmem............. 00000000
ca_hld.............. @ F10000971660E488 ca_pv_wrt........... @ F10000971660E4A0
.
.
(0)> lvol F10000971E624E00 <== display one of the lvol structures
LVOL............ F10000971E624E00
work_Q.......... 00000000 lv_status....... 00000000
lv_options...... 00001000 nparts.......... 00000001
i_sched......... 00000000 nblocks......... 00034000
parts[0]........ F10000971E621A00
pvol@ F1000097163DF200 <== pointer to pvol structure
.............dev 8000000E00000001 start 002C9100
parts[1]........ 00000000
parts[2]........ 00000000
maxsize......... 00000000 tot_rds......... 00000000
complcnt........ 00000000 waitlist........ FFFFFFFF
stripe_exp...... 00000000 striping_width.. 00000000
lvol_intlock. @ F10000971E624E60 lvol_intlock.... 00000000
(0)> pvol F1000097163DF200 <== now display the pvol
PVOL............... F1000097163DF200
dev................ 8000000E00000001 xfcnt.............. 00000000
armpos............. 00000000 pvstate............ 00000000
pvnum.............. 00000000 vg_num............. 0000000A
fp................. F1000096000022F0 flags.............. 00000000
num_bbdir_ent...... 00000000 fst_usr_blk........ 00001100
beg_relblk......... 00867C2D next_relblk........ 00867C2D
max_relblk......... 00867D2C defect_tbl......... F1000097165F4C00
ca_pv............ @ F1000097163DF250 sa_area[0]....... @ F1000097163DF260
sa_area[1]....... @ F1000097163DF270
pv_pbuf.......... @ F1000097163DF280 <== pointer to pbuf
oclvm............ @ F1000097163DF3C8
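The pbuf headers found through the pv_pbuf pointer above could be examined the same way; a sketch of the invocation (output omitted):
(0)> pbuf F1000097163DF280 <== display the physical buffer information for this pvol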
KDB SCSI sub commands
Introduction
The following table lists the SCSI sub commands and their matching crash/lldb sub commands, when available:

function         crash/lldb   KDB   kdb
display ascsi    N/A          asc   asc
display vscsi    N/A          vsc   vsc
display scdisk   N/A          scd   scd

asc, vsc and scd sub commands
The asc, vsc and scd sub commands respectively print:
• ascsi adapter information : the ascsiddpin kernext is used to locate the adp_ctrl structure.
• vscsi adapter information : the vscsiddpin kernext is used to locate the vscsi_ptrs structure.
• scdisk disk information : the scdiskpin kernext is used to locate the scdisk_list structure.
If no argument is specified, the asc subcommand loads the slot numbers with addresses from the adp_ctrl structure. The asc and vsc sub commands can use the following arguments:
• no argument : prompt for the structure address.
• slot : slot number of the adp_ctrl, vscsi_ptrs or scdisk_list entry to be displayed. This value must be a decimal number.
• Address : effective address of the structure to display. Symbols, hexadecimal values, or hexadecimal expressions can be used in specification of the address.
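vsc and scd would be driven the same way as asc in the examples below; a hypothetical sketch (slot numbers are placeholders, output omitted):
(0)> vsc 0 <== display the vscsi_ptrs entry for slot 0
(0)> scd 0 <== display the scdisk_list entry for slot 0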
Examples
(0)> lke 57
ADDRESS FILE FILESIZE FLAGS MODULE NAME
57 04E39480 01237AC0 00008958 00000262 /etc/drivers/ascsiddpin
le_flags....... TEXT DATAINTEXT DATA DATAEXISTS
le_next........ 04E39400 le_fp.......... 00000000
le_filename.... 04E394D8 le_file........ 01237AC0
le_filesize.... 00008958 le_data........ 0123FE60
(0)> d 0123FE60 80
0123FE60: 0123 EE3C 0123 EE38 0123 EE34 0123 EE30 .#.<.#.8.#.4.#.0
0123FE70: 0123 EE2C 0123 EE28 0123 EE24 0123 EE20 .#.,.#.(.#.$.#.
0123FE80: 0123 EE80 0123 EEC0 0123 EF00 0123 EF40 .#...#...#...#.@
0123FE90: 0123 EF80 0123 EFC0 0123 F000 0123 F040 .#...#...#...#.@
0123FEA0: 0123 F080 0123 F0C0 0123 F100 0123 F140 .#...#...#...#.@
0123FEB0: 0123 F180 0123 F1C0 0123 F200 0123 F240 .#...#...#...#.@
0123FEC0: 0000 0000 0000 0002 0000 0002 5002 D000 ............P...
0123FED0: 5002 E000 0000 0000 0000 0000 0000 0000 P...............
(0)> asc <== run asc and enter the address we found previously
Unable to find <adp_ctrl>
Enter the adp_ctrl address (in hex): 0123FEC0
Adapter control [0123FEC0]
semaphore............00000000 num_of_opens.........00000002
num_of_cfgs..........00000002
ap_ptr[ 0]...........5002D000
ap_ptr[ 1]...........5002E000
.
.
(0)> asc 1 <== now that asc has been run once, we can use slot numbers
Adapter info [5002E000]
ddi.resource_name..... ascsi1
intr............... @ 5002E000 ndd...................506FC020
seq_number............00000001 next..................00000000
local.............. @ 5002E1A4 ddi................ @ 5002E1D0
active_head...........00000000 active_tail...........00000000
wait_head.............00000000 wait_tail.............00000000
num_cmds_queued.......00000000 num_cmds_active.......00000000
adp_pool..............506C3128 surr_ctl........... @ 5002E22C
sta................ @ 5002E27C time_s.tv_sec.........00000000
time_s.tv_nsec........00000000 tcw_table.............506C3F9C
opened................00000001 adapter_mode..........00000001
adp_uid...............00000002 peer_uid..............00000000
sysmem................506C0000 sysmem_end............506C3FAD
busmem................00654000 busmem_end............00658000
tm_tcw_table..........00000000 eq_raddr..............00654000
dq_raddr..............00655000 eq_vaddr..............506C0000
dq_vaddr..............506C1000 sta_raddr.............00656000
sta_vaddr.............506C2000 bufs..................00658000
tm_sysmem.............00000000 wdog............... @ 5002E344
tm................. @ 5002E360 delay_trb.......... @ 5002E37C
xmem............... @ 5002E3B8 dma_channel...........04001000
mtu...................00141000 num_tcw_words.........00000011
shift.................00000000 tcw_word..............00000000
resvd1................00000000 cfg_close.............00000000
vpd_close.............00000000 locate_state..........00000004
locate_event..........FFFFFFFF rir_event.............FFFFFFFF
vpd_event.............FFFFFFFF eid_event.............FFFFFFFF
ebp_event.............FFFFFFFF eid_lock..............FFFFFFFF
recv_fn...............0124024C tm_recv_fn............00000000
tm_buf_info...........00000000 tm_head...............00000000
tm_tail...............00000000 tm_recv_buf...........00000000
tm_bufs_tot...........00000000 tm_bufs_at_adp........00000000
tm_bufs_to_enable.....00000000 tm_buf................00000000
tm_raddr..............00000000 proto_tag_e...........00000000
proto_tag_i...........00000000 adapter_check.........00000000
eid................ @ 5002E42C limbo_start_time......00000000
dev_eid............ @ 5002E4B0 tm_dev_eid......... @ 5002E8B0
pipe_full_cnt.........00000000 dump_state............00000000
pad...................00000000 adp_cmd_pending.......00000000
reset_pending.........00000000 epow_state............00000000
mm_reset_in_prog......00000000 sleep_pending.........00000000
bus_reset_in_prog.....00000000 first_try.............00000001
devs_in_use_I.........00000000 devs_in_use_E.........00000000
num_buf_cmds..........00000000 next_id...............00000045
next_id_tm............00000000 resvd4................00000000
ebp_flag..............00000000 tm_bufs_blocked.......00000000
tm_enable_threshold...00000000 limbo.................00000000
critical_path.........00000000 epow_reset_needed.....00000000
KDB memory allocator sub commands
Introduction
The following table lists the memory allocator sub commands and their matching crash/lldb sub commands, when available:

function                  crash/lldb   KDB        kdb
display kernel heap                    heap       heap
display heap debug        xmalloc      xm         xm
display kmem buckets      mblk         kmbucket   kmbucket
display kmem statistics                kmstats    kmstats
kmstats sub command
The kmstats sub command prints kernel allocator memory statistics. If no address
is specified, all kernel allocator memory statistics are displayed. If an address is
entered, only the specified statistics entry is displayed.
kmbucket sub command
The kmbucket sub command prints kernel memory allocator buckets. If no arguments are specified, information is displayed for all allocator buckets for all CPUs. kmbucket accepts the following parameters :
• -l - display the bucket free list.
• -c cpu - display only buckets for the specified CPU. The cpu is specified as a
decimal value.
• -i index - display only the bucket for the specified index. The index is specified
as a decimal value.
• Address - display the allocator bucket at the specified effective address.
Symbols, hexadecimal values, or hexadecimal expressions may be used in
specification of the address.
xm sub command
The xm subcommand may be used to display memory allocation information. Except for the -u option, these options require that the Memory Overlay Detection System (MODS) be active. MODS can be activated using the bosdebug command.
• -s : display allocation records matching addr. If Address is not specified, the
value of the symbol Debug_addr is used.
• -h : display free list records matching addr. If Address is not specified, the
value of the symbol Debug_addr is used.
• -l : enable verbose output. Applicable only with flags -f, -a, and -p.
• -f : display records on the free list, from the first freed to the last freed.
• -a : display allocation records.
• -p page : display page information for the specified page. The page number is
specified as a hexadecimal value.
• -d : display the allocation record hash chain associated with the record hash
value for Address. If Address is not specified, the value of the symbol
Debug_addr is used.
• -v : verify allocation trailers for allocated records and free fill patterns for free
records.
• -u : display heap statistics.
• -S : display heap locks and per-cpu lists. Note, the per-cpu lists are only used
for the kernel heaps.
• Address : effective address for which information is to be displayed. Symbols,
hexadecimal values, or hexadecimal expressions can be used in specification of
the address.
• heap_addr : effective address of the heap for which information is displayed. If
heap_addr is not specified, information is displayed for the kernel heap.
Symbols, hexadecimal values, or hexadecimal expressions can be used in
specification of the address.
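For instance, heap statistics are available even without MODS; a minimal sketch (the heap address is illustrative, output omitted):
(0)> xm -u <== display statistics for the kernel heap
(0)> xm -u F1000097100000D8 <== display statistics for a specific heap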
heap sub command
The heap subcommand displays information about heaps. If no argument is
specified information is displayed for the kernel heap. Information can be
displayed for other heaps by specifying an address of a heap_t structure.
Examples
(0)> heap <== display kernel heaps
Pinned heap 00730290
sanity......... 4E554D41 alt............ 00000001
heapaddr[00]... F100009710000000 [01].. 0
heapaddr[02]... 0 [03].. 0
baseaddr[00]... F100009713FF3000 [01].. 0
baseaddr[02]... 0 [03].. 0
numpages[00]... 1C00D [01].. 0
numpages[02]... 0 [03].. 0
Kernel heap 007302F8
sanity......... 4E554D41 alt............ 00000000
heapaddr[00]... F1000097100000D8 [01].. 0
heapaddr[02]... 0 [03].. 0
baseaddr[00]... F100009713FF3000 [01].. 0
baseaddr[02]... 0 [03].. 0
numpages[00]... 1C00D [01].. 0
numpages[02]... 0 [03].. 0
(0)> xm -S F1000097100000D8 <== display heap lock/cpu for kernel heap 007302F8
Locks:
Lock for allocation size 16: F100009710000248 Available
Lock for allocation size 32: F1000097100002C8 Available
Lock for allocation size 64: F100009710000348 Available
Lock for allocation size 128: F1000097100003C8 Available
Lock for allocation size 256: F100009710000448 Available
Lock for allocation size 512: F1000097100004C8 Available
Lock for allocation size 1024: F100009710000548 Available
Lock for allocation size 2048: F1000097100005C8 Available
Heap lists:
CPU  List #  Unpinned  Pinned
0    0       0         0
0    1       0         0
.
.
0    9       0         0
0    10      0         0
0    11      2322A000  0
1    0       0         0
.
.
(0)> kmstats <== display all the kernel allocator memory stats
mh_freelater ............0000000000E3E830
displaying kmemstats for offset 0 free
address...............F100009715FB46E0 inuse..(x)............0000000000000000
calls..(x)............0000000000000000 memuse..(x)...........0000000000000000
limit blocks..(x).....0000000000000000 map blocks..(x).......0000000000000000
maxused..(x)..........0000000000000000 limit..(x)............0000000000000000
failed..(x)...........0000000000000000 lock............... @ F100009715FB4728
lock..(x).............0000000000000000
.
.
.
(0)> kmbucket <== display all kernel memory allocator buckets
displaying kmembucket for cpu 0 offset 5 size 0x00000020
address...............F100009715FA4C48 b_next..(x)...........F1000082C007BB80
b_calls..(x)..........0000000000000026 b_total..(x)..........0000000000000080
b_totalfree..(x)......000000000000005D b_elmpercl..(x).......0000000000000080
b_highwat..(x)........00000000000003F5 b_couldfree (sic).(x).0000000000000000
b_failed..(x).........0000000000000000 lock............... @ F100009715FA4C90
lock..(x).............0000000000000000
displaying kmembucket for cpu 0 offset 6 size 0x00000040
.
.
KDB file system sub commands
Introduction
The following table lists the file system sub commands and their matching crash/lldb sub commands, when available:

function                   crash/lldb   KDB        kdb
display buffer             buffer       buffer     buffer
display buffer hash table               hbuffer    hbuffer
display freelist                        fbuffer    fbuffer
display gnode                           gnode      gnode
display gfs                             gfs        gfs
display file               file         file       file
display inode              inode        inode      inode
display inode hash table                hinode     hinode
display inode cache list                icache     icache
display rnode                           rnode      N/A
display vnode              vnode        vnode      vnode
display vfs                vfs          vfs        vfs
display specnode                        specnode   specnode
display devnode                         devnode    devnode
display fifo node                       fifonode   fifonode
display hnode hash table                hnode      hnode

buffer, hbuffer and fbuffer sub commands
The buffer, hbuffer and fbuffer sub commands respectively print :
• buffer cache headers.
• buffer cache hash list headers.
• buffer cache freelist headers.
If no argument is specified a summary is printed. Details can be displayed by
selecting a slot number or an address using :
• slot : a buffer pool slot number. This argument must be a decimal value.
• Address : effective address of a buffer pool entry. Symbols, hexadecimal
values, or hexadecimal expressions can be used in specification of the address.
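For example, a hypothetical sequence (the slot number is a placeholder, output omitted):
(0)> buffer <== display a summary of all buffer cache headers
(0)> buffer 5 <== display details for buffer pool slot 5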
inode, hinode and icache sub commands
The inode, hinode and icache sub commands respectively display :
• inode table entries. If no argument is entered a summary for used (hashed)
inode table entries is displayed.
• inode hash list entries.
• inode cache list entries.
These sub commands use the following arguments :
• slot : slot number of an entry. This argument must be a decimal value.
• Address : effective address of an entry. Symbols, hexadecimal values, or
hexadecimal expressions can be used in specification of the address.
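A hypothetical sequence (the slot number is a placeholder, output omitted):
(0)> inode <== display a summary of the used (hashed) inode table entries
(0)> hinode 10 <== display the inode hash list entries for slot 10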
gnode, vnode, specnode, devnode, fifonode, rnode and hnode sub commands
The gnode, vnode, specnode, devnode, fifonode, rnode and hnode sub commands respectively display :
• the generic node structure at the specified address.
• virtual node (vnode) table entries.
• the special device node structure at the specified address.
• device node (devnode) table entries.
• fifo node table entries.
• the remote node structure at the specified address.
• hash node table entries.
These sub commands accept the following arguments :
• slot : slot number of a table entry. This argument must be a decimal value.
• Address : effective address of a table entry. Symbols, hexadecimal values, or
hexadecimal expressions can be used in specification of the address.
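For example, a hypothetical sequence (the slot number is a placeholder, output omitted):
(0)> vnode <== display a summary of the vnode table entries
(0)> vnode 12 <== display details for vnode table slot 12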
vfs sub command
The vfs subcommand displays entries of the virtual file system table. If no
argument is entered a summary is displayed with one line for each entry. Detailed
information can be obtained for an entry by identifying the entry of interest.
Individual entries can be displayed using :
• slot : slot number of a virtual file system table entry. This argument must be a
decimal value.
• Address : address of a virtual file system table entry. Symbols, hexadecimal
values, or hexadecimal expressions can be used in specification of the address.
gfs sub command
The gfs subcommand displays the generic file system structure at the specified
address.
file sub command
The file subcommand displays file table entries. If no argument is entered all file
table entries are displayed in a summary. Used files are displayed first (count > 0),
then others. Detailed information can be displayed using :
• slot : slot number of a file table entry. This argument must be a decimal value.
• Address : effective address of a file table entry. Symbols, hexadecimal values,
or hexadecimal expressions can be used in specification of the address.
Examples
(0)> vfs <== display mounted vfs
GFS DATA TYPE FLAGS
1 KERN_heap+5F7C470 00394EC8 F100009715F7D990 JFS DEVMOUNT
... /dev/hd4 mounted over /
2 KERN_heap+5F7C4D0 00394EC8 F100009715F7DE60 JFS DEVMOUNT
... /dev/hd2 mounted over /usr
3 KERN_heap+5F7C530 00394EC8 F100009715F7DD00 JFS DEVMOUNT
... /dev/hd9var mounted over /var
4 KERN_heap+5F7C410 00394EC8 F100009715F7D8E0 JFS DEVMOUNT
... /dev/hd3 mounted over /tmp
5 KERN_heap+5F7C590 00394EC8 F100009715F7DAF0 JFS DEVMOUNT
... /dev/hd1 mounted over /home
6 KERN_heap+5F7C5F0 00395008 0000000000000000 PROCFS
... /proc mounted over /proc
7 KERN_heap+5F7C650 00394F68 F10000971EB5A3D0 AIX DEVMOUNT
... /dev/lv01 mounted over /j2
(0)> gfs 0039500 <== display gfs for jfs entry
gfs_data. 706F7374FBE1FFF8 gfs_flag. SYS5DIR FUMNT VERSION42 NOUMASK
gfs_ops.. E981008038210070 gn_ops... 7D8803A64E800020 gfs_name. N
gfs_init. 00000054000E776C gfs_rinit 607F00007C0802A6 gfs_type.
gfs_hold. E8625080
(0)> file <== display the file table
ADDR COUNT OFFSET DATA TYPE FLAGS
F100009600001080 1 0000000000000000 F1000097160CC2B0 VNODE WRITE NOCTTY
F1000096000010D0 1 0000000000000000 F1000082C0078800 SOCKET READ WRITE
F100009600001120 29 0000000000000000 F1000097159BB290 VNODE READ RSHARE
F100009600001170 2 0000000000000000 F100009714C89830 VNODE READ RSHARE
F1000096000011C0 34 0000000000026282 F100009714A01C60 VNODE READ RSHARE
F100009600001210 1 0000000000000100 F100009715696290 VNODE EXEC
F100009600001260 3 00000000000230E2 F100009714AA6620 VNODE READ RSHARE
KDB system table sub commands
Introduction

The following table represents the system table sub commands and their matching
crash/lldb sub commands when available :

system table function                 crash/lldb sub commands   KDB sub commands   kdb sub commands
display var                           var                       var                var
display devsw table                   devsw                     devsw              devsw
display system timer request blocks   callout                   trb                trb
display simple lock                   lock -s                   slk                slk
display complex lock                  lock -c                   clk                clk
search for deadlock                   dlock                     dla                dla
display ipl proc information          N/A                       iplcb              iplcb
display trace buffer                  trace                     trace              trace
display the stream queue              streams                   queue              queue

var sub command

The var subcommand prints the var structure and the system configuration of the
machine, including :
• Base kernel parameters
• Calculated high-water marks
• VMM tunable variables
• System configuration information
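For illustration, a session might look like the following sketch (the field shown
and its value are hypothetical, and most of the output is elided) :
(0)> var <== print the var structure and system configuration
v.v_maxup (max # of user processes)... 00000200 <== hypothetical value
.
.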
devsw sub command
The dev subcommand displays device switch table entries. If no argument is
specified, all entries are displayed. To display a specific entry use :
• major : indicates the specific device switch table entry to be displayed, by
the major number. This is the hexadecimal value of the device.
• Address : effective address of a driver. The device switch table entry with
the driver closest to the indicated address is displayed, and the specific
driver is indicated. Symbols, hexadecimal values, or hexadecimal
expressions can be used in specification of the address.
trb sub command

The trb subcommand displays Timer Request Block (TRB) information. If this
subcommand is entered without arguments, a menu is displayed allowing selection
of the data to be displayed. Otherwise, you can use the following arguments :
• * : selects display of Timer Request Block (TRB) information for TRBs on all
CPUs. The information displayed will be summary information for some
options.
• cpu x : selects display of TRB information for the specified CPU. Note, the
characters "cpu" must be included in the input. The value x is a hexadecimal
number.
• option : the option number indicating the data to be displayed. The available
option numbers are :
• 1. TRB Maintenance Structure - Routine Addresses
• 2. System TRB
• 3. Thread Specified TRB
• 4. Current Thread TRB's
• 5. Address Specified TRB
• 6. Active TRB Chain
• 7. Free TRB Chain
• 8. Clock Interrupt Handler Information
• 9. Current System Time - System Timer Constants
slk, clk and dla sub commands

The slk and clk sub commands respectively display simple and complex locks. If
no argument is specified, a list of major locks is displayed. You can then use the
address of a lock to display the lock structure.
The dla sub command searches for deadlocks.
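A hypothetical session (the lock address is invented for illustration, output
elided) :
(0)> slk <== display the list of major simple locks
.
.
(0)> slk F10000971E27AD00 <== display the simple lock structure at this address
.
.
(0)> dla <== search for deadlocks
.
.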
iplcb sub command
The iplcb sub command will display the IPL Control Block structure using the
following parameters :
• [cpu] : print the IPL control block, displaying all information including cpu
information for [cpu].
• * : print summary of all processors
• -dir : print directory information
• -proc [cpu] : print processor information
• -mem : print memory region information
• -sys : print system information
• -user : print user information
• -numa : print NUMA information
trace sub command

The trace sub command displays data in the kernel trace buffers. Data is entered
into these buffers via the shell command trace. The trace sub command accepts
the following parameters :
• -h : display trace headers.
• -c chan : select the trace channel for which the contents are to be monitored.
The value for chan must be a decimal constant in the range 0 to 7.
• hook : a hexadecimal value specifying the hook IDs to report on.
• :subhook : allows specification of subhooks, if needed. The subhooks are
specified as hexadecimal values.
Examples
(0)> !ls -al /dev/cd0 <== find the cd0 major number
br--r--r-- 1 root system 14, 0 Sep 08 11:18 /dev/cd0
(0)> lke 57 <== load the kernext for scsidd
ADDRESS FILE FILESIZE FLAGS MODULE NAME
57 049D6B00 00DB9740 000070D8 00080262 s_scsidd64/usr/lib/drivers/pci/s_scsidd
le_flags....... TEXT DATAINTEXT DATA DATAEXISTS 64
le_next........ 049D6900 le_svc_sequence 00000000.
.
.
.
.
(0)> dev 0xd <== display the cd0 device
Slot address F10000971406F680
MAJOR: 00D
open: .ssc_open (00DBC0B0)
close: .ssc_close (00DBEAD8)
read: .nodev (00059694)
write: .nodev (00059694)
ioctl: .ssc_ioctl (00DBD1DC)
strategy: .ssc_strategy (00DC3C2C)
ttys: 00000000
select: .nodev (00059694)
config: .ssc_config (00DBE180)
print: .nodev (00059694)
dump: .ssc_dump (00DCDEF4)
mpx:      .nodev (00059694)
revoke: .nodev (00059694)
dsdptr: 00000000
selptr: 00000000
opts: 0000002A
DEV_DEFINED DEV_MPSAFE
(0)> trb cpu 1 7 <== display the trb free list for cpu 1
CPU #1 TRB #1 of 13 on Free List
Timer address..............F100009715F8B780
trb->to_next...............0000000000000000
trb->knext.................F10000971E27AD00
trb->kprev.................0000000000000000
Owner id (-1 for dev drv)..00000000000042A1
Owning processor...................00000001
Timer flags........................00000010 INCINTERVAL
trb->timerid...............0000000000000000
trb->eventlist.............FFFFFFFFFFFFFFFF
trb->timeout.it_interval...0000000000000000 sec. 00000000 nsec.
Next scheduled timeout ....0000000039BE55A6 sec. 19B39935 nsec.
Completion handler.........00000000001DA910 .rtsleep_end+000000
Completion handler data....F100009715F8B7B0
Int. priority .....................FFFFFFFF
Timeout function...........0000000000000000
CPU #1 TRB #2 of 13 on Free List
.
(0)> iplcb -mem <== display the iplcb memory region information
Memory information [10008AAC]
SLOT ADDR             SIZE             NODE ATTR     LABEL
0 0000000000000000 0000000000FF1000 0 VirtAddr FreeMem
1 0000000000FF1000 000000000000F000 0 VirtAddr RMALLOC
2 0000000001000000 0000000006FCC000 0 VirtAddr FreeMem
3 0000000007FCC000 0000000000029000 0 None RTAS_HEAP
4 0000000007FF5000 000000000000B000 0 VirtAddr IPLCB
5 0000000008000000 0000000018000000 0 VirtAddr FreeMem
6 0000000020000000 FFFFFFFFE0000000 0 None IO_SPACE
(0)> trace <== show the trace buffers (trace was started for proc events)
Trace channel[0 - 7]: 0
Trace Channel 0 (7 entries)
Current queue starts at F1000097231F2000 and ends at F100009723232000
Current entry is #7 of 7 at F1000097231F2130
Hook ID: SYSC_EXECVE (00000134) Hook Type: Timestamped|Generic C000
ThreadIdent: 00003F0B
Timestamp: 26E264B2F6
Subhook ID/HookData: 0000
Data Length: 0007 bytes
D0: 00000001
*Variable Length Buffer: F1000097231F2140
Current queue starts at F1000097231F2000 and ends at F100009723232000
Current entry is #6 of 7 at F1000097231F2108
.
.
KDB network sub commands
Introduction

The following table represents the network sub commands and their matching
crash/lldb sub commands when available :

network function    crash/lldb sub commands   KDB sub commands   kdb sub commands
display interface   netstat                   ifnet              ifnet
display TCBs        ndb                       tcb                tcb
display UDBs        ndb                       udb                udb
display sockets     sock                      sock               sock
display TCP CB      ndb                       tcpcb              tcpcb
display mbuf        mbuf                      mbuf               mbuf

ifnet sub command

The ifnet sub command prints interface information. If no argument is specified,
information is displayed for each entry in the ifnet table. Data for individual
entries can be displayed by specifying :
• slot : specifies the slot number within the ifnet table for which data is to be
displayed. This value must be a decimal number.
• Address : effective address of an ifnet entry to display. Symbols, hexadecimal
values, or hexadecimal expressions can be used in specification of the address.
tcpcb and sock sub commands

The tcpcb and sock sub commands respectively print :
• tcpcb information for TCP/UDP blocks.
• socket information for TCP/UDP blocks.
If no argument is specified, tcpcb information is displayed for all TCP and UDP
blocks. tcpcb and sock accept the following arguments :
• tcp : display tcpcb information for TCP blocks only.
• udp : display tcpcb information for UDP blocks only.
• Address : effective address of a tcpcb structure to be displayed. Symbols,
hexadecimal values, or hexadecimal expressions can be used in specification of
the address.
tcb and udb sub commands
The tcb and udb sub commands can be used respectively to display :
• tcb block information plus socket information
• udb block information plus socket information
tcb and udb accept the following parameters :
• slot : specifies the slot number within the tcb or udb table for which data is to
be displayed. This value must be a decimal number.
• Address : effective address of a tcb or udb entry to display. Symbols, hexadecimal
values, or hexadecimal expressions can be used in specification of the address.
Examples
(0)> ifnet
SLOT 1 ---- IFNET INFO ----(@ 007545E0)----
name........ lo0
unit........ 00000000 mtu......... 00004200
flags....... 0E08084B
(UP|BROADCAST|LOOPBACK|RUNNING|SIMPLEX|NOECHO|BPF|GROUP_ROUTING...
...|64BIT|CANTCHANGE|MULTICAST)
timer....... 00000000 metric...... 00000000
address: 127.0.0.1
init()...... 00000000 output().... 001DBF38 start()..... 00000000
done()...... 00000000 ioctl()..... 001DBF20 reset()..... 00000000
watchdog().. 00000000 ipackets.... 000000B5 ierrors..... 00000000
opackets.... 000000B5 oerrors..... 00000000 collisions.. 00000000
next........ F10000971614F000 type........ 00000018 addrlen..... 00000000
hdrlen...... 00000000 index....... 00000001
ibytes...... 00003448 obytes...... 00003448 imcasts..... 00000000
omcasts..... 00000000 iqdrops..... 00000000 noproto..... 00000000
baudrate.... 00000000 arpdrops.... 00000000 ifbufminsize 00000000
devno....... 00000000 chan........ 00000000 multiaddrs.. F1000082C0157468
tap()....... 00000000 tapctl...... 00000000 arpres().... 00000000
arprev().... 00000000 arpinput().. 00000000 ifq_head.... 00000000
ifq_tail.... 00000000 ifq_len..... 00000000 ifq_maxlen.. 00000032
ifq_drops... 00000000 ifq_slock... 00000000 slock....... 00000000
multi_lock.. 00000000 6_multi_lock 00000000 addrlist_lck 00000000
gidlist..... 00000000 ip6tomcast() 00000000
ndp_bcopy(). 00000000
ndp_bcmp().. 00000000 ndtype...... 01000000 multiaddrs6. F1000082C0158F00
SLOT 2 ---- IFNET INFO ----(@ F10000971614F000)----
name........ tr0
unit........ 00000000 mtu......... 000005D4
.
.
(0)> tcpcb @ F1000082C0031C34 <== display the first tcpcb
---- TCPCB ----(@ F1000082C0031C34)----
seg_next... F1000082C0031C34 seg_prev...... F1000082C0031C34
t_softerror 00000000 t_state....... 00000004 (ESTABLISHED)
t_timer.... 00000000 (TCPT_REXMT)
t_timer.... 00000000 (TCPT_PERSIST)
t_timer.... 00000CFB (TCPT_KEEP)
t_timer.... 00000000 (TCPT_2MSL)
t_rxtshift. 00000000 t_rxtcur...... 00000004 t_dupacks..... 00000000
t_maxseg... 000005AC t_force....... 00000000
t_flags.... 00000000 ()
t_oobflags. 00000000 ()
t_iobc..... 00000000 t_template.. F1000082C0031C64
t_inpcb..F1000082C0031B54 <== pointer to tcb or udb structure
t_timestamp... 2DF79401 snd_una....... 8D452AB5 snd_nxt....... 8D452AB5
snd_up........ 8D452920 snd_wl1....... 42612E19 snd_wl2....... 8D452AB5
iss........... 8D4514FA snd_wnd....... 00003E64 rcv_wnd....... 00004410
rcv_nxt....... 42612E1B rcv_up........ 42612E18 irs........... 42612D92
snd_wnd_scale. 00000000 rcv_wnd_scale. 00000000 req_scale_sent 00000000
req_scale_rcvd 00000000 last_ack_sent. 42612E1B timestamp_rec. 00000000
timestamp_age. 00002BE3 rcv_adv....... 4261722B snd_max....... 8D452AB5
snd_cwnd...... 0000DD34 snd_ssthresh.. 3FFFC000 t_idle........ 00002B45
t_rtt......... 00000000 t_rtseq....... 8D452920 t_srtt........ 00000007
t_rttvar...... 00000004 t_rttmin...... 00000002 max_rcvd...... 00000000
max_sndwnd.... 00003E64 t_peermaxseg.. 000005AC
(0)> tcb f1000082C0031B54 <== display the tcb for the pointer found before
-------- TCB --------- INPCB INFO ----(@ F1000082C0031B54)----
next........ F1000082C0032354 prev........ 04BB8F80 head........ 04BB8F80
iflowinfo... 00000000 faddr_6... @ F1000082C0031B74 fport....... 00008036
fatype...... 00000001 oflowinfo... 00000000 laddr_6... @ F1000082C0031B8C
lport....... 00000017 latype...... 00000001 socket...... F1000082C0031800
ppcb........ F1000082C0031C34 route_6... @ F1000082C0031BAC ifa.....00000000
flags....... 00000400 proto....... 00000000 tos......... 00000000
ttl......... 0000003C rcvttl...... 00000000 rcvif....... F10000971614F000
options..... 00000000 refcnt...... 00000002
lock........ 00000000 rc_lock..... 00000000 moptions.... 00000000
hash.next... 04BEB040 hash.prev... 04BEB040
timewait.nxt 00000000 timewait.prv 00000000
---- SOCKET INFO ----(@ F1000082C0031800)---- <== we also get socket information
type........ 0001 (STREAM)
opts........ 010C (REUSEADDR|KEEPALIVE|OOBINLINE)
linger...... 0000 state....... 0102 (ISCONNECTED|NBIO)
pcb.. F1000082C0031B54 proto.. 04BAC870 lock.. F1000082C007B740 head.00000000
q0...... 00000000 q....... 00000000 dq...... 00000000 q0len....... 0000
qlen........ 0000 qlimit...... 0000 dqlen....... 0000 timeo....... 0000
error....... 0000 special..... 0A8C pgid.... 00000000 oobmark. 00000000
snd:cc...... 00000000 hiwat... 00002000 mbcnt... 00000000 mbmax... 00008000
lowat... 00001000 mb...... 00000000
sel..... 00000000 events...... 0000
iodone. 00000000 ioargs. 00000000 lastpkt. F1000082C01BE800 wakeone. FFFFFFFF
timer... 00000000 timeo... 00000000 flags....... 0048 (SEL|NOINTR)
wakeup.. 00F66E78 wakearg. C01FF918 lock.... FFFFFFFFF1000082
rcv:cc...... 00000000 hiwat... 00004410 mbcnt... 00000000 mbmax... 00011040
lowat... 00000001 mb...... 00000000 sel..... 00000000 events...... 0004
iodone.. 00000000 ioargs.. 00000000 lastpkt. F1000082C01A9800 wakeone. FFFFFFFF
timer... 00000000 timeo... 00000000 flags....... 0048 (SEL|NOINTR)
wakeup.. 00F66E78 wakearg. C01FF800 lock.... FFFFFFFFF1000082
tpcb.... 00000000 fdev_ch. F10000971E186DC0 sec_info 00000000 qos..... 00000000
gidlist. 00000000 private. 00000000 uid..... 00000000 bufsize. 00000000
threadcnt 00000000 nextfree 00000000 siguid.. 00000000 sigeuid. 00000000
sigpriv. 00000000
sndtime. 00000000 sec 00000000 usec rcvtime. 00000000 sec 00000000 usec
proc/fd: 44/0 44/1 44/2
KDB VMM sub commands
Introduction

The following table represents the VMM sub commands and their matching
crash/lldb sub commands when available :

VMM function                 crash/lldb sub commands   KDB sub commands   kdb sub commands
VMM kernel segment data      /vmm-1                    vmker              vmker
VMM RMAP                     vmm-rmap                  rmap               rmap
VMM control variables        /vmm-2                    pfhdata            pfhdata
VMM statistics               /vmm-3                    vmstat             vmstat
VMM addresses                /vmm-a                    vmaddr             vmaddr
VMM paging device table      vmm-pdt                   pdt                pdt
VMM segment control blocks   vmm-scb                   scb                scb
VMM PFT entries              vmm-pft                   pft                pft
VMM PTE entries              vmm-pte                   pte                pte
VMM PTA segment              vmm-pta                   pta                pta
VMM STAB                     N/A                       ste                ste
VMM segment register         sr64                      sr64               sr64
VMM segment status           segst64                   segst64            segst64
VMM APT entries              vmm-apt                   apt                apt
VMM wait status              /vmm-9                    vmwait             vmwait
VMM address map entries      vmm-ame                   ames               ames
VMM zeroing kproc            N/A                       zproc              zproc
VMM error log                /vmm-f                    vmlog              vmlog
VMM reload xlate table       /vmm-e                    vrld               vrld
IPC information              vmm-sem/shm               ipc                ipc
VMM lock anchor/tblock       N/A                       lockanch           lockanch
VMM lock hash table          N/A                       lockhash           lockhash
VMM lock word                N/A                       lockword           lockword
VMM disk map                 N/A                       vmdmap             vmdmap
VMM spin locks               N/A                       vmlocks            vmlocks
vmker, pfhdata, vmstat, vmaddr, vmwait, zproc, vmlog, vrld and vmlocks sub commands

These sub commands will display VMM information about :
• vmker : virtual memory kernel data.
• pfhdata : virtual memory control variables.
• vmstat : virtual memory statistics.
• vmaddr : addresses of VMM structures.
• vmwait : displays VMM wait status using the address of a wait channel.
• zproc : displays information about the VMM zeroing kproc.
• vmlog : displays the current VMM error log entry.
• vrld : displays the VMM reload xlate table. This information is only used on
SMP PowerPC machines, to prevent VMM reload dead-lock.
• vmlocks : displays VMM spin lock data.

scb sub command

The scb sub command provides options for display of information about VMM
segment control blocks. The scb sub command will prompt a menu to display scb
using the following options :
• 1 : index
• 2 : sid
• 3 : srval
• 4 : search on sibits
• 5 : search on npsblks
• 6 : search on nvpages
• 7 : search on npages
• 8 : search on npseablks
• 9 : search on lock
• a : search on segment type
• b : add total scb_vpages
• c : search on segment class
• d : search on segment pvproc

ames sub command

The ames subcommand provides options for display of the process address map
for either the current or a specified process. The ames sub command will prompt
a menu to display the address map using the following options :
• 1 : current process
• 2 : specified process
• 3 : specified address map
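These menu-driven sub commands all prompt in the same way; the following
hypothetical scb session sketches the idea (the menu wording and sid value are
invented for illustration, output elided) :
(0)> scb <== prompt the scb menu
1) index  2) sid  3) srval ... d) segment pvproc <== the options listed above
Enter option : 2 <== search by sid
Enter sid : 2000 <== hypothetical segment id
.
.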
pft sub command

The pft sub command provides options for display of information about the VMM
page frame table. The pft sub command will prompt a menu to display page frame
information using the following options :
• 2 : h/w hash (sid,pno)
• 3 : s/w hash (sid,pno)
• 4 : search on swbits
• 5 : search on pincount
• 6 : search on xmemcnt
• 7 : scb list
• 8 : io list
• 9 : deferred pgsp service frames

pte sub command

The pte sub command provides options for display of information about VMM
page table entries. The pte sub command will prompt a menu to display page
table entries using the following options :
• 1 : index
• 2 : sid,pno
• 3 : page frame
• 4 : PTE group

pta sub command

The pta subcommand displays data from the VMM PTA segment. The following
optional arguments may be used to determine the data to be displayed :
• -r : display XPT root data.
• -d : display XPT direct block data.
• -a : display the Area Page Map.
• -v : display map blocks.
• -x : display XPT fields.
• -f : prompt for the sid/pno for which the XPT fields are to be displayed.
• sid : segment ID. Symbols, hexadecimal values, or hexadecimal expressions
may be used for this argument.
• idx : index for the specified area. Symbols, hexadecimal values, or
hexadecimal expressions may be used for this argument.
pdt sub command
The pdt subcommand displays entries of the paging device table. An argument of
* results in all entries being displayed in a summary. Details for a specific entry
can be displayed using a slot number.
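A short sketch of the usage just described (output elided) :
(0)> pdt * <== summary of all paging device table entries
.
.
(0)> pdt 1 <== details for pdt slot 1
.
.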
rmap sub command
The rmap subcommand displays the real address range mapping table. If an
argument of * is specified, a summary of all entries is displayed. If a slot number
is specified, only that entry is displayed. If no argument is specified, the user is
prompted for a slot number, and data for that and all higher slots is displayed, as
well as the page intervals utilized by VMM.
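A short sketch (the slot number is invented for illustration, output elided) :
(0)> rmap * <== summary of all real address range mapping entries
.
.
(0)> rmap 3 <== display only slot 3
.
.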
ste sub command

The ste subcommand provides options for display of information about segment
table entries for 64-bit processes. The ste sub command will prompt a menu to
display segments using the following options :
• 1 : esid
• 2 : sid
• 3 : dump hash class (input=esid)
• 4 : dump entire stab

sr64 sub command

The sr64 sub command displays segment registers for a 64-bit process, using the
following parameters :
• none : the segment registers will be displayed for the current process.
• -p pid : process ID of a 64-bit process. This must be a decimal or hexadecimal
value depending on the setting of the hexadecimal_wanted switch.
• esid : first segment register to display (lower register numbers are ignored).
This argument must be a hexadecimal value.
• size : value to be added to esid to determine the last segment register to display.
This argument must be a hexadecimal value.
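A hypothetical session (the pid is invented for illustration, output elided) :
(0)> sr64 <== segment registers of the current process
.
.
(0)> sr64 -p 1F48 2 4 <== registers 2 through 6 of 64-bit process 1F48
.
.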
apt sub command

The apt subcommand provides options for display of information from the alias
page table. The apt sub command will prompt a menu to display aliases using the
following options :
• 1 : index
• 2 : sid,pno
• 3 : page frame
segst64 sub command
The segst64 subcommand displays segment state information for a 64-bit process.
The information display can be filtered using :
• no argument : the information for the current process is displayed.
• -p pid : process ID of a 64-bit process. This must be a decimal or hexadecimal
value depending on the setting of the hexadecimal_wanted switch.
• -e esid : first segment register to display (lower register numbers are ignored).
• -s seg : limit display to only segment registers with a segment state that
matches seg. Possible values for seg are : SEG_AVAIL, SEG_SHARED,
SEG_MAPPED, SEG_MRDWR, SEG_DEFER, SEG_MMAP,
SEG_WORKING, SEG_RMMAP, SEG_OTHER, SEG_EXTSHM, and
SEG_TEXT.
• value : limit display to only segments with the specified value for the segfileno
field.
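A hypothetical session (the pid is invented for illustration, output elided) :
(0)> segst64 <== segment states of the current process
.
.
(0)> segst64 -p 1F48 -s SEG_MMAP <== only SEG_MMAP segments of process 1F48
.
.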
ipc sub command
The ipc subcommand reports interprocess communication facility information.
The ipc sub command will prompt a menu to display ipc using the following
options :
• ***TBD***
lockanch, lockhash and lockword sub commands
These sub commands will display VMM lock information for :
• lockanch : anchor data and data for the transaction blocks in the transaction
block table.
• lockhash : lock hash list.
• lockword : lock words.
lockanch, lockhash and lockword accept the following parameters :
• slot : slot number of an entry in the VMM lock table. This argument must be a
decimal value.
• Address : effective address of an entry in the VMM lock table. Symbols,
hexadecimal values, or hexadecimal expressions may be used in specification
of the address.
vmdmap sub command
The vmdmap subcommand displays VMM disk maps. To look at other disk maps
it is necessary to initialize segment register 13 with the corresponding srval.
vmdmap accepts the following arguments :
• no argument : all paging and file system disk maps are displayed.
• slot : Page Device Table (pdt) slot number. This argument must be a decimal
value.
Examples
***TBD***
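Until the example above is supplied, the following hypothetical sketch shows the
intended usage (output elided) :
(0)> vmdmap <== display all paging and file system disk maps
.
.
(0)> vmdmap 0 <== display the disk map for pdt slot 0
.
.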
KDB SMP sub commands
Introduction

The following table represents the SMP sub commands and their matching
crash/lldb sub commands when available :

SMP function    crash/lldb sub commands   KDB sub commands   kdb sub commands
Start cpu       N/A                       start              N/A
Stop cpu        N/A                       stop               N/A
Switch to cpu   cpu                       cpu                cpu

start, stop and cpu sub commands

The start, stop and cpu sub commands will allow you to :
• start a cpu
• stop a cpu
• display status or switch to another cpu
These sub commands accept a cpu number as parameter.

Examples

***TBD***
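Until the example above is supplied, a hypothetical session might be :
(0)> stop 1 <== stop cpu 1
(0)> start 1 <== start cpu 1 again
(0)> cpu 1 <== switch the debugger to cpu 1; the prompt becomes (1)>
(1)>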
KDB data and instruction block address translation sub commands
Introduction

The following table represents the block address translation sub commands and
their matching crash/lldb sub commands when available :

block address translation function   crash/lldb sub commands   KDB sub commands   kdb sub commands
display dbats                        N/A                       dbat               dbat
display ibats                        N/A                       ibat               ibat
modify dbats                         N/A                       mdbat              mdbat
modify ibats                         N/A                       mibat              mibat

dbat and ibat sub commands

On PowerPC machines, the dbat and ibat sub commands may be used to display
the dbat and ibat registers. dbat and ibat accept the following arguments :
• no argument : all dbat or ibat registers are displayed.
• index : just the specified dbat or ibat register is displayed.
mdbat and mibat sub commands

On PowerPC machines, the mdbat and mibat sub commands may be used to
modify dbat and ibat registers. The processor bat register is altered immediately.
KDB takes care of the valid bit; the word containing the valid bit is set last.
mdbat and mibat accept the following arguments :
• no argument : all dbat or ibat registers are prompted for modification.
• index : just the specified dbat or ibat register is prompted for modification.
Examples
KDB(0)> dbat 2 <== display bat register 2
BAT2: 00000000 00000000
bepi 0000 brpn 0000 bl 0000 v 0 wimg 0 ks 0 kp 0 pp 0
KDB(0)> mdbat 2 <== alter bat register 2
BAT register, enter <RC> twice to select BAT field, enter <.> to quit
BAT2 upper 00000000 = <CR/LF>
BAT2 lower 00000000 = <CR/LF>
BAT field, enter <RC> to select field, enter <.> to quit
BAT2.bepi: 00000000 = 00007FE0
BAT2.brpn: 00000000 = 00007FE0
BAT2.bl : 00000000 = 0000001F
BAT2.v : 00000000 = 00000001
BAT2.ks : 00000000 = 00000001
BAT2.kp : 00000000 = <CR/LF>
BAT2.wimg: 00000000 = 00000003
BAT2.pp : 00000000 = 00000002
BAT2: FFC0003A FFC0005F
bepi 7FE0 brpn 7FE0 bl 001F v 1 wimg 3 ks 1 kp 0 pp 2
eaddr = FFC00000, paddr = FFC00000 size = 4096 KBytes
KDB bat/brat sub commands
Introduction

The following table represents the bat/brat sub commands and their matching
crash/lldb sub commands when available :

bat/brat function           crash/lldb sub commands   KDB sub commands   kdb sub commands
branch target               N/A                       btac               N/A
clear branch target         N/A                       cbtac              N/A
local branch target         N/A                       lbtac              N/A
clear local branch target   N/A                       lcbtac             N/A

btac, lbtac, cbtac and lcbtac sub commands

The btac and lbtac sub commands can be used to stop when Branch Target
Address Compare is true, using hardware registers HID1 and HID2 on PowerPC
systems, in the following conditions :
• btac : set a general branch target.
• lbtac : set a local branch target on a per-cpu basis.
cbtac and lcbtac respectively clear general and local branch targets.
Examples
KDB(0)> btac open <== set BRAT on open function
KDB(7)> btac <== display current BRAT status
CPU 0: .open+000000 eaddr=001B5354 vsid=00000000 hit=0
CPU 1: .open+000000 eaddr=001B5354 vsid=00000000 hit=0
KDB(0)> q <== exit the debugger
...
Branch trap: 001B5354 <.open+000000>
.sys_call+000000 bcctrl
<.open>
KDB(0)> btac <== display current BRAT status (we have one hit)
CPU 0: .open+000000 eaddr=001B5354 vsid=00000000 hit=1
CPU 1: .open+000000 eaddr=001B5354 vsid=00000000 hit=0
IADB kernel debugger
Introduction
The IADB is the kernel debugger used on AIX5L running on the IA-64 platform.
Availability
The kernel debugger must be enabled in order to be used on AIX5L.
The following command should return 0000000000000001 if the kernel debugger
is enabled :
# iadb
(0)> d dbg_avail
E000000004755BD8: 0000000000000001
Overview

The major functions of the IADB are :
• Setting breakpoints within the kernel or kernel extensions
• Execution control through various forms of step commands
• Formatted display of selected kernel data structures
• Display and modification of kernel data
• Display and modification of kernel instructions
• Modification of the state of the machine through alteration of system registers

loading IADB

In AIX5L, the IADB is included in the unix_ia64 kernel located in /usr/lib/boot.
In order to use it, the IADB must be loaded at boot time. To allow IADB to load,
use the following commands :
• bosboot -a -D -d /dev/ipldevice, or bosdebug -D : will load
IADB at boot time.
• bosboot -a -I -d /dev/ipldevice, or bosdebug -I : will
load and invoke the IADB at boot time.
• bosboot -ad /dev/ipldevice, or bosdebug -o : will not load
or invoke the IADB at boot time.
You must reboot the system for these changes to take effect.
starting IADB
The IADB may be started, if loaded, under the following circumstances :
• If bosboot or bosdebug was run with -I, the tty attached to a native serial
port will show the IADB just after the kernel is loaded.
• You may manually invoke the IADB from a tty attached to a native serial port,
using a native keyboard, with Ctrl-Alt-Numpad4. For example :
Debugger entered by hitting cntrl-atl-numpad4
AIX/IA64 KERNEL DEBUGGER ENTERED Due to...
Debugger entered via keyboard with key in SERVICE position using numpad 4
IP->E00000000008C910 waitproc_find_run_queue()+210: { .mib
==>0:   adds            sp = 0x40, sp
   1:   mov.i           ar.lc = r33
   2:   br.ret.sptk.few rp ;;
}
>CPU0>
• An application makes a call to the breakpoint() kernel service or to the
breakpoint system call.
• A breakpoint previously set using the IADB has been reached.
• A fatal system error occurs. A dump might be generated on exit from the
IADB.
IADB concept
When the IADB Kernel Debugger is invoked, it is the only running program until
you exit IADB or you use the start sub command to start another cpu. All
processes are stopped and interrupts are disabled. The IADB Kernel Debugger
runs with its own Machine State Save Area (mst) and a special stack. In addition,
the IADB Kernel Debugger does not run operating system routines. Though this
requires that kernel code be duplicated within IADB, it is possible to break
anywhere within the kernel code. When exiting the IADB Kernel Debugger, all
processes continue to run unless the debugger was entered via a system halt.
iadb command
Introduction
The iadb command, unlike the IADB kernel debugger, allows examination of an
operating system image on IA-64 systems.
The iadb command may be used on a running system but will not provide all
functions available with the IADB kernel debugger.
Parameters
The iadb command maybe used with the following parameters :
• no parameter : the iadb will use /dev/mem as the system image file and /usr/lib/
boot/unix as the kernel file. In this case root permissions are required.
• -d system_image_file : the iadb will use the image file provided.
• -u kernel_file : the iadb will use the kernel file. This is required to analyze a
system dump on a system that has a different unix level.
• -i include file list(may be comma separated)
• -u user modules list for any symbol retrieval(comma separated list)
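For instance, to examine a dump with its matching IA-64 kernel, a session might
start like this (the dump file path is hypothetical, output elided) :
# iadb -d /var/adm/ras/vmcore.0 -u /usr/lib/boot/unix_ia64
(0)> d dbg_avail <== the usual sub commands now run against the image
.
.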
Loading errors
If the system image file provided doesn’t contain a valid dump or the kernel file
doesn’t match the system image file, the following message may be issued by the
iadb command:
# iadb -u /usr/lib/boot/unix -d dump_file
**TBD**
IADB break point and step sub commands
Introduction

The following table represents the breakpoint and step sub commands and their
matching crash/lldb sub commands when available :

breakpoint and step function   crash/lldb sub commands   IADB sub commands   iadb sub commands
set/list break point           N/A                       br                  N/A
set/list local break point     N/A                       N/A                 N/A
clear local break point        N/A                       N/A                 N/A
clear break points             N/A                       c                   N/A
clear all break points         N/A                       c                   N/A
go to end of function          N/A                       sr                  N/A
go until address               N/A                       N/A                 N/A
single step                    N/A                       s/so                N/A
step a bundle                  N/A                       sb                  N/A
step to next branch            N/A                       stb                 N/A
step on bl/blr                 N/A                       N/A                 N/A
step on branch                 N/A                       N/A                 N/A

br sub command

The br subcommand can be used to set and display software break points. The br
subcommand accepts the following options :
• none : display the currently set break points.
• -a ‘N’ : break after ‘N’ occurrences.
• -c {expr} : break if the condition {expr} is true.
• -d : deferred, set the break point when the module is loaded.
• -e ‘N’ : break every ‘N’ occurrences.
• -t ‘tid’ : break only if the current thread id is ‘tid’.
• -u ‘N’ : break up to ‘N’ occurrences.
• address : the break point address.

c sub command

The c sub command can be used to clear some or all break points. The c sub
command accepts the following parameters :
• index : index of the break point as listed in the br output.
• address : address of the break point.
• all : clear all break points.
Examples
The following example shows the use of the br, c and s sub commands :
# ps -mF "THREAD" <== search for our thread id
USER  PID PPID   TID S CP PRI SC WCHAN F      TT BND COMMAND
root 8008    1     - A  0  60  1     - 240001  0   - -ksh
   -    -    - 10865 S  0  60  1     - 400     -   - -
# <== hit ctrl-alt-numpad4 to enter the IADB
AIX/IA64 KERNEL DEBUGGER ENTERED Due to....
Debugger entered via keyboard.
IP->E0000000000884B1 waitproc()+131: { .mii
   0:   ld4.acq r40 = [r36]
==>1:   adds    r8 = 0x1, r41 ;;
   2:   cmp.eq  p6, p0 = 0, r40 }
> dis kread+90 <== in bundle 90 of kread we have a branch to rdwr()
E000000000333B90 kread()+90: { .mib
   0:   st8     [r11] = r9
   1:   nop.i   0
   2:   br.call.sptk.few rp = <rdwr()+0> ;; }
> br -a 5 -t 2A71 kread <== set a break point after 5 kread for our tid
> br <== list break points
brk[0] = br -a 5 -t (tid = 2A71 pid = 1F48) kread()+0
> go <== exit IADB
See Ya!
# <== hit enter, this will call 3 kread
brk[0] = br -a 5 -t (tid = 2A71 pid = 1F48) kread()+0
brk[0] = br -a 5 -t (tid = 2A71 pid = 1F48) kread()+0
brk[0] = br -a 5 -t (tid = 2A71 pid = 1F48) kread()+0
# <== hit enter, this will call 3 kread
brk[0] = br -a 5 -t (tid = 2A71 pid = 1F48) kread()+0
brk[0] = br -a 5 -t (tid = 2A71 pid = 1F48) kread()+0
AIX/IA64 KERNEL DEBUGGER ENTERED Due to...<== after 5 kread we enter IADB
Break instruction interrupt.
IP->E000000000333B00 kread()+0: { .mii
==>0:   alloc   r35 = ar.pfs, 5, 0, 5, 0
   1:   adds    sp = -0xA0, sp
   2:   mov     r36 = rp ;; }
> s <== we step one instruction at a time in bundle 1
IP->E0000000002E220 kread()+1: { .mii
==>0:   alloc   r35 = ar.pfs, 5, 0, 5, 0
   1:   adds    sp = -0xA0, sp
   2:   mov     r36 = rp ;; }
> sb <== we step to next bundle (bundle 10)
AIX/IA64 KERNEL DEBUGGER ENTERED Due to...
Break instruction interrupt.
IP->E0000000002E2230 kread()+10: { .mii
==>0:   adds    r8 = 0x18, sp
   1:   adds    r40 = 0x20, sp
   2:   adds    r9 = 0x28, sp }
> stb <== we step to next branch that points to rdwr()
Another thread is currently stepping. To avoid
confusion, only one thread can be actively stepped.
Would you rather step this thread? (y/n) y
IP->E0000000002E2620 rdwr()+0: { .mii
==>0:   alloc   r41 = ar.pfs, 11, 0, 6, 0
   1:   adds    sp = -0x50, sp
   2:   mov     r42 = rp ;; }
> sr <== we return from rdwr() so we come back in kread in bundle A0
AIX/IA64 KERNEL DEBUGGER ENTERED Due to...
Break instruction interrupt.
IP->E0000000002E22A0 kread()+80: { .mii
==>0:   adds    r9 = 0, r8
   1:   nop.i   0 ;;
   2:   cmp4.eq p6, p7 = 0, r9
> c all <== we clear all break point when the job is done
> br <== list break points
No Active Breakpoints
> go <== exit IADB
See Ya!
IADB dump/display/decode sub commands
Introduction

The following table represents the dump/display/decode sub commands and their
matching crash/lldb sub commands when available :

dump/display/decode function   crash/lldb sub commands   IADB sub commands                  iadb sub commands
display byte data              N/A                       d (ordinal 1)                      d (ordinal 1)
display word data              od (2 units)              d (ordinal 4)                      d (ordinal 4)
display double word data       od (4 units)              d (ordinal 8)                      d (ordinal 8)
display code                   decode/od (format I)      dis                                dis
display registers              N/A                       b/cfm/fpr/iip/iipa/ifa/intr/       b/cfm/fpr/iip/iipa/ifa/intr/
                                                         ipsr/isr/itc/kr/p/perfr/r/rr/rse   ipsr/isr/itc/kr/p/perfr/r/rr/rse
display device byte            N/A                       dio (ordinal 1)                    dio (ordinal 1)
display device half word       N/A                       dio (ordinal 2)                    dio (ordinal 2)
display device word            N/A                       dio (ordinal 4)                    dio (ordinal 4)
display device double word     N/A                       dio (ordinal 8)                    dio (ordinal 8)
display physical memory        N/A                       dp                                 ***TBD***
display pci config space       N/A                       dpci                               ***TBD***
find pattern                   N/A                       find                               find
extract pattern                N/A                       N/A                                N/A

d sub command

The d sub command can be used to display virtual memory using the following
parameters :
• address : address or symbol to dump
• ordinal : number of bytes per access (1, 2, 4 or 8)
• number : number of elements to dump (of size 'ordinal')
• none : continue dumping from the previous d sub command

dp sub command

The dp sub command can be used to display physical memory using :
• address : physical address to dump
• ordinal : number of bytes per access (1, 2, 4 or 8)
• count : number of elements to dump (of size 'ordinal')
dio sub command

The dio sub command can be used to display the I/O space using the following
parameters :
• port : I/O port address to dump
• ordinal : number of bytes per access (1, 2, 4 or 8)
• count : number of elements to dump (of size 'ordinal')

dis sub command

The dis sub command can be used to list instructions at a defined address using :
• address : address or symbol to disassemble
• count : number of bundles to disassemble

registers sub commands

The following sub commands can be used to display register information :
• b : Display Branch Register(s)
• cfm : Display Current Stacked Register
• fpr : Display FPR(s) (f0 - f127)
• iip : Display or Modify Instruction Pointer
• iipa : Display Instruction Previous Address
• ifa : Display Fault Address
• intr : Display Interrupt Registers
• ipsr : Display/Decode IPSR
• isr : Display/Decode ISR
• itc : Display Time Registers ITC ITM & ITV
• kr : Display Kernel Register(s)
• p : Display Predicate Register(s)
• perfr : Display Performance Register(s)
• r : Display General Register(s)
• rr : Display Region Register(s)
• rse : Display Register Stack Registers

dpci sub command

The dpci sub command can be used to display pci device configuration space
using the following parameters :
• bus : Hardware bus number of target PCI bus
• dev : PCI Device Number of target PCI device
• function : PCI Function Number of target PCI device
• register : Configuration register offset to read
• ordinal : Size of access to make (1, 2, 4 or 8)
Examples
>CPU0> d dbg_avail <== Display virtual memory at dbg_avail
E00000000407D6E0: 0000000000000001
>CPU0> dp 0x1000 2 5 <== Display 5 half-words from physical address 0x1000
0000000000001000: 0000 0000 0000 0000 0000
>CPU0> dio 0x3f6 1 8 <== Display 8 bytes from port 0x3F6
00000FFFFC0FDBF6: 50FF000000006F60  P.....o‘
>CPU0> dis kread <== Disassemble from kread
E0000000002E2220 kread()+0: { .mii
   0:   alloc   r35 = ar.pfs, 5, 0, 5, 0
   1:   adds    sp = -0xA0, sp
   2:   mov     r36 = rp ;; }
>CPU0> dpci 0 0x58 0 0x20 4 <== Display a 4-byte word from PCI config
register 0x20 for device 0x58, function 0, on bus 0
PCI Config Space Bus 0, Dev 0x58, Fnc 0:
reg 20: FFFFFFFF
>CPU0> d enter_dbg <== Display virtual memory at enter_dbg
E0000000040CF150: 0000000000000000
>CPU0> m enter_dbg 4 0x43 <== Modify enter_dbg with a 4-byte store of data 0x43
E0000000040CF150: 00000043
>CPU0> d enter_dbg <== Display virtual memory at enter_dbg
E0000000040CF150: 0000000000000043
>CPU0> dp 0x5000 <== Display physical memory at location 0x5000
0000000000005000: FFFFFFFFFFFFFFFF
>CPU0> mp 0x5000 8 0x1122334455667788 <== Modify physical memory at location
0x5000 with an 8-byte store of data 0x1122334455667788
0000000000005000: 1122334455667788
>CPU0> dp 0x5000 <== Display physical memory at location 0x5000
0000000000005000: 1122334455667788
IADB modify memory sub commands
Introduction

The following table represents the modify memory sub commands and their
matching crash/lldb sub commands when available :

modify memory function          crash/lldb sub commands   IADB sub commands   iadb sub commands
modify sequential bytes         alter -c                  m                   N/A
modify sequential word          alter -w                  m                   N/A
modify sequential double word   alter -l                  m                   N/A
modify registers                N/A                       b/iip/kr/p/r/rr     N/A
modify device byte              N/A                       mio                 N/A
modify device half word         N/A                       mio                 N/A
modify device double word       N/A                       mio                 N/A
modify physical memory          N/A                       mp                  N/A

m sub command

The m sub command can be used to modify virtual memory contents using :
• addr : symbol or virtual address to modify
• ordinal : size of each data element (1, 2, 4 or 8)
• data1 : first data element to be stored with access of size 'ordinal'
• data2.. : subsequent data elements to be stored

mp sub command

The mp sub command can be used to modify physical memory contents with the
following parameters :
• addr : physical address to modify
• ordinal : size of each data element (1, 2, 4 or 8)
• data1 : first data element to be stored with access of size 'ordinal'
• data2.. : subsequent data elements to be stored
registers sub commands

The following sub commands can be used to modify register contents :
• b : Set Branch Register(s)
• iip : Modify Instruction Pointer
• kr : Set Kernel Register(s)
• p : Set Predicate Register(s)
• r : Set General Register(s)
• rr : Set Region Register(s)

mio sub command

The mio sub command can be used to modify I/O space using :
• addr : I/O port address to modify
• ordinal : size of each data element (1, 2, 4 or 8)
• data1 : first data element to be stored with access of size 'ordinal'
• data2.. : subsequent data elements to be stored

Examples
>CPU0> b <== Display branch registers
b00:E00000000008E050 waitproc()+1B0
b01:BADC0FFEE0DDF00D
b02:BADC0FFEE0DDF00D
b03:BADC0FFEE0DDF00D
b04:BADC0FFEE0DDF00D
b05:BADC0FFEE0DDF00D
b06:E00000000008DEA0 waitproc()+0
b07:BADC0FFEE0DDF00D
>CPU0> iip <== Display instruction pointer
IIP : E00000000008E000:waitproc()+160
>CPU0> kr <== Display all kernel registers
kr0:00000FFFFC000000  kr1:0000000000000000
kr2:0000000000000000  kr3:0000000000000000
kr4:C000006013220000  kr5:0200000000000000
kr6:C00000601324CCC0  kr7:C000006013200000
>CPU0> p <== Display all predicate registers
p00:1 p01:0 p02:0 p03:0 p04:0 p05:0 p06:0 p07:1
p08:0 p09:0 p10:0 p11:0 p12:0 p13:0 p14:0 p15:0
p16:0 p17:0 p18:0 p19:0 p20:0 p21:0 p22:0 p23:0
p24:0 p25:0 p26:0 p27:0 p28:0 p29:0 p30:0 p31:0
p32:0 p33:0 p34:0 p35:0 p36:0 p37:0 p38:0 p39:0
p40:0 p41:0 p42:0 p43:0 p44:0 p45:0 p46:0 p47:0
p48:0 p49:0 p50:0 p51:0 p52:0 p53:0 p54:0 p55:0
p56:0 p57:0 p58:0 p59:0 p60:0 p61:0 p62:0 p63:0
>CPU0> r <== Display all general registers
r00:BADC0FFEE0DDF00D [0]   r16:E00000971404B008 [0]
r01:E000000004002818 [0]   r17:00000000C0000000 [0]
r02:BADC0FFEE0DDF00D [0]   r18:0000000000000014 [0]
r03:BADC0FFEE0DDF00D [0]   r19:0000000040000000 [0]
r04:BADC0FFEE0DDF00D [0]   r20:E00000971404D000 [0]
r05:BADC0FFEE0DDF00D [0]   r21:0000000000000000 [0]
r06:BADC0FFEE0DDF00D [0]   r22:0000000000000000 [0]
r07:BADC0FFEE0DDF00D [0]   r23:E00000971404D008 [0]
r08:0000000000000000 [0]   r24:0000000000000000 [0]
r09:0000000000000000 [0]   r25:BADC0FFEE0DDF00D [0]
r10:0000000000000002 [0]   r26:BADC0FFEE0DDF00D [0]
r11:0000000080000000 [0]   r27:BADC0FFEE0DDF00D [0]
r12:0003FEFFF3FFF7C0 [0]   r28:BADC0FFEE0DDF00D [0]
r13:E00000971405C600 [0]   r29:BADC0FFEE0DDF00D [0]
r14:E00000971404C02C [0]   r30:BADC0FFEE0DDF00D [0]
r15:E00000971404B028 [0]   r31:BADC0FFEE0DDF00D [0]
r32:C000006013200000 [0]   r33:C000006013200290 [0]
r34:E00000971404B11C [0]   r35:E00000971404B120 [0]
r36:E0000000040C6060 [0]   r37:E0000000040C6068 [0]
r38:0000000000000186 [0]   r39:0000000000000009 [0]
r40:0000000000000001 [0]   r41:0000000000000001 [0]
>CPU0> rr <== Display all region registers
rr0:0000000000480931  rr1:0000000000200431
rr2:0000000000280531  rr3:0000000000000030
rr4:0000000000000030  rr5:0000000000180331
rr6:0000000000100269  rr7:0000000000080131
>CPU0> mio 0x408 8 0 <== Modify I/O port 0x408 with an 8-byte store of data 0
IADB name list/symbol sub commands
The following table represents the name list/symbol sub commands and their
matching crash/lldb sub commands when available :

name list/symbol function   crash/lldb sub commands   IADB sub commands   iadb sub commands
translate symbol to eaddr   nm                        map                 map
no symbol mode (toggle)     N/A                       N/A                 N/A
translate eaddr to symbol   ts/ds                     map                 map

map sub command

The map sub command can be used to translate a symbol into an address, and the
reverse, and so accepts the following as parameter :
• symbol : symbol to show the address for
• address : address to show the symbol for
Examples
>CPU0> map (r34) <== Lookup symbol for address in r34
>CPU0> map 0xe000000000000000 <== Lookup symbol for
address 0xe000000000000000
>CPU0> map foo+0x100 <== Lookup symbol for symbol
‘foo’+0x100
IADB watch break point sub commands
Introduction

The following table represents the watch break point sub commands and their
matching crash/lldb sub commands when available :

watch break point function   crash/lldb sub commands   IADB sub commands   iadb sub commands
stop on read data            N/A                       dbr r               N/A
stop on write data           N/A                       dbr w               N/A
stop on r/w data             N/A                       dbr rw              N/A
local stop on read data      N/A                       N/A                 N/A
local stop on write data     N/A                       N/A                 N/A
local stop on r/w data       N/A                       N/A                 N/A
clear watch                  N/A                       cdbr                N/A
local clear watch            N/A                       N/A                 N/A

dbr sub command

The dbr command can be used to set break points on data access using :
• action : the action to watch for :
  • r = Break on Read
  • w = Break on Write
  • rw = Break on Read or Write
• mask : bit mask of which address bits to match
• plvl_mask : bit mask of which privilege levels to match :
  • 0x1 = CPL 0 (Kernel)
  • 0x2 = CPL 1 (unused)
  • 0x4 = CPL 2 (unused)
  • 0x8 = CPL 3 (User)
• addr : the address to trigger on

cdbr sub command

The cdbr sub command can be used to clear previously set data break points
using :
• index : index of the DBR breakpoint (from the dbr command)
• all : clear all DBRs
Examples
>CPU0> dbr <== Display all current breakpoints
>CPU0> dbr foo <== Break on access to ‘foo’
>CPU0> dbr -t foo <== Break on any access to ‘foo’ for
current thread
>CPU0> cdbr 3 <== Clear DBR in slot 3
>CPU0> cdbr 0xe000000000011cc0 <== Clear DBR at address
0xe000000000011cc0
IADB machine status sub commands
Introduction

The following table represents the machine status sub commands and their
matching crash/lldb sub commands when available :

machine status function   crash/lldb sub commands   IADB sub commands   iadb sub commands
system status message     stat                      sys+reason          N/A
switch thread             N/A                       N/A                 N/A

sys sub command

The sys sub command will display the following information :
• Build level and build date
• Number and type of processors
• Memory size
• Processor speed
• Bus speed

reason sub command

The reason sub command will display the reason why the debugger was entered,
along with the IP and the assembly code of the bundle at that IP.
Examples
>CPU1> sys <== Display system information
Kernel     : AIX 0036E_500IA, Built on Sep 27 2000 at 14:52:02
Memory     : 1023MB
Processors : 2 Itanium, Stepping 0
Proc Speed : 665374960 HZ
Bus Speed  : 133074992 HZ
>CPU1> reason <== Display reason debugger was entered
Debugger entered via keyboard with key in SERVICE position using numpad 4
IP->E00000000008E000 waitproc()+160: { .mii
==>0:   alloc   r35 = ar.pfs, 5, 0, 5, 0
   1:   adds    sp = -0xA0, sp
   2:   mov     r36 = rp ;; }
IADB kernel extension loader sub commands
Introduction

The following table represents the kernel extension loader sub commands and
their matching crash/lldb sub commands when available :

kernel extension loader function   crash/lldb sub commands   IADB sub commands   iadb sub commands
list loaded extension              le                        kext                N/A
list loaded symbol tables          N/A                       ldsyms/unldsyms     N/A
remove symbol table                N/A                       unldsyms            N/A
list export tables                 N/A                       N/A                 N/A

kext sub command

The kext sub command will display all loaded kernel extensions and their text and
data load addresses.

ldsyms and unldsyms sub commands

The ldsyms and unldsyms sub commands will load or unload kernel extension
symbols using :
• -p [path] : where path is the absolute file path of the kernel extension
• module : the module name

Examples

(0)> kext <== list loaded kernel extensions
.
.
Name      : /usr/lib/drivers/isa/kbddd
TextMapped: 0xE000009729630000 to 0xE000009729645FFF, Size: 0x00016000
DataMapped: 0xE000009729660000 to 0xE000009729665FFF, Size: 0x00006000
UnwindTBL : 0xE000009729644BA8 to 0xE0000097296453E7, Size: 0x00000840
TextStart : 0xE000009729630120
Load count: 2
Use count : 0
.
.
(0)> nm kbdconfig <== try to get the address of the kbdconfig symbol
Symbol not found
(0)> ldsyms kbddd <== load the kbddd symbols
(0)> nm kbdconfig <== now nm works
kbdconfig : e000009729639560
IADB address translation sub commands
Introduction
The following table lists the address translation sub commands and their matching crash/lldb sub commands, when available.

address translation

function                     crash/lldb sub commands    iadb sub commands
translate to real address                               x
display MMU translation

parameters

x addr

where:
• addr = symbol or virtual address to translate
Examples
>CPU0> x foo+0x4000    <== Display the physical translation for foo+0x4000
>CPU0> x 0x20000000    <== Display the physical translation for virtual address 0x20000000
>CPU0> x (r1)          <== Display the physical address in r1
IADB process/thread sub commands
Introduction
The following table lists the process/thread sub commands and their matching crash/lldb sub commands, when available.

process

function                           crash/lldb sub commands    iadb sub commands
display per processor data area    ppda                       ppda
display interrupt handler                                     ppda
display mst area                   mst                        mst
display process table              proc                       pr
display thread table               th                         th
display thread tid                 th                         th
display thread pid                 th                         th
display user area                  user/du                    us
display run queue                                             rq
display sleep queue                                           sq

ppda
The ppda sub command will display the Per Processor Descriptor Area and accepts the following parameter:
• cpu : which CPU's ppda to display (logical numbering)
mst
The mst sub command will display the Machine State Stack using :
• addr : address of an MST to display
pr

The pr sub command will display process information using:
• -p {value} : for the process where PID == {value}
• -s {value} : for the process in slot {value}
• -v {value} : for proc struct pointer == {value}
• -a : detailed display for all processes
• * : process table display
th
The th sub command will display thread information related to:
• -s {slot} : detailed thread info for the thread in 'slot'
• -t {tid} : detailed thread info for thread 'tid'
• -v {thrdptr} : detailed thread info for thread pointer 'thrdptr'
• -a : detailed thread info for all threads
• * : display the thread table

us

The us sub command will display user structure information for:
• -p : process id (PID)
• -t : thread id (TID)
• * : all processes
rq

The rq sub command will return run queue information related to:
• -b {bucket} : detailed info for threads in bucket of all run queue slots
• -g : global info for run queues
• -q [number] : detailed info for all queues
• -v {address} : detailed info for threads at run queue address

sq

The sq sub command will display the sleep queue related to:
• -b {bucket} : detailed info for threads in 'bucket'
• -v {address} : detailed info for threads at sleep queue 'address'
Examples
***TBD
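The examples are still to be done in this draft; for illustration only (the slot, tid, bucket, and PID values below are placeholders, and the output is omitted), invocations follow the usual pattern:

>CPU0> pr *           <== display the process table
>CPU0> pr -s 33       <== detailed info for the process in slot 33
>CPU0> th -t 0x2143   <== detailed info for thread tid 0x2143
>CPU0> us -p 10346    <== user structure for PID 10346
>CPU0> rq -g          <== global run queue info
>CPU0> sq -b 5        <== threads in sleep queue bucket 5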
IADB LVM sub commands
Introduction
The following table lists the LVM sub commands and their matching crash/lldb sub commands, when available.

LVM

function                   crash/lldb sub commands    iadb sub commands
display physical buffer
display volume group
display physical volume
display logical volume
parameters
Examples
IADB SCSI sub commands
Introduction
The following table lists the SCSI sub commands and their matching crash/lldb sub commands, when available.

SCSI

function          crash/lldb sub commands    iadb sub commands
display ascsi
display vscsi
display scdisk
parameters
Examples
IADB memory allocator sub commands
Introduction
The following table lists the memory allocator sub commands and their matching crash/lldb sub commands, when available.

memory allocator

function                   crash/lldb sub commands    iadb sub commands
display kernel heap        xmalloc                    xmalloc
display kernel xmalloc     xmalloc                    xmalloc
display heap debug
display kmem buckets
display kmem statistics
parameters
Examples
IADB file system sub commands
Introduction
The following table lists the file system sub commands and their matching crash/lldb sub commands, when available.

file system

function                    crash/lldb sub commands    iadb sub commands
display buffer
display buffer hash table
display freelist
display gnode
display gfs
display file                file
display inode               inode
display inode hash table
display inode cache list
display rnode
display vnode               vnode                      vnode
display vfs                 vfs                        vfs
display specnode
display devnode
display fifo node
display hnode hash table
parameters
Examples
IADB system table sub commands
Introduction
The following table lists the system table sub commands and their matching crash/lldb sub commands, when available.

system table

function                               crash/lldb sub commands    iadb sub commands
display var                            var                        var
display devsw table                    devsw                      dev
display system timer request blocks
display simple lock                                               lock -s
display complex lock                                              lock -c
display ipl proc information                                      iplcb
display trace buffer

dev

The dev sub command will display the device switch table using:
• major : major number slot to display

iplcb

The iplcb sub command will display the IPL control block.
Examples
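No examples are given in this draft; as an illustrative sketch (the major number is a placeholder), the invocations follow the usual pattern:

>CPU0> dev 2     <== display the devsw entry for major number 2
>CPU0> iplcb     <== display the IPL control block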
IADB network sub commands
Introduction
The following table lists the network sub commands and their matching crash/lldb sub commands, when available.

network

function            crash/lldb sub commands    iadb sub commands
display interface                              netstat
display TCBs
display UDBs
display sockets                                sock
display TCP CB
display mbuf                                   mbuf
parameters
Examples
IADB VMM sub commands
Introduction
The following table lists the VMM sub commands and their matching crash/lldb sub commands, when available.

VMM

function                     crash/lldb sub commands    iadb sub commands
VMM kernel segment data
VMM RMAP                                                vmm-rmap
VMM control variables
VMM statistics
VMM Addresses
VMM paging device table                                 vmm-pdt
VMM segment control blocks                              vmm-scb
VMM PFT entries                                         vmm-pft
VMM PTE entries                                         vmm-pte
VMM PTA segment                                         vmm-pta
VMM STAB
VMM segment register                                    sr64
VMM segment status                                      segst64
VMM APT entries                                         vmm-apt
VMM wait status              u -64                      u -64
VMM address map entries                                 vmm-ame
VMM zeroing kproc
VMM error log
VMM reload xlate table
IPC information                                         vmm-sem/shm
VMM lock anchor/tblock
VMM lock hash table
VMM lock word
VMM disk map
VMM spin locks
parameters
Examples
IADB SMP sub commands
Introduction
The following table lists the SMP sub commands and their matching crash/lldb sub commands, when available.

SMP

function         crash/lldb sub commands    iadb sub commands
Start cpu        cpu                        cpu
Stop cpu         cpu                        cpu
Switch to cpu    cpu                        cpu
cpu
The cpu command can be used to display or change the current cpu you are
working on using :
• num : logical CPU number to switch to
Examples
>CPU0> cpu 1    <== Switch the debug process to processor 1
AIX/IA64 KERNEL DEBUGGER ENTERED Due to...
Debugger entered via MPC stop
IP->E00000000008C7F2 waitproc_find_run_queue()+F2: { .mii
   0: adds   r20 = 0x1, r10
   1: shr.u  r19 = r11, r10
   ;;
==>2: and    r21 = r17, r19 }
>CPU1>
IADB block address translation sub commands
Introduction
The following table lists the block address translation sub commands and their matching crash/lldb sub commands, when available.

block address translation

function         crash/lldb sub commands    iadb sub commands
display dbats
display ibats
modify dbats
modify ibats
parameters
Examples
IADB bat/brat sub commands
Introduction
The following table lists the bat/brat sub commands and their matching crash/lldb sub commands, when available.

bat/brat

function                      crash/lldb sub commands    iadb sub commands
branch target
clear branch target
local branch target
clear local branch target
parameters
Examples
IADB miscellaneous sub commands
Introduction
The following table lists the miscellaneous sub commands and their matching crash/lldb sub commands, when available.

miscellaneous

function                                       crash/lldb sub commands    iadb sub commands
reboot the machine
display help                                   help/?                     help
run an aix command                             !                          !
set kdbx compatibility                                                    kdbx
exit                                                                      go
set debugger parameters                        set                        set
display elapsed time
enable/disable debug                                                      set
calculate/convert an hexadecimal expression                               calc
calculate/convert a decimal expression
help sub command

The help sub command can be used without a parameter to display the command listing, or with a command as parameter to display help related to that command.
kdbx sub command
The kdbx sub command can be used to set the symbol needed to use kdb with the
kdbx interface.
The following variables are set by kdbx and will modify output of certain sub
commands :
• kdbx_addrd : Display breakpoint address instead of symbol name
• kdbx_bindisp : Display output in binary format instead of ASCII format
go sub command
The go sub command is used to leave the debugger; this will start the dump process if the debugger was entered while the system was crashing.
set sub command
The set sub command can be used to set or display the following kdb parameters :
• rows=number : set number of rows on current display
• mltrace={on|off} : mltrace on/off; only on DEBUG kernel
• sctrace={on|off} : verbose syscall prints on/off; only on DEBUG kernel
• itrace={on|off} : enable/disable tracing on/off; only on DEBUG kernel
• umon={on|off} : enable/disable umon performance tool
• exectrace={on|off} : verbose exec prints on/off; only on DEBUG kernel
• excpenter={on|off} : debugger entry on exception on/off
• ldrprint={on|off} : verbose loader prints on/off; only on DEBUG kernel
• kprintvga={on|off} : kernel prints to VGA on/off
• dbgtty={on|off} : use debugger TTY as console on/off
• dbgmsg={on|off} : Tee Console and LED output to TTY
• hotkey={on|off} : enter debugger on key press on/off; only on DEBUG kernel
Examples
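The draft provides no examples; as illustrative invocations (the values are placeholders):

>CPU0> set               <== display the current debugger parameters
>CPU0> set rows=24       <== set the display to 24 rows
>CPU0> set hotkey=on     <== enter the debugger on key press (DEBUG kernel only)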
Exercise
Introduction
In this exercise you will configure the system to enable the live debugger
and invoke both the live and image debugger for your system.
Complete the following steps:

Step  Action
1.    Enable the Memory Overlay Detection System (MODS) using the bosdebug command.
2.    Enable the live debugger with the bosboot command.
3.    Reboot the system, and login as root.
4.    Verify MODS is enabled with the debugger:
          > stat
      xmalloc debug: ________________
5.    Verify the debugger is available:
      PowerPC: kdb
          > dw kdb_avail
          > q
      IA-64: iadb
          > d dbg_avail
          > go
6.    Execute the following truss command:
          # truss -t kread -i ksh
      Hit the enter key. How many kread functions were executed? __________
      Enter the exit command to exit truss:
          # exit
Step  Action
7.    Change directory to /var/adm/ras.
8.    Start the image debugger against the crash dump captured in the previous lesson.
9.    Execute the following commands:
      • iadb: reason
        Why was the debugger entered? ___________________________
      • kdb: p *  or  iadb: pr *
        What is the process id for the errdemon? ____________________________
      • Execute the ls command: kdb: !ls  or  iadb: ! ls
      • iadb: sys
        What build of AIX5L was the crash dump taken on? __________________________
10.   Exit the debugger: q
11.   Enter the live debugger: Ctrl-Alt-NUMPAD4
12.   Enter the cpu command. What is the status of CPU0? ________________________________
13.   Exit the live debugger.
Unit 7. Process Management
Platform
This lesson is independent of platform.
Lesson Objectives
At the end of the lesson you will be able to:
• List and describe the states of a process.
• List the steps taken by the kernel to create a new process as the
result of a fork() system call, and the steps taken to create a new
thread of execution.
• Describe what happens when a process terminates.
• List the three thread models available in AIX 5.
• Identify the relationship between the internal structures proc,
thread, user and u_thread.
• Use the kernel debugging tool to locate and examine processes,
proc, thread, user and u_thread data structures.
• Manage process scheduling using available commands, manage
processes and threads on a SMP system (to best employ cache
affinity scheduling), and manage processes on a ccNUMA system
(to best employ quad affinity scheduling).
• List the factors determining what action the threads of a process
will take when a signal is received.
• Write a simple C program that uses the fork() system call to spawn new processes, uses the wait() system call to retrieve the exit status of a child process, creates a simple multi-threaded program by using the pthread_create() call, and uses the exec() system call to load a new program into memory.
Process Management Fundamentals
Process definition

A process can be defined by the list of items that build it. A process consists of:
• A process table entry
• A process ID (PID)
• Virtual address space
  - User area (u-area)
  - Program "text"
  - Data
  - User and kernel stacks
• Statistical information
Definition of process management

Process management consists of the tools and ability to have many processes and threads existing simultaneously in a system, and to share usage of the CPU or, in an SMP system, CPUs. Process management also includes the ability to start, stop, and force a stop of a process.
The tools and information used to manage the processes
• A process is a self-contained entity that consists of the information
required to run a single program, such as a user application.
• The kernel contains a table entry for each process called the proc entry.
• The proc entry contains information necessary to keep track of the
current state and location of page tables for the process.
• The proc entry resides in a slot in an array of proc entries.
• The kernel is configured with a fixed number of slots.
• All processes have a process ID or PID.
• The PID is assigned when the process is created and provides a
convenient way for users to refer to the other processes.
• The process contains a list of virtual memory addresses that the
process is allowed to access.
• The user-area (u_area) of a process contains additional information
about the process when it is running.
• The kernel tracks statistical information for the process, such as the
amount of time the process uses the CPU, the amount of memory the
process is using, etc. The statistical information is used by the kernel for
managing its resources and for accounting purposes.
Process operations fork() system call
Process
operations
Four basic operations define the lifetime of a process in the system:
• fork - Process creation
• exec - Loading of programs in process
• exit - Death of process
• wait - The parent process notification of the death of the child process.
Fork new processes

The fork system call is the way to create a new process:
• All processes in the system (except the boot process) are created
from other processes through the fork mechanism.
• All processes are descendants of the init process (process 1).
• A process that forks creates a child process that is nearly a duplicate
of the original parent process.
• The child has a new proc entry (slot), PID, and registers.
• Statistical information is reset, and the child initially shares most of
the virtual memory space with the parent process.
• The child process initially runs the same program as the parent
process. The child may use the exec() call to run another program.
The fork() system call

The parent process has an entry in the process and thread tables before the fork() system call; after the fork() system call, another independent process is created, with entries in the Process and Thread tables.

(Figure: the fork() system call. The parent process issues fork() to the AIX kernel; both the Process Table and the Thread Table gain a child entry alongside the parent entry.)
Inherited attributes after a fork() system call
The illustration shows what happens when the fork() system call is issued. The caller creates a child process that is almost an exact copy of the process itself. The child process inherits many attributes of the parent, but receives a new user block and data region.
The child process inherits the following attributes from the parent process:
• Environment
• Close-on-exec flags and signal handling settings
• Set user ID mode bit and Set group ID mode bit
• Profiling on and off status
• Nice value
• All attached shared libraries
• Process group ID and tty group ID
• Current directory and Root directory
• File-mode creation mask and File size limit
• Attached shared memory segments and Attached mapped file
segments
• Debugger process ID and multiprocess flag, if the parent process has
multiprocess debugging enabled (described in the ptrace subroutine).
Attributes not inherited from the parent process
Not all attributes are inherited from the parent. The child process differs
from the parent process in the following ways:
• The child process has only one user thread; it is the one that called the
fork subroutine, no matter how many threads the parent process had.
• The child process has a unique process ID.
• The child process ID does not match any active process group ID.
• The child process has a different parent process ID.
• The child process has its own copy of the file descriptors for the parent
process. However, each file descriptor of the child process shares a
common file pointer with the corresponding file descriptor of the parent
process.
• All semadj values are cleared.
• Process locks, text locks, and data locks are not inherited by the child
process.
• If multiprocess debugging is turned on, the trace flags are inherited from
the parent; otherwise, the trace flags are reset.
• The child process utime, stime, cutime, and cstime are set to 0.
• Any pending alarms are cleared in the child process.
• The set of signals pending for the child process is initialized to the
empty set.
• The child process can have its own copy of the message catalogue for
the parent process.
The fork() system call code example
The following code illustrates the usage of the fork() system call. After the call there will be two processes executing two different copies of the same code. A process can determine whether it is the parent or the child from the return code.
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int statuslocation;
    pid_t proc_id, proc_id2;

    proc_id = fork();
    if (proc_id < 0) {
        printf("fork error\n");
        exit(-1);
    }
    if (proc_id > 0) {
        /* Parent process: wait for the child to terminate */
        proc_id2 = wait(&statuslocation);
    }
    if (proc_id == 0) {
        /* I'm the child process */
        /* ............. */
    }
    return 0;
}
Listing processes with the ps command after fork()
Executing the test program creates two processes, which can be listed with the ps command. The program name in the example is fork, and that name is listed as the command for both the parent and the child. Note that the child's PPID is equal to the PID of the parent.

     F S UID   PID  PPID  C PRI NI ADDR  SZ TTY    TIME CMD
240001 A   0 10346 10236  0  60 20 5b8b 496 pts/1  0:00 ksh
200001 A   0 10742 10346  0  68 24 9bb3  44 pts/1  0:00 fork
     1 A   0 10990 10742  0  68 24 dbbb  44 pts/1  0:00 fork

Processes without the parent process
In the previous example, it was shown how the PID of the calling process becomes the PPID of the child process. This example shows what happens if the parent process terminates before the child process terminates. If we rewrite the program so that the parent process terminates after fork() without waiting for the child, the system will replace the PPID with 1, which is the init process. The init process will then pick up the SIGCHLD signal so that the system can free the process table entry, even though the parent process no longer exists. This situation is shown below:

     F S UID   PID  PPID  C PRI NI ADDR  SZ TTY    TIME CMD
240001 A   0 10346 10236  0  60 20 5b8b 496 pts/1  0:00 ksh
 40001 A   0 10996     1  0  68 24 8330  44 pts/1  0:00 fork
200001 A   0 11216 10346  3  61 20 dbbb 244 pts/1  0:00 ps

Zombie processes
If, for some reason, no processes receive the SIGCHLD signal from the
child, the empty slot will remain in the process table, even though other
resources are released. Such a process is called a zombie, and is listed in
ps as <defunct>. The example below shows some of these zombie
processes.
     F S UID   PID  PPID  C PRI NI ADDR   SZ TTY    TIME CMD
200003 A   0     1     0  0  60 20 500a  704 -      0:03 init
240401 A   0  2502     1  0  60 20 d2da   40 -      0:00 uprintfd
240001 A   0  2622  2874  0  60 20 2965 5208 -      0:46 X
 40001 A   0  2874     1  0  60 20 c959  384 -      0:00 dtlogin
 50005 Z   0  3776     1  0  68 24                  0:00 <defunct>
 40401 A   0  3890     1  0  60 20 91d2  480 -      0:00 errdemon
240001 A   0  4152     1  0  60 20 39c7   88 -      0:21 syncd
240001 A   0  4420  4648  0  60 20 4b29  220 -      0:00 writesrv
240001 A   0  4648     1  0  60 20 b1d6  308 -      0:00 srcmstr
 50005 Z   0 10072     1  0  68 24                  0:00 <defunct>
 50005 Z   0 10454     1  0  68 24                  0:00 <defunct>
Process operations exec() system call
Exec system call to load a new program
The exec subroutine does not create a new process; it loads a new
program into the process.
• To execute a new program, a process uses the exec set of system
calls to load the new program into memory and execute the program.
• Each program can successively exec other programs to load and
execute in the process.
Valid program files for the exec() system call
The fork() system call creates a new process with a copy of the
environment, and the exec() system call loads a new program into the
current process, and overlays the current program with a new one (which
is called the new-process image). The new-process image file can be one
of three file types:
• An executable binary file in XCOFF file format.
• An executable text file that contains a shell procedure.
• A file that names an executable binary file or shell procedure to be run.
Inherited attributes after the exec() system call
The new-process image inherits the following attributes from the calling
process image: session membership, PID, PPID, supplementary group
IDs, process signal mask, and pending signals.
The exec() system call

The illustration shows how the process and thread tables remain unchanged after the exec() system call.
(Figure: the exec() system call. The process issues exec(); its existing entries in the Thread Table and the Process Table are reused unchanged.)
The exec() system call code example
The following code illustrates the usage of the execv() system call. After the call, the current process will be overlaid with the new program. To illustrate the function, the output from the program is listed after the program.

The program first defines two variables. The first is a pointer to the program name to be executed, and the second is a pointer to the arguments (by convention, the first argument passed is the program name itself). The program source for sleeping.c is not supplied, as any program can be used for this example.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int returncode;
char *argumentp[4], arg1[50], arg2[50], arg3[50];
const char *Path = "/home/olc/prog/thread/sleeping";

int main(int argc, char **argv)
{
    strcpy(arg1, "/home/olc/prog/thread/sleeping");
    strcpy(arg2, "test param 1");
    strcpy(arg3, "test param 2");
    argumentp[0] = arg1;
    argumentp[1] = arg2;
    argumentp[2] = arg3;
    argumentp[3] = NULL;  /* execv() expects a NULL-terminated argument array */

    printf("before execv\n");
    returncode = execv(Path, argumentp);
    printf("after execv\n");  /* reached only if execv() fails */
    exit(0);
}
and the program output:
before execv
I’m the sleeping process
While the program in the example is being executed, we can examine the process status with the ps command. Notice that the program name for the example is "exec," and the program name for the called program is "sleeping." As we see in the listing from the ps command, the current program is replaced with the new one, and we never reach the print statement "after execv\n". The program prints "I'm the sleeping process," because the main program has been replaced with the program in the path variable. If we look closer at the output from the ps -l command before and after the system call, we can tell that the program name has been replaced, but the process ID and PPID remain the same.
Before the exec() system call takes place:

#> ps -l
     F S UID   PID  PPID  C PRI NI ADDR  SZ TTY    TIME CMD
240001 A   0 10346 10236  0  60 20 5b8b 492 pts/1  0:00 ksh
200001 A   0 10696 10346  2  61 20 6bad 240 pts/1  0:00 ps
200001 A   0 10964 10346  0  68 24 4388  40 pts/1  0:00 exec
And after the exec() system call, the exec program is replaced with sleeping:

#> ps -l
     F S UID   PID  PPID  C PRI NI ADDR TTY    TIME CMD
240001 A   0 10346 10236  0  60 20 5b8b pts/1  0:00 ksh
200001 A   0 10698 10346  2  61 20 a354 pts/1  0:00 ps
200001 A   0 10964 10346  0  68 24 4388 pts/1  0:00 sleeping
Process operations exit system call
Exit: what happens when a process terminates

The exit system call is executed at the end of every process. The system call cleans up and releases memory, text, and data, but leaves an entry in the process table so that a return value and other status information can be passed to the parent process if needed.
• exit - termination of a process
• When a program no longer needs to run or execute other programs, it can exit.
• A program that exits causes the process to enter the zombie state.
Exiting from a program

There are basically three ways that a process can terminate: the program can reach the end of the program flow and meet an explicit exit(exit_value) statement; the program flow can end without an exit() statement (in which case the linker automatically inserts a call to the exit system call); or the running program receives a signal from an external source, such as a keyboard interrupt (<Ctrl-c>) from the user. If the program receives an interrupt, the program path will switch to the interrupt handling routine, either in the program or the system default routine, which will terminate the program with an exit.

When executing the exit() system call, all memory and other resources are freed, and the parameter supplied to exit() is placed in the process table as the exit value for the process. After the completion of the exit() system call, a SIGCHLD signal is issued to the parent process (the process at this stage is nothing but the process table entry). This state is called the zombie state. When the parent process reacts to the SIGCHLD signal and reads the return code from the process table, the system can remove the process table entry, clean up, and free the slot.

On rare occasions when the parent process cannot respond to the signal immediately, we can see the zombie in the process table with the ps command. A zombie will be listed as <defunct>.
Process operations, wait() system call
Waiting for the death of a child process

The wait system call is placed at the end of a program; normally it is placed there by the programmer as the system call wait(), but if not, the system will automatically add one. The wait call is used to notify the parent process of the death of the child process and to release the child's process slot.
• The parent process can be notified of the death of the child by waiting with a system call or catching the proper signal.
• Once the parent process acknowledges the death of a child process, the child process' slot is freed.
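As a minimal sketch (the exit value 42 is arbitrary and purely illustrative), a parent can retrieve a child's exit status with waitpid():

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();

    if (pid == 0) {
        exit(42);                               /* child: die with a distinctive value */
    } else if (pid > 0) {
        int status;
        pid_t dead = waitpid(pid, &status, 0);  /* block until the child dies */
        if (WIFEXITED(status))
            printf("child %d exited with value %d\n",
                   (int)dead, WEXITSTATUS(status));
    }
    return 0;
}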
Process states
Process states
In AIX, processes can be in one of five states:
• Idle
• Active
• Stopped
• Swapped
• Zombie
Idle state
When processes are being created, they are first in the idle state. This
state is temporary until the fork mechanism is able to allocate all of the
necessary resources for the creation and fill in the process table for a new
process.
Active state
Once the new child process creation is completed, it is placed in the active
state. The active state is the normal process state, and threads in the
process can be running or be ready-to-run.
Stopped processes

Processes can also be stopped, or in a stopped state. Processes can be stopped by the SIGSTOP signal. Stopped processes can be restarted by the SIGCONT signal. If a process is stopped, all threads are in the stopped state.
Swapped processes

If a process is swapped, it means that another process is running, and the process, or any of its threads, cannot run until the scheduler makes it active again.
Zombie process
When a process terminates with an exit system call, it first goes into the zombie state; such processes have most of their resources freed. However, a small part of the process remains, such as the exit value that the parent process uses to determine why the child process died. If the parent process issues a wait system call, the exit status is returned to the parent, the remaining resources of the child process are freed, and the process ceases to exist. The slot can then be used by another newly created process.

If the parent process no longer exists when a child process exits, the init process frees the remaining resources held by the child. Sometimes we can see a zombie staying in the process list for a longer time; one example of this situation could be that a process exited, but the parent process is busy or waiting in the kernel and unable to read the return code.
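A minimal sketch that produces such a zombie on purpose (the sleep length is arbitrary); while the parent sleeps, the dead child shows up in ps as <defunct>:

#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    if (fork() == 0)
        exit(0);     /* child terminates immediately */
    /* parent never calls wait(), so the child remains
     * a zombie until the parent itself exits */
    sleep(60);
    return 0;
}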
State transitions for AIX processes

The illustration shows how a process is started with a fork() system call and turns into an active process, and how an active process can change between the swapped, active, and stopped states. A terminating process becomes a zombie until the entire process is removed.

(Figure: process state transitions. A process is Idle at fork(), then Active, moving between Swapped, Active, and Stopped; on termination it becomes a Zombie, then non-existing.)
Kernel Processes
Kernel processes - Kproc

Kernel processes:
• Are created by the kernel.
• Have a private u-area/kernel stack.
• Share "text" and data with the rest of the kernel.
• Are not affected by signals.
• Cannot use shared library object code or other user protection domain code.
• Run in the Kernel Protection Domain.

Some processes in the system are kernel processes. Kernel processes are created by the kernel itself to execute independently of thread actions. Even though a kernel process shows up in the process table through "Berkeley" ps, it is part of the kernel. The scheduler is one example of a kernel process. Kernel processes are scheduled like user processes, but tend to have higher priorities.
Kernel processes can have multiple threads, as can user processes.
Thread Fundamentals
Thread definition

Like a process, a thread can be defined by its separate components. A thread consists of:
• A thread table entry
• A thread ID (TID)

Processes and threads

• A process holds the address space.
• A thread holds an execution context.
• Multiple threads can run within one process.
  - One CPU can run one thread at a time; on SMP systems, threads can run truly concurrently.
• Threads allow multiple execution units to share the same address space.
• The thread is the fundamental unit of execution.
• A thread has an ID (TID), like a process has an ID (PID).
• A thread is an independent flow of control within a process.
• In a multi-threaded process, each thread can execute different code concurrently.
• Managing threads needs fewer resources than managing processes.
• Inter-thread communication is more efficient than inter-process communication, especially because variables can be shared.
Threads share data and address space

Threads reduce the need for IPC operations, because they allow multiple execution units to share the same address space, and thereby easily share data. On the other hand, this adds complexity and risk to the programming; for example, synchronization and locking have to be controlled by the threads.
Threads are the unit of execution

The thread is the fundamental unit of execution, and the scheduler and dispatcher only work with threads. Therefore, every process has at least one thread.
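As a minimal sketch (the thread body and message are illustrative only; on AIX, threaded programs are typically compiled with the _r compiler variants such as cc_r or xlc_r), a process creates a second flow of control with pthread_create():

#include <pthread.h>
#include <stdio.h>

/* Illustrative thread body: print a message and return */
static void *worker(void *arg)
{
    printf("hello from thread %s\n", (const char *)arg);
    return NULL;
}

int main(void)
{
    pthread_t tid;

    /* Create a second flow of control in the same address space */
    if (pthread_create(&tid, NULL, worker, "one") != 0)
        return 1;

    /* Wait for the worker thread to finish */
    pthread_join(tid, NULL);
    return 0;
}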
Thread IDs (TID) and Process IDs (PID)
TIDs are listed for all threads in the threads table; TIDs are always odd.
PIDs are listed for all processes in the process table; PIDs are always
even, except for the init process, where PID = 1. Threads represent
independent flows within a process; the system does not provide
synchronization, and the control must be in the thread itself.
In a multi-threaded process, each thread can execute different code concurrently, controlled by the program paths.
One of the main reasons for using threads is that managing threads
requires fewer resources than managing processes. Inter-thread
communication is more efficient than inter-process communication.
AIX Thread
AIX Threads
• A thread is an independent flow of control that operates within the same
address space as other independent flows of controls within a process.
In other operating systems, threads are sometimes called "lightweight
processes," or the meaning of the word "thread" is sometimes slightly
different.
• Multiple threads of control allow an application to overlap operations
such as reading from a terminal or writing to a disk file. This also allows
an application to service requests from multiple users at the same time.
• Multiple threads of control within a single process are required by
application developers to be able to provide these capabilities without
the overhead of multiple processes.
• Multiple threads of control within a single process allow application
developers to exploit the throughput of multiprocessor (MP) hardware.
TID format

Thread IDs have the following format for 32-bit kernels:

 31         24 23          8 7         1 0
+-------------+-------------+-----------+-+
| 0 0 0 0 0 0 |    INDEX    |   COUNT   |1|
+-------------+-------------+-----------+-+

And for 64-bit kernels the TID is 64 bits wide:

 63         56 55          8 7         1 0
+-------------+-------------+-----------+-+
| 0 0 0 0 0 0 |    INDEX    |   COUNT   |1|
+-------------+-------------+-----------+-+
• INDEX identifies the entry in the thread table corresponding to the
designated TID (thread[INDEX]).
• COUNT is a generation count that is intended to avoid the rapid
reallocation of TIDs. When a new TID is to be allocated, its value is
calculated on the first available thread table entry. Slots are recycled.
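A small sketch (field widths taken from the figure above; this is illustrative arithmetic, not a kernel API) showing how the thread table slot can be recovered from a TID:

#include <stdio.h>

/* INDEX is everything above the low 8 bits (COUNT plus the fixed 1) */
#define TID_INDEX(tid) ((unsigned long)(tid) >> 8)

int main(void)
{
    unsigned long tid = 0x002143;             /* TID from the kdb listing below */
    printf("slot = %lu\n", TID_INDEX(tid));   /* prints 33 */
    return 0;
}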
TID format listed with kdb
The following is a 64-bit slot in the thread table listed with kdb. The TID is 0x002143, so the INDEX is 0x21 and the COUNT is 0x43; 0x21 is 33 decimal, which, according to the figure, is the slot number in the thread table. The value is listed in the next line of the output from kdb.

(0)> thread 33
                SLOT NAME     STATE     TID PRI   RQ CPUID   CL WCHAN
pvthread+001080   33 sendmail SLEEP  002143 03C    0          0

If we look in memory at address pvthread+001080 we can see the 64-bit TID structure.

(0)> d pvthread+001080
pvthread+001080: 0000 0000 0000 2143 0000 0000 0000 0000
(0)>
Thread Concepts
Threads concepts
• An application is said to be thread safe when multiple threads in a
process can run the application successfully without data corruption.
• A library is thread safe when multiple threads can be running a routine
in that library without data corruption (another word for this is reentrant).
• A kernel thread is a thread of control managed by the kernel.
• A user thread is a thread of control managed by the application.
• User threads are attached to kernel threads to gain access to system
services.
• In a multi-threaded system such as AIX:
  - The process is the swappable entity.
  - The thread is the schedulable entity.

Thread mapping models

• User threads are mapped to kernel threads by the threads library. The way this mapping is done is called the thread model. There are three possible thread models, corresponding to three different ways to map user threads to kernel threads:
  - M:1 model
  - 1:1 model
  - M:N model
• The AIX Version 4.1 and later threads support is based on the OSF/1 libpthreads implementation. It supports what is referred to as the 1:1 model. This means that for every thread visible in an application, there is a corresponding kernel thread. Architecturally, it is possible to have an M:N libpthreads model, where "M" user threads are multiplexed on "N" kernel threads. This is supported in AIX 4.3.1 and AIX 5L.
• The mapping of user threads to kernel threads is done using virtual processors. A virtual processor (VP) is a library entity that is usually implicit. For a user thread, the virtual processor behaves as a CPU for a kernel thread. In the library, the virtual processor is a kernel thread or a structure bound to a kernel thread.
• The libpthreads implementation is provided for application developers to develop portable multi-threaded applications. The libpthreads.a library has been written as per the POSIX 1003.4a Draft 10 specification in AIX 4.3. Previous versions of AIX support the POSIX 1003.4a Draft 7 specification. libpthreads is a linkable user library that provides user space threads services to an application. The libpthreads_compat.a library provides the POSIX 1003.4a Draft 7 pthreads model on AIX 4.3.
Threads Models
M:1 threads model

In the M:1 model, all user threads are mapped to one kernel thread, and all user threads run on one VP. The mapping is handled by a library scheduler. All user thread programming facilities are completely handled by the library. This model can be used on any system, especially on traditional single-threaded systems.

(Figure: M:1 threads model. The threads library scheduler multiplexes all user threads onto a single VP, backed by one kernel thread.)
1:1 threads model
In the 1:1 model, each user thread is mapped to one kernel thread, and each user thread runs on one VP. Most of the user thread programming facilities are directly handled by the kernel threads.

(Figure: 1:1 threads model. Each user thread has its own VP in the threads library, each backed by its own kernel thread.)
M:N threads model
In the M:N model, all user threads are mapped to a pool of kernel threads, and all user threads run on a pool of virtual processors. A user thread may be bound to a specific VP, as in the 1:1 model. All unbound user threads share the remaining VPs. This is the most efficient and most complex thread model; the user thread programming facilities are shared between the threads library and the kernel threads.

(Figure: M:N threads model. The library scheduler multiplexes user threads over a pool of VPs, each backed by a kernel thread.)
Thread states
Thread states

In AIX, the kernel allows many threads to run at the same time, but there can only be one thread executing on each CPU at a time. The thread state is kept in t_state in the thread table (for detailed information, look in the /usr/include/sys/thread.h file).

Each thread can be in one of the following seven states:
• Idle
• Ready to run
• Running
• Sleeping
• Stopped
• Swapped
• Zombie
Idle state
When processes and threads are being created, they are first in the idle
state. This state is temporary until the fork mechanism is able to allocate
all of the necessary resources for the creation and fill in the thread table for
a new thread.
Ready to run

Once the new thread creation is completed, it is placed in the ready-to-run state. The thread waits in this state until it is run. When the thread is running, it continues to run until it has used a time slice, gives up the CPU, or is preempted by a higher priority thread.
Running thread

A thread in the running state is the thread executing on the CPU. The thread state will change between running and ready to run until the thread finishes execution; the thread then goes to the zombie state.
Sleeping
Whenever the thread is waiting for an event, the thread is said to be
sleeping.
Stopped
A stopped thread is a thread stopped by the SIGSTOP signal. Stopped
threads can be restarted by the SIGCONT signal.
Swapped
Though swapping takes place at the process level and all threads of a
process are swapped at the same time, the thread table is updated
whenever the thread is swapped.
Zombie

The zombie state is an intermediate state for the thread, lasting only until all resources owned by the thread are given up.
State transitions for AIX threads

The illustration shows the states of AIX threads. Threads typically change between running, ready to run, sleeping, and stopped during the lifetime of the thread.

(Figure: thread state transitions. A thread being created at fork() becomes ready to run, then moves among running, sleeping, swapped, and stopped; a finished thread becomes a zombie, then non-existing.)
Thread Management
Thread / process relationship

• The diagram below shows how the process shares most of the data among the threads; although each thread has its own copy of the registers, some kernel threads have specific data, and therefore have a private stack. Thus, data can be passed between threads via global variables.
• A conventional single-threaded UNIX process can only harm itself (if incorrectly coded).
• All threads in a process share the same address space, so in an incorrectly coded program, one thread can damage the stack and data areas associated with other threads in that process.
• Except for such areas as explicitly shared memory segments, a process cannot directly affect other processes.
• There is some kernel data that is shared between the threads, but the kernel also maintains thread-specific data.
• Per-process data that is needed even when the process is swapped out is kept in the pvproc structure. The pvproc structure is pinned.
• Per-process data that is needed only when the process is swapped in is kept in the user structure.
• Per-thread data that is needed even when the process is swapped out is kept in the pvthread structure. The pvthread structure is pinned.
• Per-thread data that is needed only when the process is swapped in is kept in the uthread structure.
Data placement overview

(Figure: each thread has its own registers, its own stack, and its own kernel thread data; the threads of a process share the program code, data, BSS, and the kernel's per-process data.)
Process swapping
Process swapping
Thread Scheduling
Thread scheduling

Scheduling and dispatching is the ability to assign CPU time to the threads in the system in an efficient and fair way. The problem is to design the system to handle many simultaneous threads and at the same time still be responsive to events.

Clock ticks and time slices

The division of time among the threads on the AIX system relies on clock ticks. Every 1/100 of a second, or 100 times a second, the dispatcher is called and does the following:
• Increases the running tick counter for the running process.
• Scans the run queues for the thread with the highest priority.
• Dispatches the most favored thread.

Once every second the scheduler wakes up and recalculates the priority of all threads.
Thread priority

• AIX priority has 128 levels (0-127) that are called run queue levels.
• The higher the run queue level, the lower the priority.
• Priority 127 can only be used by the wait process.
• User processes can have their priority changed from -20 to +20 levels (renice).
• User processes are in the range 40 - 80.
• A clock tick interrupt decreases thread priority.
• The scheduler (swapper) increases thread priority.

The priority is based on the base priority level, the initial nice value, the renice value, and a penalty; a higher value means a lower priority:

  Penalty based on runtime
  Renice value (-20 - +20)
  Nice value (default = 20)
  Base priority (default value = 40)
Thread dispatching
• The dispatcher chooses the highest priority thread to execute.
• Threads are the dispatchable unit for the AIX scheduler.
• Each thread has its own priority (0-127) and scheduling algorithm.
• There are three scheduling algorithms, described below and illustrated in the sketch that follows:
  - SCHED_RR (Round Robin)
  - SCHED_FIFO
  - SCHED_OTHER

SCHED_RR scheduling algorithm

• SCHED_RR
  - This is a Round Robin scheduling mechanism in which the thread is time-sliced at fixed priority.
  - This scheme is similar to creating a fixed priority, real time process.
  - The thread must have root authority to be able to use this scheduling mechanism.

SCHED_FIFO scheduling algorithm

• SCHED_FIFO
  - A non-preemptive scheduling scheme.
  - The thread runs at fixed priority and is not time-sliced.
  - It will be allowed to run on a processor until it voluntarily relinquishes it by blocking or yielding.
  - A thread using SCHED_FIFO must also have root authority to use it.
  - It is possible to create a thread with SCHED_FIFO that has a high enough priority that it could monopolize the processor.

SCHED_OTHER scheduling algorithm

• SCHED_OTHER
  - The default AIX scheduling.
  - Priority degrades with CPU usage.
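As a minimal sketch of selecting one of these policies from a threaded program (the priority value 50 is illustrative; SCHED_FIFO requires root authority, as noted above):

#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg)
{
    return NULL;   /* illustrative empty thread body */
}

int main(void)
{
    pthread_attr_t attr;
    struct sched_param sp;
    pthread_t tid;

    pthread_attr_init(&attr);
    /* Use the attributes below instead of inheriting the creator's policy */
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    /* Fixed-priority, non-time-sliced scheduling */
    pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
    sp.sched_priority = 50;
    pthread_attr_setschedparam(&attr, &sp);

    if (pthread_create(&tid, &attr, worker, NULL) != 0)
        perror("pthread_create");
    else
        pthread_join(tid, NULL);

    pthread_attr_destroy(&attr);
    return 0;
}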
Process and Thread Scheduling

Thread scheduling
Like most UNIX systems, AIX uses a multilevel round-robin model for
process and thread scheduling. Processes and threads at the same priority
level are linked together and placed on a run queue. AIX has 128 run
queues, 0-127, each representing one of the 128 possible priorities.
When a process starts running, it is given a priority based on its nice
value, and the process is linked with other processes at the same level.
As the process runs and consumes CPU resources, its priority decreases
until it finishes, or until the priority is so low that other processes
get CPU time. If a process does not run, its priority increases until it
can get CPU time again. The drawing below illustrates the 128 run queue
levels and six processes: three at priority 60 and three at 70.
(Drawing: run queue levels 0 to 127, with three processes queued at level
60, three at level 70, and the idle process at level 127.)
Thread scheduling algorithm
The scheduler uses the following algorithm to calculate priorities for
the running processes:
For every clock tick (1/100 sec.):
• The running thread is charged for one tick.
• The dispatcher is called, scans the run queues, and dispatches the one
with the highest priority.
The scheduler runs every second:
• It calculates a new priority for all threads.
• For each thread, it sets the number of used ticks to
(used ticks) * d/32, where 0 <= d <= 32.
The algorithm for calculating the priority is:
• new_nice = 60 + 2*nice, if nice > 0
• new_nice = 60 + nice, if nice <= 0
• priority = (used ticks) * (new_nice + 4)/64 * r/32 + new_nice,
where 0 <= r <= 32
Invariants:
-20 <= nice <= 20
0 <= r <= 32
0 <= d <= 32
0 <= ticks <= 120
0 <= p <= 126
The r and d values control how a process is impacted by its run time: r
controls how severely a process is penalized for used CPU time, while d
controls how fast the system "forgives" previous CPU consumption. Both
can be set with the schedtune [-r r_val] [-d d_val] command.
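A minimal worked sketch of these formulas in C follows; the default values
r = d = 16 are an assumption for illustration only, and the names are not
kernel symbols.

#include <stdio.h>

static int sched_r = 16;      /* assumed default penalty weight (0..32) */
static int sched_d = 16;      /* assumed default decay weight (0..32)   */

static int new_nice(int nice)             /* -20 <= nice <= 20 */
{
    return (nice > 0) ? 60 + 2 * nice : 60 + nice;
}

static int priority(int ticks, int nice)
{
    int n = new_nice(nice);
    int p = ticks * (n + 4) / 64 * sched_r / 32 + n;
    return (p > 126) ? 126 : p;           /* invariant: 0 <= p <= 126 */
}

int main(void)
{
    int ticks = 100, nice = 0;
    printf("priority at 100 ticks: %d\n", priority(ticks, nice));
    ticks = ticks * sched_d / 32;         /* once-a-second decay */
    printf("priority after decay:  %d\n", priority(ticks, nice));
    return 0;
}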
The Dispatcher
The dispatcher
The dispatcher runs under the following circumstances:
• A time interval has passed. (1/100 sec.)
• A thread voluntarily gives up the CPU.
• A thread is returning to user mode from kernel mode.
• Another thread has been made runnable (awake).
Context switch
procedure
The context switch procedure consists of:
• Saving the machine state of the departing thread.
• Recalling the machine state of the selected thread.
• Mapping the process private data and other virtual space of the
selected thread.
• Switching the CPU to execute with the selected thread's registers.
Context switch
The procedure switches context to make a different thread execute:
• As a thread executes in the CPU, its priority becomes less favored.
• The scheduler re-calculates the priority of the executing thread and
measures the new priority against the priorities of the threads that
are runnable.
• In AIX, the run queues are divided into 128 separate priority queues
with priority 0 being the most-favored priority and priority 127 the
least-favored.
• Threads at the same priority level are on the same run queue for
quick determination of the next runnable process.
• All of the threads on a more-favored priority queue run before
threads on a less-favored priority queue.
• Queue 127 contains the wait threads. There is one wait thread per
CPU, and these run only when there are no other runnable threads.
Thread preemption
In AIX, the kernel allows preemption of both user and kernel threads.
• Preemption allows the kernel to respond to real-time processes
much faster.
• On most UNIX systems, when a thread is in kernel mode, no other
thread can execute until the thread in kernel mode returns to user
mode or voluntarily gives up the CPU.
• In AIX, other higher priority threads may preempt threads running in
kernel mode.
• This feature supports real-time processing where a real-time process
must respond to an action immediately.
• Some sections of code have been determined to be critical sections
where preemption is not possible because preemption may cause
inconsistent kernel data structures. These sections are protected
either by preventing preemption (by disabling interrupts) or by
holding a lock.
• The kernel can use locks to serialize access to global kernel data that
could be corrupted by preemption.
• The thread holding the lock for a piece of data is guaranteed to run at
a higher priority than the set of threads waiting for the lock. This is
called priority promotion.
• However, other threads running at higher priority and not asking for
the lock on the same piece of data can preempt the locking thread.
The MP
scheduler/
dispatcher
• Hard Cache Affinity - The ability to bind a thread or process to a
processor.
• Soft Cache Affinity - An attempt to run a thread or process on the same
processor.
• Support for funneling threads - Funneling is a method to run non-MP-safe threads on MP hardware.
AIX scheduling uses a time-sharing, priority-based scheduling algorithm.
This algorithm favors threads that do not consume large amounts of
processor time, as the amount of processor time used by the thread is
included in the priority recalculation equation. Fixed priority threads
are allowed; the priority of these threads does not change regardless of
the amount of processor time used.
There is one global priority-based, multi-level run queue (runq). All
threads that are runnable are linked into one of these runq entries.
There are currently 128 priorities (0-127). The scheduler periodically
scans the list of all active threads and recalculates thread priorities
based on the amount of processor time used.
AIX run queues
Multiple run queues (MRQ)
• AIX 4.3.3 uses multiple specialized run queues instead of just one
global queue.
• Each processor has its own local run queue, and each node has a global
run queue.
• Processors dispatch threads from both the local and the global run
queue.
(Diagram: three nodes, each with four CPUs - Node 0 with CPUs 0-3, Node 1
with CPUs 4-7, Node 2 with CPUs 8-11. Each CPU has its own local run
queue, RQ 0 through RQ 11, and there is a global run queue.)
The global run queue is used for:
• Fixed priority POSIX-compliant threads
• Load balancing of newly created threads and very low priority threads
• Processes executed with the RT_GRQ=ON variable exported
Threads that are bound to a processor are never placed on the global run
queue.
Placing fixed priority threads on the global run queue guarantees strict
priority-order execution. Load balancing is achieved with new and low
priority threads: new threads can be picked up by any CPU because they
have not run yet, so the cache penalty is small, and low priority threads
can easily be moved because they do not have much data in cache.
If a process has the variable RT_GRQ=ON set, it sacrifices cache
optimization for the best possible real-time behavior; that is, the
process stays on the global run queue and runs on the first available
CPU. Threads can be bound to one CPU and will then never be on the global
run queue.
Local run queues
• Local run queues reduce lock contention
• Shorter and simpler run queue scans
• Stronger implied affinity
• Reduced cache contention in the kernel
Each local run queue has its own lock. This reduces lock contention and
makes the lock handling faster. The local queue makes the scan faster
because there is no special handling of bound threads, and handling of
soft affinity is simple with one CPU per run queue. Kernel cache
contention is reduced because each CPU updates its own dispatcher state,
and the structures for threads in the local run queue are more likely to
be in the local cache.
Initial load
balancing
When new unbound threads are created, they should initially be placed so
that the system load remains balanced. This has to be handled differently
for new processes and for additional threads in an existing process.
If the thread is the initial thread in a new process:
• Choose a run queue; round-robin first among all nodes, and secondly
within the run queues of the chosen node.
• Look for an idle CPU in the chosen run queue.
• Look for an idle CPU in the chosen node.
• Look for an idle CPU anywhere on the system.
• Otherwise, add to the global run queue (round robin).
If the new thread is an additional thread for an already existing process:
• Choose a run queue; round-robin among the run queues in the
process’ node.
• Look for an idle CPU in the chosen run queue.
• Look for an idle CPU in the chosen node.
• Otherwise, add to global run queue for this node.
Idle load balancing
Idle load balancing occurs when a CPU goes idle and starts looking for
work in other run queues. The criteria for permitting a thread steal are:
• Foreign run queue threads are greater than 1/4 load factor of the
node.
• There is at least one stealable (unbound) thread available.
• There is at least one unstealable (bound) thread available.
• The number of threads stolen from this run queue during the current
clock interval is less than 20.
• Should multiple run queues meet these criteria, the one with the most
threads will be used.
• If this run queue’s lock is available, its best priority unbound thread
will be stolen, assuming its p_lock_d is available.
• Note that failure to lock the run queue or the thread will cause the
dispatcher to loop through waitproc, thereby opening up a periodic
enablement window.
Process and Thread data structures
Process and
thread
management
data structure
overview
Four main data structures are used for process management:
• proc
• thread
• user
• uthread
The figure below shows how the tables are linked together.
(Figure: via /dev/kmem, the process table holds the pvproc entries and
the thread table holds the pvthread entries; these point into the process
data in user memory - the ublock with the user area and the first
uthread, the additional uthread structures and kernel stacks, the thread
stacks, global data, and the process text segment.)
The diagram above shows that the thread structures contain pointers to
all the other structures required to run that particular thread. This
reflects the fact that the thread is the schedulable entity, and the
system must be able to access all structures from the pointers in the
thread table. The thread table entries are doubly and circularly linked
to all other threads of a particular process. Note that the ublock
structure contains the user structure plus the uthread structure for the
initial thread. The uthread structures for all other threads are in the
uthread (and kernel thread stacks) segment. The first uthread structure
is kept separate within the ublock so that the fields it contains can be
addressed directly and so that fork and exec can operate with only the
process private segment to deal with.
The proc and thread structures are maintained in the kernel extension
segment as a part of the process and thread tables of the kernel. Every
in-use entry in these tables is pinned, such that the information there
is always available to the kernel. The user and uthread structures are
maintained in the process private segment of the corresponding process.
These structures are only pinned when the process is not swapped out;
when the process is swapped out, they are unpinned.
Process and
thread links
The previous diagram shows how the tables are linked together. Each
process in the system has an entry in the process table. Each process
entry has a pointer to the list of threads for the process, and the thread list
has a pointer back to the process table. The thread list is a double circular
linked list of all the threads owned by the process, and the pvthreads
entries point to the user area and uthread field in the process data area.
Proc structure fields and pointers
The following is an extract of the fields in the proc table to show the
pointers. Note that each entry in the proc table starts with a pointer to
a pvproc structure (the pvproc structure is discussed later). The proc
table holds the number of threads, and the pvproc table has a pointer,
pv_threadlist, that points to the first thread for the process in the
thread table. A complete listing of the structures can be found in the
file /usr/include/sys/proc.h.
struct proc {
        struct pvproc   *p_pvprocp;     /* my global process data      */
        pid_t           p_pid;          /* unique process identifier   */
        uint            p_flag;         /* process flags               */
        /* thread fields */
        ushort          p_threadcount;  /* number of threads           */
        ushort          p_active;       /* number of active threads    */
        .......
};

struct pvproc {
        /* identifier fields */
        pid_t           pv_pid;         /* unique process identifier   */
        pid_t           pv_ppid;        /* parent process identifier   */
        /* thread fields */
        struct pvthread *pv_threadlist; /* head of list of threads     */
        .......
};
Thread table fields and links to the process table and the ublock
Like the process table, the thread table is divided into two tables: a
pvthread table and a thread table. The complete structures can be found
in the file /usr/include/sys/thread.h. The structures listed here contain
only selected variables.
The thread and pvthread structures have a pointer (*tv_pvprocp) back to
the owner process and pointers (*uthreadp and *userp) to the user thread
and user area, and the threads are linked in a circular doubly linked
list (with the *prevthread and *nextthread fields).
struct thread {
        struct pvthread *t_pvthreadp;      /* my pvthread struct           */
        struct t_uaddress {
                struct uthread *uthreadp;  /* local data                   */
                struct user    *userp;     /* owner process' ublock (const)*/
        } t_uaddress;
        ......

struct pvthread {
        /* identifier fields */
        tid_t           tv_tid;            /* unique thread identifier     */
        /* related data structures */
        struct thread   *tv_threadp;       /* my thread struct             */
        struct pvproc   *tv_pvprocp;       /* owner process (global data)  */
        struct {
                struct pvthread *prevthread; /* previous thread            */
                struct pvthread *nextthread; /* next thread                */
        } tv_threadlist;                   /* circular doubly linked list  */
        ...
Process and Thread data structure addresses
Process and
thread tables’
addresses in
the kernel
AIX 5L has a 64-bit kernel, and addresses are 64 bits long. Both the
process and thread tables are kept in the kernel extension segment at
fixed addresses.
• The proc table starts at 0xF100008080000000.
• The thread table starts at 0xF100008090000000.
AIX 4.3.3 has a 32-bit kernel, and addresses are only 32 bits long. The
values for an AIX 4.3.3 32-bit kernel are:
• The proc table starts at 0xe2000000.
• The thread table starts at 0xe6000000.
Both tables are maintained as arrays:
• Entries are called "slots."
• The slot number can be derived from bits 8 - 23 of the PID or TID (see
the example and the list from the process table on an AIX 5L POWER
system below).
• A generation count for each slot is incremented every time a PID or
TID is created in that slot.
Looking at AIX 4 process structures with kdb
Looking at the process table with kdb, we can tell that there is a
difference between AIX 4 and AIX 5. List the process table with the p
subcommand in kdb. The process table starts at address proc, and the
process slot used by kdb is 7936, which is offset by 326000 (hex) from
the start of the process table. The size of a proc entry is therefore
326000 (hex) / 7936 (dec) = 416 (dec) = 1A0 (hex).
            SLOT NAME     STATE  PID    PPID  PGRP   UID   EUID  ADSPACE  CL
proc+326000 7936*kdb_up   ACTIVE 1F001A 0123A 1F001A 00000 00000 00001302 00
The size of each process slot can be verified with the p * subcommand. In
the following list, each slot is offset by 1A0 bytes.

            SLOT NAME     STATE  PID   PPID  PGRP  UID   EUID  ADSPACE  CL
proc+000000    0 swapper  ACTIVE 00000 00000 00000 00000 00000 0000780F 00
proc+0001A0    1 init     ACTIVE 00001 00000 00000 00000 00000 0000500A 00
proc+000340    2 wait     ACTIVE 00204 00000 00000 00000 00000 00008010 00
proc+0004E0    3 netm     ACTIVE 00306 00000 00000 00000 00000 0000B817 00
And the location of proc in memory can be retrieved with the nm
subcommand.
(0)> nm proc
Symbol Address : E2000000
TOC Address : 001F9EF8
Looking at AIX 5 process structures with kdb
The same lists will look different on an AIX 5 system. First, on a list of the
proc table, we can tell that the structure used is no longer proc but pvproc,
and each pvproc slot is 6680 (hex) / 41 (dec) = 280 (hex) long.
(0)> p
              SLOT NAME     STATE  PID     PPID    ADSPACE          CL #THS
pvproc+006680   41*kdb_64   ACTIVE 0002996 00037D8 00000000200040AA  0 0001
Listing the first three slots shows that the offset is 280 (hex) between
the slots.

(0)> p *
              SLOT NAME     STATE  PID     PPID    ADSPACE          CL
pvproc+000000    0 swapper  ACTIVE 0000000 0000000 0000000000000B00  0
pvproc+000280    1 init     ACTIVE 0000001 0000000 000000000000E2FD 00
pvproc+000500    2 wait     ACTIVE 0000204 0000000 0000000000001B02  0
The pvproc address in memory is found using the nm command.
(0)> nm pvproc
Symbol Address : F100008080000000
TOC Address : 0046AC80
(0)>
AIX 5 process and thread data structures
Process data structure changes in AIX 5
The changes in the process table are made to support the NUMA
(Non-Uniform Memory Access) structure in AIX 5L.
A NUMA system consists of one or more separate nodes connected by a very
fast interconnect. The nodes operate as one computer, running one copy of
AIX. The name NUMA refers to the fact that the memory access time is not
constant: a CPU accessing memory on its own node gets the memory fast
(accessed via the local bus), while a CPU accessing remote memory has to
get the data from a remote node, and the access is slower.
In order to make the system efficient, we want to keep all parts of a
process close together so that memory access is fast; therefore, the proc
structure has been rearranged and divided into two parts: struct pvproc
holds the global process data, and the rest is still in struct proc. This
change allows the NUMA system to move processes around between CPUs or
"QUADS" and still have most of the process table local to the process.
However, some of the process table must be kept at the main node in a
NUMA system.
Because of things like shared memory, processes can form migration
groups. These are groups of processes, shared memory, files, and so on,
that are logically attached to each other. The most common form of
logical attachment involves one item being intrinsically tied in with
another process. For example, a process that creates a shared memory
segment is logically attached to it. If another process uses the shared
memory segment, it is logically attached to it, and as a result is in a
migration group with the first process. Additionally, the user is allowed
to create logical attachments between items through the NUMA APIs.
The proc structure in an AIX 5 system starts with a pvproc structure and
continues with the process flags. The start of the structure is listed
here; for a full listing, see the file /usr/include/sys/proc.h.
struct proc {
        struct pvproc   *p_pvprocp;     /* my global process data    */
        pid_t           p_pid;          /* unique process identifier */
        uint            p_flag;         /* process flags             */
Process ID (PID) and process table slot number
The process ID or thread ID is composed of a slot number and a generation
count. Bit 0 tells us whether it is a PID or a TID (all PIDs are even).
The next 7 bits are the generation count, which prevents the rapid reuse
of process IDs. Bits 8 to 23 are the slot number in the process table.
This can be verified from the pvproc list, where bits 8-23 of the PID
field match the process slot number in the pvproc table.
Bits 63-24   Bits 23-8     Bits 7-1           Bit 0
0000000      Slot Number   Generation Count   0 if PID, 1 if TID
Process table example from an AIX 5L system:

              SLOT NAME      STATE   PID
pvproc+001180    7 gil       ACTIVE  000070E
pvproc+001400    8 wlmsched  ACTIVE  0000810
pvproc+001900   10 shlap64   ACTIVE  0000AD2
pvproc+001B80   11 syncd     ACTIVE  0000B4E
pvproc+002080   13 lvmbb     ACTIVE  0000D22
pvproc+002580   15 errdemon  ACTIVE  0000F50
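As a quick check of the layout, bits 8-23 of each PID above match the slot
number. A minimal sketch (illustrative user-space code, not kernel source)
that decodes an ID this way:

#include <stdio.h>

int main(void)
{
    unsigned long id = 0x070E;                     /* the gil PID above */
    unsigned long is_tid     = id & 0x1;           /* bit 0: PID or TID */
    unsigned long generation = (id >> 1) & 0x7F;   /* bits 1-7          */
    unsigned long slot       = (id >> 8) & 0xFFFF; /* bits 8-23         */

    printf("%s: slot %lu, generation %lu\n",
           is_tid ? "TID" : "PID", slot, generation);
    return 0;
}

Run on the value 0x070E, this prints slot 7, matching the gil entry.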
What is new in AIX 5
Priority boost
Priority boost is a facility that ensures that higher priority processes
get CPU time and that the time such processes have to wait for lower
priority processes is minimized. Priority boost was implemented in AIX
4.3 and is further enhanced in AIX 5.
The background for priority boost is demonstrated in the following
scenario. Assume that we have three resources, locks A, B, and C. Two
processes, process 1 and process 2, both want access to resource B, but
process 1, a low priority process, has the lock, and process 2 has to
wait. However, another process (process 3) has higher priority than
process 1 and gets most of the CPU time. In this scenario, the high
priority process 2 is waiting on the lower priority process 1 because it
holds a lock. To resolve this situation, priority boost was added to AIX
4.3.
(Figure: locks A, B, and C; process 1 has low priority, process 2 has
high priority, and process 3 has medium priority.)
Priority boost increases the priority of processes holding locks:
• When a process has to wait for a lock, it increases the priority of the
process that holds the lock to its own priority.
• Other processes waiting for the same lock also get increased priority.
• Only the kernel thread's priority is boosted; as soon as the altered
process leaves the kernel, the priority is set back to the original
value.
User area in
User64
The user structure is much larger in the 64-bit kernel than in the 32-bit
kernel. To improve efficiency and performance in the 32-bit kernel, two
structures are maintained: a 32-bit and a 64-bit. This ensures that the
kernel does not copy data areas which are not used.
What is
system hang
detection and
why do we
need it?
Runaway processes and hanging systems are hard to diagnose on a locked
system, and methods to detect the runaway process are needed.
• Misbehaving high priority applications are a recurring problem.
• When one or more processes or threads are stuck in the running state,
they can prevent any other lower priority threads from running.
• If the priority is above the default user priority, the machine can
appear to be hung.
• The hung situation is very difficult to debug, since the administrator
cannot tell what is happening on the system.
The solution to the hang problem is system hang detection. It is
implemented by the shdaemon, which runs at the highest user priority.
shdaemon monitors the lowest priority process run on the system in a
given period of time, and if the system fails to run a process below a
given priority threshold, an action is taken. System hang detection can
be configured with the shconf command, but the easiest way is to use the
smit panel. There are five distinct actions that can be taken, and for
each of them a timeout value and a threshold priority value can be set:
Log an Error in the Error Log                    [disable]
    Detection Time-out                           [120]
    Process Priority                             [60]
Display a warning message on a console           [disable]
    Detection Time-out                           [120]
    Process Priority                             [60]
    Terminal Device                              [console]
Launch a recovering getty on a console           [enable]
    Detection Time-out                           [120]
    Process Priority                             [60]
    Terminal Device                              [console]
Launch a command                                 [disable]
    Detection Time-out                           [120]
    Process Priority                             [60]
    Script                                       [ ]
Automatically REBOOT system after Detection      [disable]
    Detection Time-out                           [300]
    Process Priority                             [39]
Signals
What are
signals?
• Signals are a way of notifying a process or thread of a system event.
• Signals also provide a means of interprocess communication.
• A signal is represented as a single bit within a bit field in the kernel.
• The bit field used for signals is 64 bits wide, but only about 40
signals are defined.
• AIX 4.3.3 defines only 37 signals for the user.
• AIX 5 has 44 defined signals, but three of them are not used.
Types of
signals
• There are two types of signals in AIX: synchronous and asynchronous.
• Synchronous signals are delivered only to a thread, usually as a result
of an error condition or exception caused by the thread; for example,
SIGILL is delivered to a thread that tries to execute an illegal
instruction.
• Asynchronous signals are generated externally to the current thread or
process.
• Asynchronous signals may be delivered to a process (for example, with
kill()) or to another thread within the same process (for example, with
thread_kill() or tidsig()).
Signal types
Signals may be generated for a number of reasons:
• An exception, such as a segmentation violation
• An interrupt, such as a clock tick
• An alarm, as when a timer expires
• Process management, as when a child process dies
• Device I/O, as when data is ready
• Signals from another process
Signal
mechanism
When an event triggers a signal, the kernel sets the corresponding bit in
the pending signal bit field for the process (p_sig) or thread (t_sig).
• All signals are enabled by default, and when returning from the kernel,
threads look for pending signals.
• If the signal is being ignored (masked), nothing happens.
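The kernel keeps these bits in p_sig and t_sig, which applications never
touch directly, but the same one-bit-per-signal idea is visible through
the portable sigset_t interfaces; a minimal sketch:

#include <signal.h>
#include <stdio.h>

int main(void)
{
    sigset_t pending;

    sigemptyset(&pending);            /* clear the whole bit field */
    sigaddset(&pending, SIGINT);      /* set the bit for SIGINT    */
    printf("SIGINT bit is %s\n",
           sigismember(&pending, SIGINT) ? "set" : "clear");
    return 0;
}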
Signal handling
Signal delivery
When a signal has been generated but not yet handled, it is said to be
pending.
• Pending signals are detected when returning from a system call.
• Pending signals are detected when resuming in user mode.
• Pending signals are detected entering or during an interruptible
sleep.
• Signals may be caught, blocked or ignored by a process.
Signal
handling
Signal handling is done at the process level and signal masking is done at
the thread level. That is, each thread in a process must use the signal
handler set up by the process, but each has its own signal mask.
• If a pending signal is not specifically handled by the process, it is
delivered to all threads in the process.
• If the signal is handled by the process, the signal is delivered to a
thread that is not blocking the signal.
• If all threads are blocking a signal, it is left pending for the process
until one thread unmasks the signal or the signal is removed from the
pending list.
• If more than one signal is pending, only one is chosen for delivery at
a time.
• When a signal is being handled, it is moved to the p_cursig or
t_cursig field in the pvproc or pvthread structure.
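A minimal sketch of this split, assuming standard POSIX behavior: the
handler is installed once for the whole process, while each thread
controls its own mask with pthread_sigmask(). Which unblocked thread
receives the signal is up to the system.

#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

void handler(int sig)
{
    /* process-wide handler; keep handler work async-signal safe */
}

void *worker(void *arg)
{
    sigset_t mask;

    sigemptyset(&mask);
    sigaddset(&mask, SIGUSR1);
    pthread_sigmask(SIG_BLOCK, &mask, NULL);  /* this thread blocks it */
    sleep(2);
    return NULL;
}

int main(void)
{
    pthread_t tid;

    signal(SIGUSR1, handler);         /* one disposition per process */
    pthread_create(&tid, NULL, worker, NULL);
    sleep(1);
    kill(getpid(), SIGUSR1);          /* goes to a non-blocking thread */
    pthread_join(tid, NULL);
    return 0;
}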
Signal handler
routines
There is a default system handler for all signals, but most signals can
be given a local handler routine, or the signal can be ignored or
blocked.
• SIGKILL and SIGSTOP cannot be handled by a local routine; these signals
are always handled by the system default routine.
• SIGKILL and SIGSTOP cannot be blocked; the process always handles the
signal.
Signal actions
The default action for a signal depends on the signal, but may be one of
the following:
• Abort: This will generate a core dump and terminate the process.
• Exit: This will terminate the process without generating a core dump.
• Ignore: The signal is ignored.
• Stop: This action will suspend the process or thread.
• Continue: This action will resume a suspended process or thread.
Signals
SIGHUP      1   /* hangup, generated when terminal disconnects */
SIGINT      2   /* interrupt, generated from terminal special char */
SIGQUIT     3   /* (*) quit, generated from terminal special char */
SIGILL      4   /* (*) illegal instruction (not reset when caught) */
SIGTRAP     5   /* (*) trace trap (not reset when caught) */
SIGABRT     6   /* (*) abort process */
SIGEMT      7   /* EMT instruction */
SIGFPE      8   /* (*) floating point exception */
SIGKILL     9   /* kill (cannot be caught or ignored) */
SIGBUS     10   /* (*) bus error (specification exception) */
SIGSEGV    11   /* (*) segmentation violation */
SIGSYS     12   /* (*) bad argument to system call */
SIGPIPE    13   /* write on a pipe with no one to read it */
SIGALRM    14   /* alarm clock timeout */
SIGTERM    15   /* software termination signal */
SIGURG     16   /* (+) urgent condition on I/O channel */
SIGSTOP    17   /* (@) stop (cannot be caught or ignored) */
SIGTSTP    18   /* (@) interactive stop */
SIGCONT    19   /* (!) continue (cannot be caught or ignored) */
SIGCHLD    20   /* (+) sent to parent on child stop or exit */
SIGTTIN    21   /* (@) background read attempted from ctl terminal */
SIGTTOU    22   /* (@) background write attempted to control terminal */
SIGIO      23   /* (+) I/O possible, or completed */
SIGXCPU    24   /* cpu time limit exceeded (see setrlimit()) */
SIGXFSZ    25   /* file size limit exceeded (see setrlimit()) */
SIGMSG     27   /* input data is in the ring buffer */
SIGWINCH   28   /* (+) window size changed */
SIGPWR     29   /* (+) power-fail restart */
SIGUSR1    30   /* user defined signal 1 */
SIGUSR2    31   /* user defined signal 2 */
SIGPROF    32   /* profiling time alarm (see setitimer) */
SIGDANGER  33   /* system crash imminent; free up some page space */
SIGVTALRM  34   /* virtual time alarm (see setitimer) */
SIGMIGRATE 35   /* migrate process */
SIGPRE     36   /* programming exception */
SIGVIRT    37   /* AIX virtual time alarm */
SIGKAP     60   /* keep alive poll from native keyboard */
SIGGRANT   SIGKAP /* monitor mode granted */
SIGRETRACT 61   /* monitor mode should be relinquished */
SIGSOUND   62   /* sound control has completed */
SIGSAK     63   /* secure attention key */
Signal data
structures
The file /usr/include/sys/proc.h defines the proc structure, and the
following information about signals is kept in the proc structure:

/* Signal information */
sigset_t        p_sig;          /* pending signals */
sigset_t        p_sigignore;    /* signals being ignored */
sigset_t        p_sigcatch;     /* signals being caught */
sigset_t        p_siginfo;      /* keep siginfo_t for these */
• A signal is a bit set in an array with enough bits set aside for each signal
number.
• The bits are turned on by kernel code as the process is executing in
kernel mode or by the processing of interrupts that are determined to be
assigned to the process.
• Signals can also be sent from one process to another process through
the use of system calls.
• Signals are delivered to the process when:
- The process returns to the User Protection Domain.
- There is a transition from ready-to-run state to running state.
• To deliver a signal, the kernel checks whether the process is receiving
the signal.
• If the signal is being received, the kernel sets the receiving process to
perform the appropriate action.
• The appropriate action may be to invoke the signal handler for that
particular signal, kill the process, or ignore the signal.
• If the signal is blocked by the process, it is left pending until the process
is no longer blocking the signal.
• Signals can be delivered to a group of processes.
• Signals can be sent to a process or a thread.
• A thread receives the signal if:
- The signal is synchronous and attributable to a particular thread,
for example SIGSEGV.
- The signal is sent by a thread in the same process via the
thread_kill system call.
• Otherwise, the signal goes to the process.
Signals to a process
• If a signal is not being caught, the signal action applies to the
entire process.
- Every thread is terminated, stopped, or continued, depending on the
action.
• If a signal is being caught:
- Pick one thread that is not blocking the signal to receive it.
- If all threads are blocking the signal, it is left pending on the
process.
Exercises
Exercises after
this module
In this exercise, the student will be supplied with programs that create
processes and threads using the available thread models. The programs are
very simple, and the source will be supplied to the student. Kernel
debugging tools (running on a live kernel) are then used to interrogate
the kernel structures associated with the process and threads of the
program.
The first code example explores the fork() system call and how variables
are private to each process. The second example shows how threads are
created and how global variables are shared (because all threads share
the user space) while local variables in functions are not shared
(because those data are kept on the stack to make the procedure
reentrant). The third example is a signal handler example.
C code example to explore the fork() and wait() system calls
Use C code to create a child with the fork() system call; notice that the
variable is private to each process.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int status;
pid_t proc_id;
pid_t proc_id2;

int main(int argc, char **argv)
{
    int this = 7;                /* each process gets its own copy */

    proc_id = fork();
    if (proc_id < 0) {           /* error routine */
        printf("fork error\n");
        exit(-1);
    }
    if (proc_id > 0) {           /* parent */
        this = this + 4;
        printf("waiting for child\n");
        proc_id2 = wait(&status);
        printf("I'm the parent, variable = %d\n", this);
        exit(0);
    }
    /* proc_id == 0: child */
    printf("I'm the child process\n");
    sleep(1);
    printf("I'm the child, the variable is %d\n", this);
    printf("I'm the child terminating\n");
    exit(0);
}
C code to explore the thread system calls
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

void *mythread(void *data);

int x = 0;                      /* global: shared by all threads */

int main(void)
{
    /* This array holds the thread id for each thread */
    pthread_t tids[5];
    int i;

    /* We will now create the 5 threads. */
    for (i = 0; i < 5; i++) {
        pthread_create(&tids[i], NULL, mythread, NULL);
    }

    /* We will now wait for each thread to terminate. pthread_join
     * blocks until the specified thread finishes execution; its second
     * argument can be a pointer that receives the return value of the
     * thread. */
    for (i = 0; i < 5; i++) {
        pthread_join(tids[i], NULL);
    }
    return 0;
}

/* This is our actual thread function */
void *mythread(void *data)
{
    int v = 0;                  /* local: private to each thread */

    printf("x was %d, v was %d; now change them\n", x, v);
    if (x < 20)
        x = 444;
    if (v < 20)
        v = 444;
    printf("x is %d, v is %d\n", x, v);
    pthread_exit(NULL);
}
C sample code to explore nice and process priority
The program explores process priority and the nice value; it runs for a
long time, so there is time to look at the process table and at the nice
value with the ps command.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <math.h>

int i, ii;
double ll;
double ll1(double l1);

int main(int argc, char *argv[])
{
    i = atoi(argv[1]);              /* nice increment from the command line */
    ii = nice(i);
    ll = 1;
    for (i = 1; i < 5000; i++) {    /* long-running CPU loop */
        ll = ll1(ll);
        ll++;
    }
    return 0;
}

double ll1(double l1)
{
    int e;
    double l2, l3;

    l3 = l1;
    for (e = 1; e < 50000; e++) {
        l2 = sin(l3);
        l3 = l2 + l3;
    }
    return l3;
}
C code example to explore signal handling
The signal code sample catches signals and prints a message whenever a
signal is caught. What happens if the same signal is sent twice? And how
can this behaviour be changed?

#include <stdio.h>
#include <fcntl.h>
#include <termio.h>
#include <signal.h>
#include <unistd.h>
int i;
void sig1(int sig), sig2(int sig), sig3(int sig);

int main(void)
{
    signal(SIGHUP, sig1);
    signal(SIGINT, sig2);
    signal(SIGQUIT, sig3);
    for (i = 1; i < 100; i++) {
        sleep(15);
        printf("been sleeping for 15 sec.\n");
    }
    return 0;
}

void sig1(int sig)
{
    printf("interrupt 1 received\n");
}

void sig2(int sig)
{
    printf("interrupt 2 received\n");
}

void sig3(int sig)
{
    printf("interrupt 3 received\n");
}
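One possible answer to the question above, as a hedged sketch: with
sigaction() (rather than signal()) the behaviour while a signal is being
handled can be controlled explicitly, for example blocking further
instances during the handler, or resetting the handler after one
delivery.

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void sig2(int sig)
{
    printf("interrupt 2 received (sigaction)\n");
}

int main(void)
{
    struct sigaction sa;

    sa.sa_handler = sig2;
    sigemptyset(&sa.sa_mask);   /* extra signals to block in the handler */
    sa.sa_flags = 0;            /* or SA_RESETHAND to handle only once */
    sigaction(SIGINT, &sa, NULL);
    pause();                    /* wait for one signal */
    return 0;
}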
Unit 8. Memory Management
Objectives
After completing this unit, you should be able to describe the
common features of VMM on POWER and IA64:
• virtual memory
• page mapping
• memory objects
• VMM tuning parameters
• object types
• shared memory objects
Overview of Virtual Memory Management
Introduction
Traditionally, the memory management component of the operating
system (VMM) is responsible for managing the system’s real memory
resources. Virtual memory systems provide the capability to run programs
whose memory requirements exceed the system’s real memory resources
by allowing programs to execute when they are only partially resident in
memory and by utilizing disk to extend memory.
Memory
management
The virtual memory system divides real memory into fixed-length pages
and allocates pages to programs as they require them. Such a system
allows multiple programs to reside in memory and execute simultaneously.
The virtual memory system is responsible for keeping track of which pages
of a program are resident in memory and which are on secondary storage
(disk).
It handles interrupts from the address translation hardware in the system
to determine when pages must be retrieved from secondary storage and
placed in real memory.
When all of real memory is in use, it decides which program’s pages are to
be replaced and paged out to secondary storage.
Each time a process accesses a virtual address, the virtual address is
mapped (if not already mapped) by the VMM to a physical address where the
data is located.
Access
Protection
The VMM also provides access protection to prevent illegal access to
data. This protects programs from incorrectly accessing kernel memory or
memory belonging to other programs. Access protection also allows
programs to set up memory that may be shared between processes.
VMM on POWER as opposed to IA-64 VMM
In this lesson the common features of VMM on POWER and IA-64 are
described. For the most part, the IA-64 VMM design inherits the design
from the POWER architecture. The majority of data structures, the
serialization model, and the majority of code are common between the two.
Separate lessons will describe the POWER and IA-64 VMM contexts.
Memory Management Definitions
Introduction
The following terms relating to virtual memory concepts will be defined in
this section:
• Page
• Frame
• Address space
• Effective address
• Virtual memory
• Physical address
• Paging Space
Illustration
Follow this diagram as you read about the virtual memory concepts.
(Diagram: the effective addresses used by Process 1 and Process 2 map
into the virtual address space, which is backed by physical memory and by
paging space.)
Page
A page is a fixed-size chunk of contiguous storage that is treated as the
basic entity transferred between memory and disk. Pages are separate from
each other; they do not overlap in the virtual address space. AIX 5L uses
a fixed page size of 4096 bytes for both POWER and IA-64. The smallest
unit of memory managed by hardware and software is one page.
Frame
The place in real memory used to hold a page is called a frame. You can
think of the page as the collection of information and the frame as the
place in memory that holds that information.
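With a 4096-byte page, any address splits into a page number and an
offset within the page; a minimal illustrative sketch (the address value
is made up):

#include <stdio.h>

int main(void)
{
    unsigned long addr   = 0x20004ABC;      /* an arbitrary address */
    unsigned long page   = addr >> 12;      /* 4096 = 2^12 bytes    */
    unsigned long offset = addr & 0xFFF;    /* offset within page   */

    printf("address 0x%lx: page 0x%lx, offset 0x%lx\n",
           addr, page, offset);
    return 0;
}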
Address space
Address space is the set of addresses available to a program that it can
use to access memory. This lesson describes three types of address
space:
• Effective address space.
• Virtual address space.
• Physical address space.
Effective
Address
Effective addresses are the addresses referenced by the machine
instructions of a program or the kernel. The effective address space is
the range of addresses defined by the instruction set, 64 bits on AIX 5L.
The effective address space is mapped to different physical address
spaces or disk files for each process. Programs and processes see one
contiguous address space.
Virtual
Address
The virtual address space is the set of all memory objects that could be
made addressable by the hardware. The virtual address space is bigger
(has more address bits) than the effective address space. Processes have
access to a limited range of virtual addresses given to them by the
kernel.
Physical
Address
The physical address space is dependent on how much memory (memory chips)
is installed in the machine. The physical address space maps one-to-one
with the machine's hardware memory.
Paging space
Paging space is a disk area used by the memory manager to hold inactive
memory pages with no other home. In AIX the paging space is mainly used
to hold pages from working storage (process data pages). If a memory page
is not in physical memory, it may be loaded from disk; this is called a
page-in. Writing a modified page to disk is called a page-out.
Demand Paging
Introduction
AIX is a demand paging system. Physical pages (frames) are not allocated
for virtual pages until they are needed (referenced).
• Data is copied to a physical page only when referenced.
• Paging is done on the fly and is invisible to the user.
• Data comes from:
• A page from the page space.
• A page from a file on disk.
When a virtual address is referenced on a page that has no mapping to a
frame, the mapping is done on the fly and the page frame is loaded from
where it is mapped. The loading is invisible to the user process. Demand
paging saves much of the overhead of creating new processes, because the
pages do not have to be loaded unless they are needed: if a process never
uses parts of its virtual space, no valuable physical memory is spent on
them.
Page Faults
A page fault occurs when a program tries to access a page that is not
currently in real memory. Memory that has been recently used is kept in
real memory, while memory that has not been recently used is kept aside
in paging space.
For speed, most systems have the mapping of virtual addresses to real
addresses done in the hardware. This mapping is done on a page-by-page
basis. When the hardware finds that there is no mapping to real memory,
it raises a page fault condition. The operating system software must
handle these faults in such a way that the page fault is transparent to
the user program.
Virtual Memory
manager
The job of a virtual memory management system is to handle page faults
so that they are transparent to the thread using virtual memory addresses.
Pool of
Physical Free
Pages
A pager daemon attempts to keep a pool of physical pages free. If the
number of free pages goes below a low-water mark threshold, the pager
frees the oldest (referenced furthest back in time) pages until a
high-water mark threshold is reached.
Pageable
Kernel
AIX's kernel is pageable; only some of the kernel is in physical memory
at one time. Kernel pages that are not currently being used can be paged
out.
Pinned Pages
Some parts of the kernel are required to stay in memory because it is not
possible to perform a page-in while those pieces of code execute. These
pages are said to be pinned. The bottom halves of device drivers
(interrupt processing) are pinned. Only a small part of the kernel is
required to be pinned.
Memory Objects
Introduction
A fundamental feature of AIX 5L’s Virtual Memory Manager is the use of
addressable memory objects.
Objects
AIX 5L provides access to 256 MB objects called segments. The predominant
features of these objects are:
• All objects are broken into pages.
• Objects can be shared among processes.
• Objects can grow by adding additional pages.
• Objects can be attached to or detached from processes.
• New objects can be created or destroyed by threads in a process.
The benefit of this object-level addressing is the high degree of sharing
that can be accomplished.
Object specifier
VMM code and interfaces operate on an object specified as:
<object ID, object offset>
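Sharing an object between processes is visible at the application level
through the System V shared memory interfaces; a minimal sketch (the VMM
object handling stays hidden behind these calls):

#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
    /* create a small shared memory object and attach it */
    int id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    if (id < 0) { perror("shmget"); return 1; }

    char *p = shmat(id, NULL, 0);
    if (p == (void *)-1) { perror("shmat"); return 1; }

    p[0] = 'A';                      /* store through the mapping */
    printf("attached at %p, first byte %c\n", (void *)p, p[0]);

    shmdt(p);                        /* detach and remove the object */
    shmctl(id, IPC_RMID, NULL);
    return 0;
}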
POWER VMM design versus IA-64 design
The POWER architecture provides for efficient access to 256 MB objects
(segments in POWER terminology) in the global virtual address space. The
256 MB objects are also used in the IA-64 VMM implementation; however,
segments are implemented in software instead of hardware. The terms
"segment" and "object" have the same meaning, but keep in mind that a
"segment" on IA-64 should be considered in a software context.
Memory Object types
Introduction
There are five types of objects defined by the VMM:
• working
• persistent
• client
• log
• mapping
Working
Objects
Working objects (also called working storage and working segments) are
temporary segments used during the execution of a program for its stack
and data areas. Process data areas are created by the loader at run time
and are paged in and out of paging space. A working storage segment has
the amount of paging space allocated to its pages associated with it.
Part of the AIX kernel is also pageable and is part of working storage.
Persistent
Objects
The VMM is used for performing I/O operations for file systems.
Persistent objects are used to hold file data for the local file systems.
When a process opens a file, the data pages are paged in. When the
contents of the file change, the page is marked as modified and is
eventually paged out directly to its original disk location. File system
reads and writes occur by attaching the appropriate file system object
and performing loads and stores between the mapped object and the user
buffer. File data pages and program text pages are both part of
persistent storage; however, program text pages are read-only and are
paged in but never paged out to disk. Persistent pages do not use paging
space.
Client Objects
Client objects are used for pages of client file systems (all file system
types other than JFS). When remote pages are modified, they are marked
and eventually paged out to their original disk location across the
network. Remote program text pages (read-only pages) are paged out to
paging space, from where they can be paged in later if needed.
Log Objects
Log objects are used for writing and reading JFS file system logs during
journalling operations.
Mapping
Objects
Mapping objects are used to support the mmap() interfaces, which allow an
application to map multiple objects to the same memory segment.
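A minimal sketch of the application-level view of mapping a file with
mmap() (the file name is only an example):

#include <stdio.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/etc/hosts", O_RDONLY);   /* any readable file */
    if (fd < 0) { perror("open"); return 1; }

    /* map the first page of the file read-only into our address space */
    char *p = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    printf("first byte of mapped file: %c\n", p[0]);
    munmap(p, 4096);
    close(fd);
    return 0;
}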
Page Mapping
Introduction
This section describes the page mapping functions in the VMM.
VMM Function
The main function of the virtual memory manager is to translate effective
addresses to real addresses.
Hardware
differences
The exact procedure used by the VMM depends heavily on the processor used
by the system. As AIX 5L runs on both POWER and IA-64 processors, this
lesson describes the process in general terms. More exact descriptions of
address translation can be found in the hardware-specific lessons.
Diagram
This diagram shows the overall relationship among the major AIX data
structures involved in mapping a virtual page to a real page or to paging
space.
(Diagram: the effective address space is translated through a
hardware-specific table and the software page frame table to real memory;
the external page tables (XPT) locate pages on paging space, and the SID
table leads through the file inode to the file system.)
Hardware Page
Mapping
Hardware page mapping is determined by the processor architecture. The
processor generates a hash that is used to look up the appropriate
hardware table for the proper translation. The hardware-specific table
used on POWER is the hardware Page Frame Table (PFT); on IA-64, a Virtual
Hash Page Table (VHPT) is used.
Software Page
Frame Table
Software Page Frame Tables (SWPFT) are extensions of the hardware frame
table and are used and managed by the VMM software. The SWPFT contains
information associated with a page, such as the page-in and page-out
flags, the free-list flag, and the block number. It also contains the
device information (PDT) used to obtain the proper page from disk.
Page Faults
Page faults occur when the hardware has looked through its page frame
tables but cannot find a real page mapping for a virtual page.
A page fault causes AIX Virtual Memory Manager (VMM) to do the bulk of
its work. It handles the fault by first verifying that the requested page is
valid. If the page is valid the VMM determines the location of the page,
recovers the page if necessary and updates the hardware’s frame page
table with the location of the page. A faulted page will be recovered from
one of the following locations:
• In physical memory (but not in the hardware PFT).
• On a paging disk (working object)
• On a filesystem object (persistent object)
Protection
Fault
A protection fault occurs when the page is in memory but the process has
no rights to access it.
Page Not In Hardware Frame Table
Introduction
The size of the hardware page tables is limited; therefore, the hardware cannot satisfy all address translation requests. The VMM software must supplement the hardware tables with software-managed page tables.
Procedure
The procedure used to handle a page fault when the page is not found in the hardware-specific tables but is in physical memory consists of several steps, detailed in this illustration and the following table.
[Diagram: the virtual page number is looked up in the software page frame table, yielding the real page number, which is loaded into the hardware-specific table; real memory already holds the page, so the external page tables (XPT), paging space, SID table, and filesystem file inode are not involved.]
Note: these steps assume the page is in memory, just not in the hardware page tables.
Step 1: A page fault is generated by the address translation hardware. The page might be in real memory, just not in the hardware-specific table due to its size limits.
Step 2: The AIX Virtual Memory Manager first verifies that the requested page is valid. If the page is not valid, a kernel exception is generated.
Step 3: If the page is valid, the VMM searches the software PFT for the page. This processing almost duplicates the hardware processing, but uses the software page tables. The software PFTs are pinned.
Step 4: If the page is found:
• The hardware-specific table is updated with the real page number for this page, and the process resumes execution.
• No page-in of the page occurs.
It is important to remember that the dispatcher is not run. The faulting thread simply continues execution at the instruction that caused the fault. A pseudocode sketch of this path follows.
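The following C-like sketch summarizes this fast reload path. It is only an illustration: the types and helper names (vpage_is_valid, swpft_lookup, hwpt_insert) are hypothetical and do not appear in the AIX source.

    /* Sketch of the "page resident but not in the hardware table" path.
     * All types and helpers are illustrative, not actual AIX code. */
    enum fault_result { RESUME_THREAD, KERNEL_EXCEPTION, PAGE_NOT_RESIDENT };

    struct swpft_entry { int in_memory; unsigned long real_page_no; };

    extern int vpage_is_valid(unsigned long vpn);               /* hypothetical */
    extern struct swpft_entry *swpft_lookup(unsigned long vpn); /* hypothetical */
    extern void hwpt_insert(unsigned long vpn, unsigned long rpn);

    enum fault_result handle_reload_fault(unsigned long vpn)
    {
        struct swpft_entry *e;

        if (!vpage_is_valid(vpn))          /* invalid address: kernel exception */
            return KERNEL_EXCEPTION;

        e = swpft_lookup(vpn);             /* search the pinned software PFT */
        if (e != NULL && e->in_memory) {
            hwpt_insert(vpn, e->real_page_no); /* reload the hardware table */
            return RESUME_THREAD;          /* no page-in; dispatcher not run */
        }
        return PAGE_NOT_RESIDENT;          /* fall through to page-in handling */
    }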
PTEGs
PowerPC processors hash the PFT into Page Table Entry Groups (PTEGs), and these groups may only be able to hold 16 page entries each. Since there may be more than 16 pages that hash into one PTEG, the VMM has to decide which ones are not kept in the PTEG. Then, when a page fault occurs for one of these pages, the VMM only has to reload the PTEG with the page in question, replacing some other page.
Page on Paging Space
Introduction
If the page is not found in real memory, the VMM determines whether it is on paging space or elsewhere on disk. If the page is in paging space, the disk block containing the page is located and the page is loaded into a free memory frame.
Waiting for I/O
Copying a page from paging space to an available frame is not a
synchronous process. Any process or thread waiting for a page fault to
be handled is put to sleep until the page is available.
Procedure
The procedure for loading a page from paging space is shown in this illustration and in the table that follows.
[Diagram: the virtual page number is looked up in the segment ID table, which points to the external page tables (XPT); the XPT address and page number yield the disk block number in paging space. The loaded frame is entered in the software page frame table, and its real page number is placed in the hardware-specific table.]
Step 1: The VMM looks up the object ID for this address in the Segment ID table and gets the External Page Table (XPT) root pointer.
Step 2: The VMM finds the correct XPT direct block from the XPT root.
Step 3: The VMM gets the paging space disk block number from the XPT direct block.
Step 4: The VMM takes the first available frame from the free frame list (the free list contains one entry for each free frame of real memory).
Step 5: If the free frame list is empty, the VMM uses an algorithm to select several active pages to steal.
• If a page to be stolen is modified, an I/O request is issued to write the contents of the selected page to disk.
• Once written, the frames containing the stolen pages are added to the free list, and one is selected to hold the page from paging space.
Step 6: The VMM indicates the device and logical block for the page. An I/O request loads the frame with the data for the faulting page.
Step 7: The disk block is loaded from paging space or the file system.
Step 8: When the I/O completes, the VMM is notified and the thread waiting on the frame is awakened.
Step 9: The hardware PFT is updated, and the process/thread resumes at the faulting instruction.
The net effect is that the process or thread has no knowledge that a page fault occurred, except for a delay in its processing.
External Page Table (XPT)
The XPT maps a page within a working storage segment to a disk block on external storage. The XPT is a two-level tree structure. The first level of the tree is the XPT root block; the second level consists of 256 direct blocks. Each word in the root block is a pointer to one of the direct blocks, and each word of a direct block contains the page state and disk block information for a single page in the segment. Each XPT direct block covers 1MB of the 256MB segment.
[Diagram: the XPT root block holds 256 pointers (0-255), each to one XPT direct block. Direct block 0 covers pages 0-255 (the first 1MB of the segment); direct block 255 covers pages 65280-65535 (the last 1MB, up to 256MB). Each XPT entry maps one page to a disk block in paging space.]
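As a small worked example, the XPT entry for a page can be found with a shift and two divisions. This sketch assumes 4KB pages and the 1MB-per-direct-block layout above; the function and macro names are illustrative.

    #include <stdint.h>

    #define PAGE_SHIFT   12    /* 4KB pages                   */
    #define PAGES_PER_DB 256   /* 1MB direct block / 4KB page */

    /* Locate the XPT direct block and entry for a segment offset. */
    void xpt_index(uint32_t seg_offset, int *direct_block, int *entry)
    {
        uint32_t vpn = seg_offset >> PAGE_SHIFT; /* page within the segment    */
        *direct_block = vpn / PAGES_PER_DB;      /* which of 256 direct blocks */
        *entry        = vpn % PAGES_PER_DB;      /* which word in that block   */
    }

For example, segment offset 0x0FF3000 is page 4083, which falls in direct block 15, entry 243.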
Paging Space Allocation Policy
AIX offers two policies for allocating paging space. If the environment variable PSALLOC=early is set, the early allocation policy is used, which causes a disk block to be allocated whenever a memory request is made. This guarantees that the paging space will be available if it is needed.
If the environment variable is not set, the default late allocation policy is used and a disk block is not allocated until it becomes necessary to page out the page. This policy decreases paging space requirements on large-memory systems which do little paging.
Paging Device Table (PDT)
The Paging Device Table (PDT) contains an entry for every device referenced by the VMM. It is used for filesystem, paging, log, and remote pages. There is a pending I/O list associated with the PDT; it contains all page frames awaiting I/O for the device. Page frames are removed from the list as soon as the I/O has been dispatched to the device.
Loading Pages From The Filesystem
Introduction
Persistent pages do not use the XPT (eXternal Page Table). The VMM uses the information contained in the file's inode structure to locate the pages of the file.
Procedure
Persistent pages are paged in from local files located on a filesystem. A local file has a segment allocated and has an entry (SID) in the Segment Information Table. The SID entry points to the inode, allowing the VMM to find and page in the faulting block.
[Diagram: the virtual page number is looked up in the segment ID table, whose entry supplies the inode address; the filesystem file inode yields the disk block number. The loaded frame is entered in the software page frame table and its real page number placed in the hardware-specific table. The external page tables (XPT) and paging space are not used for persistent pages.]
Filesystem I/O
Introduction
The paging functions of the VMM are also used to perform reads and writes to files by processes.
File system objects
File system reads and writes occur by attaching the appropriate file system object and performing loads/stores between the mapped object and the user buffer. This means that file objects are not directly addressable in the current address space but instead are temporarily attached.
A local file has a segment allocated and has an entry (SID) in the Segment Information Table. The file's gnode contains the information about which segment belongs to the particular file.
Persistent pages
AIX uses a large portion of memory as the filesystem buffer cache. The pages for files compete for storage the same way as other pages. The VMM schedules modified persistent pages to be written to their original location on disk when:
• the VMM needs the frame for another page
• the file is closed
• a sync operation is performed
The sync operation can be performed by the syncd daemon running on the system (by default the syncd daemon runs every 60 seconds), by calling the sync() function, or by running the sync command. Scheduling does not mean that the data is written to disk at once.
Free Memory and Page Replacement
Introduction
To maintain system performance, the VMM always wants some physical memory to be available for page-ins. This section describes the free memory list and the algorithms used to keep pages on the list.
Free memory list
The VMM maintains a linked list containing all the currently free real
memory pages in the system. When a page fault occurs, VMM just takes
the first page from this list to assign to the faulting page. When the free
frame list is empty and a page fault occurs, VMM selects several active
pages to be stolen (usually around 20 or so), and all these pages are then
added to the free list. This reduces the amount of time spent starting and
running the steal routines.
Page Replacement Algorithm
The method used to select a page to be replaced is called the page replacement algorithm. The mechanism used to determine which pages to steal is a pseudo-LRU (Least Recently Used) algorithm called the clock-hand algorithm. This algorithm is commonly used in operating systems when the hardware provides only a reference bit for each page in physical memory. The hardware automatically sets the reference bit for a page translation whenever the page is referenced. The clock-hand algorithm checks frames in frame-number order, looking for pages that have not been referenced since the last time the algorithm looked at them. If a page has been referenced since the last time the algorithm looked at the frame, the algorithm clears the reference bit and goes on to the next frame. If the page has not been referenced since the last time the algorithm looked at the frame, the page is stolen. A code sketch of this scan appears below.
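A minimal C sketch of the clock-hand scan described above. The array, field, and helper names are illustrative, and a real VMM also honors pinning, thresholds, and page type before stealing.

    #define NFRAMES 1024                    /* illustrative frame count */

    struct frame { int referenced; int modified; int page_no; };
    static struct frame frames[NFRAMES];
    static int hand;                        /* current clock-hand position */

    extern void write_to_disk(int page_no); /* hypothetical page-out helper */

    int steal_one_frame(void)
    {
        for (;;) {
            struct frame *f = &frames[hand];
            hand = (hand + 1) % NFRAMES;    /* advance the clock hand */

            if (f->referenced) {
                f->referenced = 0;          /* second chance: clear and move on */
                continue;
            }
            if (f->modified)
                write_to_disk(f->page_no);  /* write the page before stealing */
            return (int)(f - frames);       /* this frame may now be reused */
        }
    }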
Clock Hand
The algorithm is called the clock-hand algorithm because it acts like a clock hand that constantly points at frames in order, advancing whenever the algorithm moves to the next frame. If a modified page is stolen, the clock-hand algorithm writes the page to disk (to paging space or a file system) before stealing the page.
[Diagram: physical pages arranged in a circle with the clock hand rotating past them. The reference bit is changed to zero when the clock hand passes; a page whose reference bit is already 0 is eligible to be stolen.]
vmtune
Introduction
A certain number of pages of each type must be retained in memory to maintain system performance. The VMM keeps statistics for each page type and enforces thresholds in the page replacement algorithm. When the number of pages of a type approaches a threshold, the page replacement algorithm selects the proper pages for replacement and favors other pages; the VMM takes appropriate action to bring the state of memory back within bounds.
VMM Tunable Parameters
The vmtune command changes the operational parameters of the Virtual Memory Manager that control these thresholds.
Parameter descriptions:
minfree: Page replacement is invoked whenever the number of free page frames falls below this threshold.
maxfree: The page replacement algorithm replaces enough pages so that this number of frames is free when it completes.
LruBucket: Specifies the size (in 4K pages) of the least recently used (LRU) page-replacement bucket. This is the number of page frames which will be examined at one time for possible page-outs when a free frame is needed. A lower number will result in lower latency when looking for a free frame, but will also result in behavior that is not as much like a true LRU algorithm.
MaxPin: Specifies the maximum percentage of real memory that can be pinned. The default value is 80. If this value is changed, the new value should ensure that at least 4MB of real memory will be left unpinned for use by the kernel.
minperm: Specifies the point below which file pages are protected from the repage algorithm. This value is a percentage of the total real-memory page frames in the system. The specified value must be greater than or equal to 5.
MaxPerm: Specifies the point above which the page stealing algorithm steals only file pages. This value is expressed as a percentage of the total real-memory page frames in the system. The specified value must be greater than or equal to 5.
MinPgAhead: Specifies the number of pages with which sequential read-ahead starts. This value can range from 0 through 4096. It should be a power of 2.
MaxPgAhead: Specifies the maximum number of pages to be read ahead. This value can range from 0 through 4096. It should be a power of 2 and should be greater than or equal to MinPgAhead.
NpsWarn: Specifies the number of free paging-space pages at which the operating system begins sending the SIGDANGER signal to processes. The default value is 512.
Fatal Memory Exceptions
Introduction
Not all page and protection faults can be handled by the OS. When a fault occurs that cannot be handled by the OS, the system panics and immediately halts.
Fatal memory exceptions
In all of the following cases, the VMM bypasses all kernel exception handlers and immediately halts the system:
• A page fault occurs in the interrupt environment.
• A page fault occurs with interrupts partially disabled.
• A protection fault occurs while in kernel mode on kernel data.
• The system is out of paging space, or an I/O error occurs on kernel
data.
• An instruction storage exception occurs while in kernel mode.
• A memory exception occurs while in kernel mode without an exception
handler set up.
Memory Objects (Segments)
Introduction
Each segment has a unique segment ID in the segment table. There are a number of important segment types in AIX:
• kernel
• user text
• shared library text
• shared data
• process private
• shared library data
Kernel segment
This segment is described separately for Power and IA-64 in their respective lessons.
User text
The user text segment contains the code of the program. Threads in user mode have read-only access to the text segment to prevent modification while the program is running. This protection allows a single copy of a text segment to be shared by all processes associated with the same program. For example, if two threads in the system are running the ls command, the instructions of ls are shared between them.
Running a debugger
When a debugger is run on a program, a private read/write copy of the text segment is used. This allows debuggers to set breakpoints directly in the code. In that case the status of the text segment is changed from shared to private.
Shared Library Text
The shared library text segment contains mappings whose addresses are common across all processes. A shared library text segment:
• Contains a copy of the program text (instructions) for the shared libraries currently in use in the system.
• Is added to the user address space by the loader when the first shared library is loaded.
• Each process using text from this segment has a copy of the corresponding data in the per-process shared library data segment.
Executable modules list the shared libraries they need at exec() time. The shared library text is loaded into this segment when a module is loaded via the exec() system call, or a program may issue load() calls to get additional shared modules.
Per-Process Shared Library Data Segment
Functions in a shared library may have data that cannot be shared between processes; such data is loaded as process-private data.
• This segment holds items required by modules in the shared text segment(s).
• There is one of these segments for each process.
• Addresses of data items are generally the same across processes.
• The data itself is not shared.
The shared library data segment acts like an extension of the process private segment.
Shared data
Mapped memory regions, also called shared memory areas, can serve as
a large pool for exchanging data among processes.
Process private
The process private segment is not shared with other processes. It contains:
• user data (for 32-bit programs that aren’t maxdata programs)
• the user stack (for 32-bit programs)
• text and data from explicitly loaded modules (for 32-bit programs)
• kernel per-process data (accessible only in kernel mode)
• primary kernel thread stack (accessible only in kernel mode)
• per-process loader data (accessible only in kernel mode)
Shared Memory segments
Introduction
Mapped memory regions, also called shared memory areas, can serve as
a large pool for exchanging data among processes.
• A process can create and/or attach a shared data segment that is
accessible by other processes.
• A shared data segment can represent a single memory object or a
collection of memory objects.
• Shared memory can be attached read-only or read-write.
Benefit
Shared memory areas can be most beneficial when the amount of data to
be exchanged between processes is too large to transfer with messages,
or when many processes maintain a common large database.
Methods of Sharing
The system provides two methods of sharing memory:
• Mapping file data into the process address space (mmap() services).
• Mapping processes to anonymous memory regions that may be shared
(shmat services).
Shared memory address
Shared memory is process based and can be attached at different effective addresses in different processes.
[Diagram: the same real-memory region is attached at different effective addresses in the effective address spaces of process A and process B.]
Serialization
There is no implicit serialization support when two or more processes
access the same shared data segment. The available subroutines do not
provide locks or access control among the processes. Therefore,
processes using shared memory areas must set up a signal or semaphore
control method to prevent access conflicts and to keep one process from
changing data that another is using.
shmat Memory Services
Introduction
The shmat services are typically used to create and use shared memory objects from a program.
shmat functions
Your program can use the following functions to create and manage shared memory segments:
• shmctl() - Controls shared memory operations
• shmget() - Gets or creates a shared memory segment
• shmat() - Attaches a shared memory segment to a process
• shmdt() - Detaches a shared memory segment from a process
• disclaim() - Removes a mapping from a specified address range within a shared memory segment
Using shmat
The shmget() system call is used to create a shared memory region; when supporting objects larger than 256MB, it creates multiple segments. The shmat() system call is used to gain addressability to a shared memory region. A short usage sketch follows.
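A minimal usage sketch of these calls (standard System V shared memory interfaces; the 4KB size is arbitrary and error handling is abbreviated):

    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* Create a 4KB shared memory region. */
        int id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
        if (id == -1) { perror("shmget"); return 1; }

        /* Attach it; the kernel chooses the effective address. */
        char *p = shmat(id, NULL, 0);
        if (p == (char *)-1) { perror("shmat"); return 1; }

        strcpy(p, "hello");          /* plain loads/stores, no read()/write() */

        shmdt(p);                    /* detach from this process */
        shmctl(id, IPC_RMID, NULL);  /* remove the region        */
        return 0;
    }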
Limitations
Currently, shmget() on the 64-bit kernel is limited to 8 segments, even for 64-bit applications; thus the largest shared memory region that can be created is 2GB. This limitation will be removed for 64-bit applications that perform the shmget(): there will then be no explicit limitation other than what system resources will bear. 32-bit applications will still retain the 2GB limitation.
EXTSHM
The environment variable EXTSHM=ON allows shared memory regions to be created with page granularity instead of the default segment granularity, allowing more shared memory regions within the same sized address space but no increase in the total amount of shared memory region space.
When to use
Use the shmat() services under the following circumstances:
• When mapping files larger than 256MB.
• When, for a 32-bit application, eleven or fewer files are mapped simultaneously and each is smaller than 256MB.
• When mapping shared memory regions which need to be shared among unrelated processes (no parent-child relationship).
• When mapping entire files.
Memory Mapped Files
Introduction
Shared segments can be used to map any ordinary file directly into
memory.
• Instead of reading and writing the file, the program simply loads or stores in the segment.
• This avoids buffering of the I/O data in the kernel.
• This provides easy random access, as the file data is always available.
• This avoids the system call overhead of read() and write().
• Either shmat() or mmap() system calls can be used
File mapping
The system allows file mapping at the user level. This allows a program to
access file data through loads and stores to its virtual address space. This
single level store approach can also greatly improve performance by
creating a form of Direct Memory Access (DMA) file access. Instead of
buffering the data in the kernel and copying the data from kernel to user,
the file data is mapped directly into the user’s address space.
Shared files
A file can even be shared between multiple processes when some are using mapping and others are using the read/write system call interface. Of course, this may require some sort of synchronization scheme between the processes.
shmat to map files
When using shmat to map a file, an open file descriptor is used in place of the shared memory ID. Once the file segment is mapped, it is treated like any other shared segment and can be shared with other processes.
mmap services
The mmap services are typically used for mapping files, although they may also be used for creating shared memory segments.
• madvise() - Advises the system of a process' expected paging
behavior
• mincore() - Determines residency of memory pages
• mmap() - Maps a file-system object into virtual memory
• mprotect() - Modifies the access protections of memory mapping
• msync() - Synchronizes a mapped file with its underlying storage
device
• munmap() - Un-maps a mapped memory region
Both the mmap and shmat services provide the capability for multiple processes to map the same region of an object such that they share addressability to that object. However, the mmap subroutine extends this capability beyond that provided by the shmat subroutine by allowing a relatively unlimited number of such mappings to be established. A short file-mapping sketch follows.
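A minimal file-mapping sketch using the POSIX calls listed above (the file name is hypothetical and error handling is abbreviated):

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/tmp/example.dat", O_RDWR);   /* hypothetical file */
        if (fd == -1) { perror("open"); return 1; }

        struct stat st;
        fstat(fd, &st);

        /* Map the whole file shared: stores propagate to the file. */
        char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        p[0] = '#';                      /* a store instead of write()   */
        msync(p, st.st_size, MS_SYNC);   /* flush to the underlying file */

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }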
When to use mmap
Use mmap under the following circumstances:
• Portability of the application is a concern.
• Many files are mapped simultaneously.
• Only a portion of a file needs to be mapped.
• Page-level protection needs to be set on the mapping.
• Private mapping is required.
Mapping Types
There are three mapping types:
• read-write mapping
• read-only mapping
• deferred-update mapping
Read-Write Mapping
Read-write mapping allows loads and stores in the segment to behave like reads and writes to the corresponding file. If a thread loads beyond the end of the file, the load returns zero values.
Read-only Mapping
Read-only mapping allows only loads from the segment. The operating system generates a SIGSEGV signal if a program attempts an access that exceeds the access permission given to a memory region. Just as with read-write access, a thread that loads beyond the end of the file loads zero values.
Deferred Update Mapping
Deferred update mapping also allows loads and stores to the segment to behave like reads and writes to the corresponding file. The difference between this mapping and read-write mapping is that the modifications are delayed: any store into the segment modifies the segment but does not modify the corresponding file.
With deferred update, the application can begin modifying the file data (by memory-mapped loads and stores) and then either commit the modifications to the file system (via fsync()) or discard them completely. This can greatly simplify error recovery and allows the application to avoid a costly temporary file that might otherwise be required.
Data written to a file that a process has opened for deferred update (with the O_DEFER flag) is not written to permanent storage until another process issues an fsync() subroutine against the file or runs a synchronous write subroutine (with the O_SYNC flag) on the file. A sketch of this flow follows.
Unit 9. IA-64 Virtual Memory Manager
Objectives
After completing this unit, you should be able to:
• List the size of the effective and virtual address space on the IA64 platform.
• Show how regions, region registers, and region IDs are used in AIX 5L.
• Name the region register that is used to identify a process's private region.
• Given an address, identify the region to which it belongs.
References
• Intel IA-64 Architecture Software Developer's Manual
IA-64 Addressing Introduction
Introduction
AIX 5L on the IA-64 platform is designed as a 64-bit kernel. Unlike the Power version of AIX 5L, no 32-bit kernel is available. This lesson describes the address translation mechanism used by AIX 5L on the IA64 platform.
Overview
The IA-64 platform provides an effective address space that is 64 bits wide.
• The effective address space is divided into eight regions.
• Each region has a region register associated with it (rr0 - rr7).
• The region registers, under control of the OS, supply an additional 24 bits of addressing, creating an 85-bit virtual address space.
ILP32
In addition to a 64-bit programming model, AIX 5L provides a 32-bit address environment (ILP32). The ILP32 address space is 4 GB. A zero-extension model is used to convert 32-bit addresses to 64 bits for address translation. The ILP32 effective address space is completely contained in the first 4 GB of the 64-bit model.
Regions
Introduction
The 64-bit effective address space is broken into 8 regions. This section describes how the regions are addressed.
Region selector
The 64-bit effective address space consists of 8 regions, each region addressed by 61 bits. A region is selected by the upper 3 bits of the effective address. Each region has a region register associated with it (rr0 - rr7) that contains a 24-bit Region IDentifier (RID) for the region. When translating effective addresses to virtual addresses, the 24-bit region identifier is combined with the lower 61 bits of the effective address to form an 85-bit virtual address.
[Diagram: bits 63-61 of the effective address (3 bits) select one of the region registers; the selected register supplies a 24-bit region ID, which is concatenated with the remaining 61 bits of the address: 2^61 x 2^24 = 2^85.]
Managing region registers
The AIX 5L operating system manages the contents of the region registers. An address space is made accessible to a process by loading the proper RID into one of the eight region registers. The small sketch below illustrates the selection step.
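A small illustration of the region-selection arithmetic (a sketch only; the 85-bit concatenation is done by the hardware and is shown here purely as shift-and-mask arithmetic on 64-bit values):

    #include <stdint.h>
    #include <stdio.h>

    /* Which of the 8 regions does an effective address belong to? */
    static unsigned vrn(uint64_t ea) { return (unsigned)(ea >> 61); }

    /* The 61-bit offset within that region. */
    static uint64_t region_off(uint64_t ea) { return ea & ((1ULL << 61) - 1); }

    int main(void)
    {
        uint64_t ea = 0x4000000100000000ULL;   /* example address */
        printf("region %u, offset 0x%llx\n",
               vrn(ea), (unsigned long long)region_off(ea));
        /* Prints region 2; per the region usage table later in this
         * unit, region 2 holds LP64 shmat mappings. */
        return 0;
    }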
Region Registers
Introduction
Each region register contains a Region IDentifier (RID) and region attributes. The fields making up the region registers are detailed here:
[Diagram: region register layout, with bits 63-32 reserved (rv), bits 31-8 rid, bits 7-2 ps, bit 1 reserved (rv), and bit 0 ve.]
Field descriptions:
rv: Reserved.
ve: VHPT walker enable. 1 = the VHPT walker is enabled for the region; 0 = the VHPT walker is disabled for the region.
ps: Preferred page size. Selects the virtual address bits used by the hash function for the TLB or VHPT.
rid: 24-bit region identifier.
Address Translation
Introduction
The VMM software in AIX 5L works closely with the hardware to translate effective addresses to addresses in physical memory.
VMM hardware
This diagram and the table that follows describe the hardware components and the process used to perform address translation.
[Diagram: the top 3 bits (63-61) of the effective address select a region register (rr0-rr7), which supplies a 24-bit region ID. The region ID and the virtual page number form the virtual address, which is searched in the Translation Lookaside Buffer; the matching entry supplies a key, checked for rights against the protection key registers, and a physical page number, which is combined with the page offset to form the physical address.]
Address Translation Details
This table describes the process of address translation.
Step 1: The effective address contains three parts: the Virtual Region Number (VRN), the Virtual Page Number (VPN), and the page offset.
Step 2: The 3 VRN bits are used to select a region register.
Step 3: The region register provides a 24-bit region ID.
Step 4: The region ID and the virtual page number are used to search for an address translation in the TLB or the hardware-maintained page tables.
Step 5: If no match is found, a page fault is generated, transferring control to the OS. The OS must resolve the fault by making a page available and updating the translation tables.
Step 6: A successful translation produces a physical page number. This page number is combined with the page offset to produce a physical address.
32-bit Address Translation
32-bit address translation is done the same way as 64-bit translation. There is no bit in the processor hardware indicating whether it is working in 32-bit or 64-bit mode, as there is on POWER.
Translation Lookaside Buffer
The cache of active virtual memory addresses is called the Translation Lookaside Buffer (TLB). The TLB contains Page Table Entries (PTEs) that were recently used; it stores recently used virtual addresses and the corresponding physical addresses.
Single vs. Multiple Address Space
Introduction
The IA-64 model provides the ability to use either a single or a multiple address space model. These models are described in this section.
Single Address Space (SAS)
In a single address space model, all processes on the system share one address space. Such a model is possible due to the enormous size of a 64-bit address space as opposed to a 32-bit one. The term single address space refers to the use of shared regions containing objects mapped at a unique global address. For such mappings a common region ID and page number are provided.
Multiple Address Space (MAS)
In this model each process has a private address space. Not all of the 8 regions can be used by a process, because the operating system must be mapped on top of one or more of the regions. Each process private region has a unique RID associated with it.
Address Space on IA-64
The address space model used by AIX on IA-64 combines attributes of both MAS (multiple address space) and SAS (single address space). Region 0 is defined by the operating system to be a process private region. Each process is assigned a unique RID for that region, which is loaded into the region register each time the process is dispatched. Therefore region 0 provides what is effectively a MAS model.
All other regions are treated as shared address space (SAS); as such, the region IDs for those regions are constant and do not need to be changed at context switch. SAS usage is necessary to achieve the desired degree of sharing of address translations for shared objects: to achieve a single translation for an object, all accesses must be made through a common global address.
The sharing semantic (private, globally shared, shared-by-some) is determined by whether or not multiple processes utilize the same RID and also, in the case of shared-by-some, whether they have access to specific protection keys.
AIX 5L Region Usage
Introduction
The region identifier (RID), much like the POWER segment identifier (SID), participates in the hardware address translation: in order to share the same address translation, the same RID must be used. For a process to share a memory region with another process (or the kernel), the same RID must be loaded in the region register in both processes' contexts.
Region Usage Table
The following table shows the kernel usage model for the 8 virtual regions.
VRN 0, MAS, Private: process data, stack, heap, mmap, ILP32 shared library text, ILP32 main text, u-block, kernel thread stacks/msts
VRN 1, SAS/MAS, Text: LP64 shared library text, LP64 main text
VRN 2, SAS: LP64 shmat
VRN 3, SAS: LP64 shmat
VRN 4, n/a: reserved
VRN 5, SAS, Temp: kernel temporary attach, global buffer pool
VRN 6, SAS, Kernel2: kernel global with large page size
VRN 7, SAS, Kernel: kernel global
Region Usage Details
Region usage is detailed here:
• Region 0 is the process private region. Only the running process has access to its own private region.
• Region 1 is dedicated to mappings of LP64 executable text. This includes globally shared text, such as shared libraries, and shared-by-some text, such as the main text of a program. This region is SAS under normal circumstances and is MAS when the process is being debugged.
• Regions 2-3 are the primary residence of shared non-text mappings, which include user mappings via shmat.
• Region 4 is reserved for future use.
• Region 5 is dedicated to support of kernel temporary attach. In AIX 5L the temporary attach mechanism has been adapted to promote the SAS model.
• Regions 6-7 contain kernel global mappings.
ILP32
The address space of a 32-bit program (using the ILP32 model) runs from 0 to 4GB and is contained solely in region 0.
Private segment
Providing process data, heap, and stack, as well as per-process kernel information such as the u-block, in a single private segment means that just that segment needs to be copied across fork (e.g. copy-on-write semantics).
Memory Protection
Introduction
The IA-64 architecture provides two methods for applying protection to a
page:
• Access rights for each translation.
• Protection keys
Protection Keys
Protection keys are used to control which processes have access to individual objects in the single address space, to achieve a shared-by-some semantic such as exists for shmat objects.
There is a special bit in the hardware; when this bit is turned on (1), memory references go through protection key access checks during address translation.
There are also protection key registers (at least 16); the VMM manages them and keeps track of the individual entries.
Protection key register fields
Field and usage:
v: Valid bit. When 1, the register contains a valid key.
wd: Write disable. When 1, write permission is denied.
rd: Read disable. When 1, read permission is denied.
xd: Execute disable. When 1, execute permission is denied.
key: Protection key (18-24 bits).
Process
The process of memory access using protection keys is described in this table; a code sketch follows.
Step 1: During an address translation, the hardware identifies a protection key for the page being translated.
Step 2: The protection key of the translation is checked against the protection keys found in the protection key registers (stored by the OS).
Step 3: If the match succeeds, the protection rights are applied to the translation. The access can be allowed or denied based on the protection key value.
Step 4: If the access is not allowed, a protection key permission fault is raised and control goes to the VMM.
Step 5: If no match is found (from step 2), a protection key miss fault is raised and the VMM inserts the correct protection key into the protection key registers.
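A pseudocode sketch of the check in steps 2-5. The register array, sizes, and result names are illustrative; the real lookup is performed by the hardware, with the VMM involved only on a fault.

    /* Illustrative protection key check; pkr[] and the result codes
     * are hypothetical names, not AIX or IA-64 definitions. */
    enum pk_result { ACCESS_OK, KEY_PERMISSION_FAULT, KEY_MISS_FAULT };

    #define NPKR 16                      /* at least 16 key registers */
    struct pkr { int v, wd, rd, xd; unsigned key; };
    static struct pkr pkr[NPKR];         /* loaded by the VMM */

    enum pk_result check_access(unsigned page_key, int is_write, int is_exec)
    {
        int i;
        for (i = 0; i < NPKR; i++) {
            if (!pkr[i].v || pkr[i].key != page_key)
                continue;                /* no match in this register */
            if (is_write && pkr[i].wd) return KEY_PERMISSION_FAULT;
            if (is_exec  && pkr[i].xd) return KEY_PERMISSION_FAULT;
            if (!is_write && !is_exec && pkr[i].rd)
                return KEY_PERMISSION_FAULT;
            return ACCESS_OK;            /* rights applied to translation */
        }
        return KEY_MISS_FAULT;           /* VMM must load the correct key */
    }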
Protection Key Example
An example of protection key usage is described in this illustration and table.
[Diagram: process A's address space and process B's address space both map a shared object in the virtual address space.]
Step 1: A shared object is assigned the protection key 0x1.
Step 2: Processes A and B share the object with the following permissions:
• Process A has read/write access to the object.
• Process B has read-only access to the object.
Step 3: When A is running, the VMM loads a protection key register with 0x1 and the 'wd' and 'rd' bits cleared. The process can read and write all pages in the object.
Step 4: When B is running, the VMM loads a protection key register with 0x1 and the 'rd' bit cleared. The process can only read pages in the object.
Access Rights
In addition to the protection key mechanism, the IA-64 architecture provides page protection by associating access and privilege level information with each translation. However, the majority of the page access rights support in AIX 5L is in the common code base shared with POWER. Therefore the software mechanisms for dealing with page protection were all left as is, so that the upper layers conform to the POWER access rights mechanisms. These consist of:
• per-segment K bits
• POWER-style per-page protection bits.
At the low, platform-dependent layer, these POWER-style protections are translated to the IA-64 hardware representation.
LP64 Address Space
Introduction
Segments and segment services are used for the management of objects on both POWER and IA64.
Segments on IA64
The segment model was originally developed with the Power hardware architecture in mind. A segment can be thought of as a hardware object on Power: selection of the segment is made directly by the hardware's translation of a virtual address. As we have seen, the IA64 hardware addresses memory by regions, and a region is a much larger area of the virtual address space than a segment. On IA64 the software manages segments on top of the region model; therefore, on IA64 a segment is a software object, not a hardware one.
The user space segment model on IA64 is shown in this table:
ESID (hex)                 Name
0000_0000_0-0000_0000_F    Low 4GB Reserved
0000_0001_0-0000_0001_F    Aliased Main Text
0000_0002_0-0000_0002_F    Private Dynamically Loaded Text
0000_0003_0-0000_0003_F    Private Data, BSS
0000_0100_0-0001_FFFF_F    Private Heap
0002_0000_0-0002_FFFF_F    Default Mmap, Aliased Shmat
0003_0000_0-0003_FEFF_F    User Stack
0003_FF00_0-0003_FF00_2    Kernel reserved
0003_FF00_2-0003_FF00_3    Process Private Segment
0003_FF00_3-0003_FF0F_F    Kernel Thread Segments
0003_FF10_0-0003_FFFF_F    Kernel reserved
2000_0001_0-2000_0001_F    LP64 Shared Library Text
2000_0100_0-2003_FFFF_F    Main Text
4000_0001_0-4003_FFFF_F    Global Shmat (normal page size)
6000_0001_0-6003_FFFF_F    Global Shmat (superpage)
ILP32 Address Space
Introduction
The layout of the 4GB ILP32 address space is principally the same as that
for POWER 32-bit applications. The motivations for preserving this layout
for IA64 are compatibility and performance.
This table details the segment usage for the ILP32 model:
ESID   Name         Example Uses
0      n/a          Not used
1      Text         Main text
2      Private      main+libc data, stack, heap, u-block, kernel stack
3-12   n/a          shmat, mmap
13     Shlib Text   Shared library text
14     n/a          shmat, mmap
15     Shlib Data   Post-exec data, private text
Big Data Model
A big data model is supported for 32-bit applications on POWER. This allows an application to specify maximum requirements for heap, data, and stack. Such a model is required for programs which exceed the limits imposed by the normal 32-bit address space (i.e. a shared 256MB segment for heap, data, and stack). This model will also be supported on IA64 for 32-bit applications in future releases.
Exercise
Introduction
Complete the following written exercise and the lab exercise on the
following page.
Test yourself
Complete the following questions.
1. What is the effective address size for a 64-bit process?
A. 32 bits
B. 64 bits
C. 85 bits
2. What is the virtual address size on the IA-64 platform?
A. 32 bits
B. 64 bits
C. 85 bits
3. One of the eight region registers is used for each address translation. How is the region register selected?
4. A 64-bit process running on AIX 5L on IA-64 hardware has a private region of memory. In which region is it located?
Lab
Follow the instructions in this table to complete the lab.
Step 1: Log on to your IA64 lab system.
Step 2: su to root and start the iadb utility.
$ su
# iadb
Step 3: Display the thread structure for the current context using the command:
0> th
The thread structure displayed will be the thread for the running iadb process.
Step 4: Look for the field labeled t_procp; it contains a pointer to the proc structure. Examine this address. What region is this address in?
Step 5: Look for the field labeled userp; it contains a pointer to the thread's user area. Examine this address. What region is this address in?
Step 6: Of the two addresses you examined, which one is in the process's private region?
Unit 10. IA-64 Linkage Convention
Unit 11. LVM
Lesson Objectives
At the end of this module the student should have gained knowledge about the following. The student will:
• Have an overview of the LVM and be able to identify LVM components such as:
  - Logical volumes
  - Physical volumes
  - Mirroring, and parameters for mirroring
  - Striping, and parameters for striping
• Know the physical disk layout on Power
• Know the physical disk layout on IA-64
• Know the LVM physical layout, including the VGDA and VGSA
• Know the function of LVM Passive Mirror Write Consistency
• Know the function of the LVM hot spare disk
• Know the function of LVM hot spot management
• Know the function of LVM online backup (4.3.3)
• Know the function of the LVM variable logical track group (LTG)
• Know the function of each of the high-level LVM commands
• Trace LVM commands with the trace command
• Know the function of LVM library calls
• Know briefly about disk device calls
• Know briefly about low-level disk device calls, such as SCSI calls and SSA
Furthermore, it is an objective that the student gain experience with the content of this section through exercises. The exercises will:
• Examine the physical disk layout of a logical volume and a physical volume
• Examine the impact of LVM Passive Mirror Write Consistency
• Examine the function of the LVM LTG
• Trace some LVM system activity
Platform
This lesson is independent of platform.
References
http://w3.austin.ibm.com/:/projects/tteduc/ (Technology Transfer Home Page)
Logical Volume Manager overview
Introduction
The Logical Volume Manager (LVM) is the layer between the operating system (AIX) and the physical hard drives; the LVM provides reliable data storage (logical volumes) to the OS. The LVM makes use of the underlying physical storage but hides the actual physical drives and drive layout. This section explains how this is done, how the data can be traced, and which parameters impact performance in different scenarios.
Physical volume
A hierarchy of structures is used to manage fixed-disk storage. Each individual fixed-disk drive, called a physical volume (PV), has a name, such as /dev/hdisk0. Every physical volume in use belongs to a volume group (VG). All of the physical volumes in a volume group are divided into physical partitions (PPs) of the same size (by default 2MB in volume groups that include physical volumes smaller than 300MB, 4MB otherwise). For space-allocation purposes, each physical volume is divided into five regions (outer_edge, inner_edge, outer_middle, inner_middle, and center). The number of physical partitions in each region varies, depending on the total capacity of the disk drive.
Within each volume group, one or more logical volumes (LVs) are defined.
Logical volume
Logical volumes are groups of information located on physical volumes.
Data on logical volumes appears to be contiguous to the user but can be
discontiguous on the physical volume. This allows file systems, paging
space, and other logical volumes to be resized or relocated, span multiple
physical volumes, and have their contents replicated for greater flexibility
and availability in the storage of data.
Each logical volume consists of one or more logical partitions (LPs). Each
logical partition corresponds to at least one physical partition. If mirroring is
specified for the logical volume, additional physical partitions are allocated
to store the additional copies of each logical partition.
Physical disks
A disk must be designated as a physical volume and be put into an available state before AIX can assign it to a volume group. A physical volume has certain configuration and identification information written on it. This information includes a physical volume identifier and, for IA-64, partition information for the disk. When a disk becomes a physical volume, it is divided into 512-byte physical blocks.
The first time you start up the system after connecting a new disk, AIX detects the disk and examines it to see if it already has a unique physical volume identifier in its boot record. If it does, the disk is designated as a physical volume, and a physical volume name (typically hdiskx, where x is a unique number on the system) is permanently associated with that disk until you undefine it.
Volume groups
The physical volume must become part of a volume group before it can be utilized by the LVM. A volume group is a collection of 1 to 32 physical volumes of varying sizes and types. A physical volume may belong to only one volume group. By default the system allows you to define up to 256 logical volumes per volume group, but the actual number you can define depends on the total amount of physical storage defined for that volume group and the size of the logical volumes you define. There can be up to 255 volume groups per system.
A VG that is created with standard physical and logical volume limits can be converted to big format, which can hold up to 128 PVs and up to 512 more LVs. This operation requires that there be enough free partitions on every PV in the VG for the Volume Group Descriptor Area (VGDA) expansion.
MAXPVS: 32 (128 for a big VG). MAXLVS: 255 (512 for a big VG).
Logical Storage Management limits:
• Volume groups: 255 per system
• Physical volumes: (MAXPVS / volume group factor) per volume group
• Physical partitions: (1016 x volume group factor) per physical volume; partition size = 1, 2, 4, 8, 16, 32, 64, 128, or 256 MB
• Logical volumes: MAXLVS per volume group
• Logical partitions: (MAXPVS x 1016) per logical volume
Physical partitions (PP)
In the design of LVM, each logical partition maps to one physical partition,
and each physical partition maps to a number of disk sectors. The design
of LVM limits the number of physical partitions that LVM can track per disk
to 1016. In most cases, not all of the possible 1016 tracking partitions are
used by a disk. The default size of each physical partition during a "mkvg"
command is 4 MB, which implies that individual disks up to 4 GB can be
included in a volume group.
If a disk larger than 4 GB is added to a volume group (based on the 4 MB
physical partition size), the disk addition will fail with a warning
message that the physical partition size needs to be increased. There are
two instances where this limitation is enforced. The first case is when
the user tries to use "mkvg" to create a volume group where the number of
physical partitions on one of the disks in the volume group would exceed
1016. In this case, the user must pick from the available physical partition
sizes of 1, 2, (4), 8, 16, 32, 64, 128, and 256 megabytes and use the "-s"
option to "mkvg". The second case is where a disk that violates the 1016
limitation is attempting to join a pre-existing volume group with the
"extendvg" command. The user can either recreate the volume group with a
larger physical partition size (which will allow the new disk to work within
the 1016 limitation) or create a stand-alone volume group (with a larger
physical partition size) for the new disks.
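As a hedged sketch of the first case, assume hdisk2 is a 9.1 GB disk: with the default 4 MB partitions it would need more than 2200 partitions, so a larger partition size must be selected with -s:
    mkvg -y datavg hdisk2          # fails: 9.1 GB / 4 MB is far more than 1016 partitions
    mkvg -s 16 -y datavg hdisk2    # 16 MB partitions: roughly 580 per disk, within the limit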
Device drivers, hierarchy, and interfaces to LVM devices
The figure shows the interfaces to the LVM at different layers. Starting from
the top, the file system (JFS or J2) uses the LVM device driver (LVMDD)
interface to access LVs; the LVMDD uses the disk DD to access the physical
disk, which is handled by the SCSI DD or the SSA DD, depending on the type of
disk. There are also interfaces and commands to manipulate the LVM system:
the high-level commands, such as mklv, are complex commands written as shell
scripts. These scripts use basic LVM commands, such as lcreatelv, which are
AIX binaries that perform the operations. The basic commands are written in C
and use the LVM API, liblvm.a, to access the LVM.
  JFS            High level commands
   |                     |
   |               LVM commands ---- liblvm.a
   |                     |
   +------> LVM DD <-----+
                |
             Disk DD
             /      \
        SCSI DD    SSA DD
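One way to observe this split on a system (a sketch; the paths shown are the usual locations and may vary):
    file /usr/sbin/mklv        # reports a shell script: a high-level command
    file /usr/sbin/lcreatelv   # reports an executable: a basic LVM command built on liblvm.a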
VGDA description
The VGDA is an area at the front of each disk which contains information
about the volume group, the logical volumes that reside in the volume
group, and the disks that make up the volume group. Each disk in a volume
group holds a VGDA for that volume group. The VGDA is also used in quorum
voting.
The VGDA contains information about what other disks make up the
volume group. This information is what allows the user to specify just one
of the disks in the volume group when using the "importvg" command to
import a volume group into an AIX system. importvg will go to that disk,
read the VGDA, find out what other disks (by PVID) make up the volume
group, and automatically import those disks into the system. The
information about neighboring disks can sometimes be useful in data
recovery. For the logical volumes that exist on a disk, the VGDA gives
information about each logical volume, so any time a change is made to the
status of a logical volume (creation, extension, or deletion), the VGDA on
that disk and on the others in the volume group must be updated.
The VGDA space that allows for 32 disks is a fixed size which is part of
the LVM design. Large disks require more management mapping space in
the VGDA, which shrinks the number and size of disks that can be added to
the existing volume group. When a disk is added to a volume group, not
only does the new disk get a copy of the updated VGDA, but, as mentioned
before, all existing drives in the volume group must be able to accept the
new, updated VGDA.
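A minimal sketch of moving a volume group this way (datavg and hdisk3 are assumed names):
    exportvg datavg              # on the old system: remove the VG definition from the ODM
    importvg -y datavg hdisk3    # on the new system: the VGDA on hdisk3 names the other disks by PVID
    varyonvg datavg              # activate the imported volume group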
VGSA description
The Volume Group Status Area (VGSA) records information on stale
partitions for mirroring.
The VGSA comprises 127 bytes (1016 bits), where each bit represents one
of the up to 1016 physical partitions that reside on each disk. The bits
of the VGSA are used as a quick bit-mask to determine which physical
partitions, if any, have become stale. This is only important in the case of
mirroring, where there exists more than one copy of a physical partition.
Stale partitions are flagged by the VGSA. Unlike the VGDA, the VGSAs
are specific only to the drives on which they exist. They do not contain
information about the status of partitions on other drives in the same
volume group. The VGSA is also used to determine which physical
partitions must undergo data resyncing when mirror copy resolution is
performed.
Big VGDA volume group design (BigVG), implemented in AIX 4.3.2
The original design of the VGDA and VGSA limits the number of disks that
can be added to a volume group to 32, and the total number of logical
volumes to 256 (including one reserved for LVM internal use). With the
proliferation of disk arrays, the need for increased capacity in a single
volume group is growing.
This section describes the requirements for the new big Volume Group
Descriptor Area and Volume Group Status Area, hereafter referred to as the
VGDA and VGSA.
Objectives
• Increase the maximum number of disks per VG from 32 to 128
• Increase the maximum number of logical volumes per VG to 512
• Provide a migration path from small VG to big VG
Changes in commands:
• mkvg
  • -B option is added to create big VGs.
  • -t If the t flag (factor value) is not used, the default limit of
    1016 physical partitions per physical volume is set. Using the
    factor value changes the physical partitions per disk to 1016 *
    factor and the total number of disks per VG to 64/factor. A big VG
    cannot be imported or activated on systems with pre-AIX 4.3.2 versions.
• chvg
  • -B option added to convert a small VG to a big VG. The -B flag can be
    used to convert the small VG to the big VG format. This operation
    expands the VGDA/VGSA to change the total number of disks that
    can be added to the volume group from 1-32 to 64. Once converted,
    these volume groups cannot be imported or activated on systems
    running pre-AIX 4.3.2 versions. If both the t and B flags are specified,
    the factor is updated first and then the VG is converted to the big VG
    format (sequential operation).
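A hedged sketch of the two paths to a big VG (volume group and disk names assumed):
    mkvg -B -t 2 -y bigvg hdisk4 hdisk5    # create a big VG; factor 2 allows 2032 PPs per PV
    chvg -B datavg                         # convert an existing standard VG to the big VG format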
LVM Flexibility
LVM offers great flexibility to the system administrator and users, such as:
• Real-time Volume Group and Logical Volume expansion/deletion
• Ability to customize data integrity check
• Use of Logical Volume under file system
• Use of Logical Volume as raw data storage
• User customized logical volumes
Real-time volume group and logical volume expansion/deletion
Typical UNIX operating systems have static file systems that require the
archiving, deletion, and recreation of larger file systems in order for an
existing file system to expand. LVM allows the user to add disks to the
system without bringing the system down and allows the real-time
expansion of the file system through the use of the logical volume. All file
systems exist on top of logical volumes. However, logical volumes can
exist without the presence of a file system. When a file system is created,
the system first creates a logical volume, then places the journaled file
system (jfs) "layer" on top of that logical volume. When a file system is
expanded, the logical volume associated with that file system is first
"grown", then the jfs is "stretched" to match the grown logical volume.
Ability to customize data integrity checks
The user has the ability to control which levels of data integrity checks are
placed in the LVM code in order to tune the system performance. The user
can change the mirror write consistency check, create mirroring, and
change the requirement for quorum in a volume group.
Use of logical volume under a file system
The logical volume is a logical-to-physical entity which allows the mapping
of data. The jfs maps files defined in its file system in its own logical way
and then translates file actions to a logical request. This logical request is
sent to the LVM device driver which converts this logical request into a
physical request. When the LVM device driver sends this physical request
to the disk device driver, it is further translated into another physical
mapping. At this level, LVM does not care about where the data is truly
located on the disk platter. But with this logical to physical abstraction, LVM
provides for the easy expansion of a file system, ease in mirroring data for
a file system, and the performance improvement of file access in certain
LVM configurations.
Use of logical volumes as raw data storage
As stated before, a logical volume can exist without a jfs file system on it
to hold data. Typically, database programs use the "raw" logical volume as a
data "device" or "disk". They use LVM logical volumes (rather than the raw
disk itself) because LVM allows them to control on which disks the data
resides, provides the flexibility to add disks and "grow" the logical
volume, and gives data integrity through the logical volume mirroring
capability.
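For example (a sketch with assumed names), a raw logical volume is created and the database is pointed at its character special file:
    mklv -y rawdata datavg 50    # 50 logical partitions, no file system placed on top
    ls -l /dev/rrawdata          # the raw (character) device the database opens directly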
User customized logical volumes
The user can create logical volumes, using a map file, that allows them
to specify the exact disk(s) the logical volume will inhabit and the exact
order on the disk(s) in which the logical volume will be created. This
ability allows the user to tune the creation of their logical volumes for
performance cases.
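A minimal sketch (disk names and partition numbers assumed); each map file line names a physical volume and a physical partition range:
    cat /tmp/testlv.map
    hdisk1:1-3
    hdisk2:10-12
    mklv -y testlv -m /tmp/testlv.map datavg 6    # the 6 LPs land exactly on the 6 listed PPs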
Write verify LVM setting
LVM provides the capability to request that an extra level of data
integrity be assured every time you write data to the disk. This ability is
known as write verify. The capability is set for each logical volume in a
volume group. When write verify is enabled, every write to a physical
partition of a disk that is part of the logical volume causes the disk
device driver to issue the Write and Verify SCSI command to the disk. This
means that after each write, the disk will reread the data and do an IOCC
parity check on the data to see if what the platter wrote exactly matches
what the write request buffer contained. This type of extra check
understandably adds more time to the completion of a write request,
but it adds to the integrity of the system.
Quorum checking for LVM volume groups
Quorum checking is the voting that goes on between disks in a volume
group to determine whether a majority of disks exists, forming a quorum
that allows the disks in a volume group to become and stay activated. LVM
runs many of its commands and strategies based on having the most current
copy of some data. Thus, it needs a method to compare data on two or more
disks and figure out which one contains the most current information. This
gives rise to the need for a quorum. If a quorum cannot be found during a
varyonvg command, the volume group will not vary on. Additionally, if a
disk dies during normal operation and the loss of the disk causes volume
group quorum to be lost, then the volume group will notify the user that it
is ceasing to allow any more disk i/o to the remaining disks and enforces
this by performing a self varyoffvg. However, the user can turn off this
quorum check and its actions by telling LVM that it always wants to vary on
or stay up regardless of the dependability of the system. Or, the user can
force the varyon of a volume group that doesn't have quorum. At that
point, the user is responsible for any strange behavior from that volume
group.
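As a sketch of the two overrides (volume group name assumed):
    chvg -Q n datavg      # disable quorum checking: the VG stays varied on after quorum loss
    varyonvg -f datavg    # force the varyon of a VG that cannot muster a quorum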
Data Integrity and LVM Mirroring
Mirroring, and parameters for mirroring
When discussing mirrors in LVM, it is easiest to refer to each copy,
regardless of when it was created, as a copy. The exception to this is when
one discusses sequential mirroring. In sequential mirroring, there is a
distinct PRIMARY copy and SECONDARY copies. However, the majority
of mirrors created on AIX systems are of the parallel type. In parallel
mode, there is no PRIMARY or SECONDARY mirror. All copies in a
mirrored set are simply referred to as copies, regardless of which one was
created first. Since the user can remove any copy from any disk at any
time, there can be no ordering of copies.
AIX allows up to three copies of a logical volume, and the copies may be in
sequential or parallel arrangements. Mirrors improve the data integrity of a
system by providing more than one source of identical data. With multiple
copies of a logical volume, if one copy cannot provide the data, one or two
secondary copies may be accessed to provide the desired data.
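For example (names assumed), copies can be requested at creation time or added later:
    mklv -y mirrlv -c 2 datavg 10    # create the LV with two copies (one mirror)
    mklvcopy datalv 2                # raise an existing LV to a total of two copies
    syncvg -l datalv                 # synchronize the newly added copies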
Staleness of mirrors
The idea of a mirror is to provide an alternate, physical copy of information.
If one of the copies has become unavailable, usually due to disk failure,
then we refer to that copy of the mirror as having gone "stale". Staleness is
determined by the LVM device driver when a request to the disk device
driver returns with a certain type of error. When this occurs, the LVM
device driver notes in the VGSA of a disk that a particular physical
partition on that disk is stale. This information prevents further reads or
writes from being issued to physical partitions defined as stale by the
VGSA of that disk. Additionally, when the disk once again becomes available
(suppose it had been turned off accidentally), the synchronization code
knows exactly which physical partitions must be updated, instead of
defaulting to an update of the entire disk. Certain high-level commands will
display the physical partitions and their stale condition so that the user
can see which disks may be experiencing a physical failure.
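Two of those displays, sketched with assumed names:
    lsvg -l datavg           # the LV STATE column reads open/stale when copies are stale
    lslv -p hdisk1 mirrlv    # partition map of hdisk1; stale partitions are printed as STALE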
Sequential mirroring
Sequential vs. parallel mirroring - what good is sequential mirroring?
Sequential mirroring is based on the concept of an order within mirrors. All
read and write requests first go through a PRIMARY copy which services
the request. If the request is a write, then the write request is propagated
sequentially to the SECONDARY drives. Once the secondary drives have
serviced the same write request, then the LVM device driver will consider
the write request complete.
Parallel mirroring
In Parallel mirroring, all copies are of equal ordering. Thus, when a read
request arrives to the LVM, there is no first or favorite copy that is
accessed for the read. A search is done on the request queues for the
drives which contain the mirror physical partition that is required. The drive
that has the fewest requests is picked as the disk drive which will service
the read request. On write requests, the LVM driver will broadcast to all
drives which have a copy of the physical partition that needs updating.
Only when all write requests return will the write be considered complete
and the write-complete message will be returned to the calling program.
Parallel mirroring write: the request is broadcast to every copy, and the
write completes only when all disks have acknowledged.
              +--> Disk 1 : write req ... write ack
   LVM DD ----+--> Disk 2 : write req ... write ack
              +--> Disk 3 : write req ... write ack
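The choice between the two schemes is made per logical volume with the mklv scheduling-policy option (a sketch; names assumed):
    mklv -y parlv -c 2 -d p datavg 10    # parallel scheduling (the default)
    mklv -y seqlv -c 2 -d s datavg 10    # sequential scheduling: primary first, then secondary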
Mirror Write Consistency Check
Mirror Write Consistency Check (MWCC) is a method of tracking the last
62 writes to a mirrored logical volume. If the AIX system crashes, upon
reboot the last 62 writes to mirrors are examined and one of the mirrors is
used as a "source" to synchronize the mirrors (based on the last 62 disk
locations that were written). This "source" is of importance to parallel
mirrored systems. In sequentially mirrored systems, the "source" is always
picked to be the Primary disk. If that disk fails to respond, the next disk in
the sequential ordering will be picked as the "source" copy. There is a
chance that the mirror picked as "source" to correct the other mirrors was
not the one that received the latest write before the system crashed. In
that case, a write that completed on one copy but was incomplete on another
mirror would be lost.
AIX does not guarantee that the absolute, latest write request completed
before a crash will be there after the system reboots. But, AIX will
guarantee that the parallel mirrors will be consistent with each other. If the
mirrors are consistent with each other, then the user will be able to realize
which writes were considered successful before the system crashed and
which writes will be retried. The point here is not data accuracy, but data
consistency. The use of the Primary mirror copy
as the source disk is the basic reason that sequential mirroring is offered.
Not only is data consistency guaranteed with MWCC, but the use of the
Primary mirror as the source disk increases the chance that all the copies
have the latest write that occurred before the mirrored system crashed.
Ability to detect and correct stale mirror copies
The Volume Group Status Area (VGSA) tracks the status of 1016 physical
partitions per disk per volume group. During a read or write, if the LVM
device driver detects that there was a failure in fulfilling a request, the
VGSA will note the physical partition(s) that failed and mark that
partition(s) "stale". When a partition is marked stale, this is logged by AIX
error logging and the LVM device driver will know not to send further
partition data requests to that stale partition. This saves wasted time in
sending i/o requests to a partition that most likely will not respond. And
when this physical problem is corrected, the VGSA will tell the mirror
synchronization code which partitions need to be updated to have the
mirrors contain the same data.
LVM Striping
Striping and parameters for striping
Disk striping is the concept of spreading sequential data across more than
one disk to improve disk i/o. The theory is that if you have data that is
close together, and if you can divide the request into more than one disk
i/o, you will reduce the time it takes to get the entire piece of data. This
must be done so it is transparent to the user. The user doesn’t know which
pieces of the data reside on which disk and does not see the data until all
the disk i/o has completed (in the case of a read) and the data has been
reassembled for the user. Since LVM has the concept of a logical to
physical mapping already built into its design, the concept of disk striping
is an easy evolution. Striping is broken down into the "width" of a stripe and
the "stripe length". The width is how many disks the sequential data should
lay across. The stripe length is how many sequential bytes reside on one
disk before the data jumps to another disk to continue the sequential
information path.
Striping example
We present an example to show the benefit of striping. A piece of data
stored on the disk is 100 bytes. The physical cache of the system is only
25 bytes. Thus, it takes 4 read requests to the same disk to complete the
reading of 100 bytes. Since the data is on the same disk, four sequential
reads are required:
hdisk0: First read - bytes 0-24
hdisk0: Second read - bytes 25-49
hdisk0: Third read - bytes 50-74
hdisk0: Fourth read - bytes 75-99
If this logical volume were created with a stripe width of 4 (how many
disks) and a stripe size of 25 (how many consecutive bytes before going to
the next disk), then you would see:
hdisk0: First read - bytes 0-24
hdisk1: Second read - bytes 25-49
hdisk2: Third read - bytes 50-74
hdisk3: Fourth read - bytes 75-99
Now each disk requires only one read request, and the time to gather all
100 bytes has been reduced 4-fold. However, there is still the bottleneck
of having the four independent data disks channel through one adapter
card. This can be remedied with the expensive option of putting each disk
on an independent adapter card. Note the side effect of striping: the user
has now lost the use of 3 disks that could have been used for other volume
groups.
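Such a logical volume could be created as follows (a sketch; names assumed, with a 64 KB stripe across four disks):
    mklv -y stripelv -S 64K datavg 8 hdisk1 hdisk2 hdisk3 hdisk4
    # stripe width = the 4 disks listed; stripe length = 64 KB on one disk before moving on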
LVM Performance
Performance with disk mirroring
Disk mirroring can improve the read performance of a system, but at a cost
to the write performance. Of the two mirroring strategies, parallel and
sequential, parallel is the better of the two in terms of disk i/o. In parallel
mirroring, when a read request is received, the lvm device driver looks at
the queued requests (read and write) and finds the disk with the least
number of requests waiting to execute. This is a change from AIX 3.2,
where a complex algorithm tried to approximate the disk that would be
"closest" to the required data (regardless of how many jobs it had queued
up). In AIX 4.1, it was decided that this complex algorithm did not
significantly improve the i/o behavior of mirroring and so the complex logic
was scrapped. The user can see how this new strategy of finding the
shortest wait line would improve the read time. And with mirroring, two
independent requests to two different locations can be issued at the same
time without causing disk contention, because the requests will be issued
to two independent disks. However, along with the improvement to read
requests brought by disk mirroring and its multiple identical sources for
reads, the LVM disk driver must now perform more writes in order to
complete a write request. With mirroring, all disks that make up a mirror
are issued write commands, which each disk must complete, before the LVM
device driver considers the write request complete.
Changeable parameters that affect LVM performance
There are a few parameters that the user can change per logical volume
which will affect the performance of the logical volume in terms of data
access efficiency.
From experience however, many people have different views of how to
achieve that efficiently, so there can’t be a specific "right" recommendation
given in these notes.
Inter-policy - This comes in two variations, min and max. The two choices
tell LVM how the user wishes the logical volume to be spread over the
disks in the volume group. min tells LVM that the logical volume should be
spread over as few disks as possible. The max policy directs LVM to spread
the logical volume over as many disks as are defined in the volume group,
limited by the "Upper Bound" variable. Some users try to use this variation
to form a cheap version of disk striping on systems before AIX 4.1.
However, it must be stated that the inter-policy is a "recommendation" to
the allocp binary (the partition allocation routine), not a strict
requirement. In certain cases, depending on what is free on a disk, these
allocation policies may not be achievable.
Intra-policy - There are five regions on a disk platter defined by the
intra-policy: edge, inner-edge, middle, inner-middle, and center. This
policy tells LVM the preferred location of the logical volume on the disk
platter. Depending on the value also provided for the inter-policy, this
preference may or may not be satisfied by LVM. Many users have different
ideas as to which portion of the disk is considered the "best", so no
recommendation is given in these notes.
Mirror write consistency check - As mentioned before, the mirror write
consistency check tracks the last 62 distinct writes to physical partitions.
If the user turns this off, they will shorten (although slightly) the path
length involved in a disk write. However, the trade-off may be inconsistent
mirrors if the system crashes during a write call.
Write verify - This is turned off by default when a logical volume is
created. If this value is turned on for a logical volume, additional time
accumulates during writes as the IOCC check is performed for each
write to the disk platter.
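The four parameters combined into one sketch (logical volume and VG names assumed):
    mklv -y fastlv -e x -a c datavg 10    # inter-policy max (-e x), intra-policy center (-a c)
    chlv -w n fastlv                      # turn mirror write consistency checking off
    chlv -v y fastlv                      # turn write verify on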
Physical connections
Mirroring on different disks - The default for disk mirroring is that the
copies should exist on different disks. This is for performance as well as
data integrity. With copies residing on different disks, if one disk is
extremely busy, then a read request can be completed by the other copy
residing on a less busy disk. Although it might seem the cost would be the
same for writes, the section "Command tag queuing" shows that writing to two
copies on the same disk is worse than writing to two copies on separate
disks.
Mirroring across different adapters - Another method to improve disk
throughput is to mirror the copies across adapters. This gives you a better
chance not only of finding a copy on a disk that is least busy, but also of
finding an adapter that is not as busy. LVM does not realize, nor care,
whether the two disks reside on the same adapter. If the copies are on the
same adapter, there is still the bottleneck of getting your data through
the flow of other data coming from other devices sharing the same adapter
card. With multiple adapters, the throughput through the adapter channel
should improve.
Command tag queuing - This is a feature found only on scsi-2 devices. In
scsi-1, an adapter may get many requests but will only send out one
command at a time. Thus, if the scsi device driver receives three requests
for i/o, it will buffer the last two requests until the response to the
first one is received. It then picks the next one in line and issues that
command. Thus, the target device only receives one command at a time. With
command tag queuing on scsi-2 devices, multiple commands may be sent to the
same device at once. The two device drivers (disk and scsi adapter) are
capable of determining which command returned and what to do with that
command. Thus, disk i/o throughput can be improved.
Physical placement of logical partitions
One important ability of LVM is to let the user dictate where on the disk
platter the logical volume should be placed. This is done with the map file
that can be used with the "mklv" and "mklvcopy" commands. The map file
allows the user to assign a distinct physical partition number to a
distinct logical partition number. Thus, people with different theories on
the optimal layout for data partitions can customize their systems
according to their personal preferences.
Performance considerations with disk striping
Disk striping was introduced in AIX 4.1. It is another name for the RAID 0
implementation in software. This functionality is based on the assumption
that large amounts of data can be retrieved more efficiently if the request
is broken up into smaller requests given to multiple disks. And if the
multiple disks are on multiple adapters, the theory works even better, as
mentioned in the previous sections on mirroring across different disks and
adapters. In the previous sections, we described the efficiency gained for
mirrors. In this case, the same efficiency is gained with data across disks
and adapters, but without mirroring. Thus there is a savings in the write
case, as compared to mirrors. But there is a slight loss in the read case,
as compared to mirrors, because there is no longer more than one copy to
read from if one disk is busier than another.
Performance summary
To sum up the previously mentioned ideas about mirroring: if you have a
system that is mainly used for reads, mirroring gives you an advantage
because there is more than one copy of the same data available to satisfy a
read request. The only downfall is that if you require just as many writes
as reads, the system must wait for all the mirror writes to complete before
the single write command is considered complete. Additionally, there are
two types of mirroring, parallel and sequential. Parallel is the more
efficient of the two and is the default mirroring option unless otherwise
specified by the user. In parallel mirroring, the "best" disk is chosen for
a read request, and all write requests are issued independently to each
disk that holds a copy of the data. In sequential mirroring, the same disk
is always used as the first disk to be read. Thus, all reads are guaranteed
to be issued to the "primary" disk (there is no "primary" in parallel
mirroring), and the writes must complete in sequential order before the
write is considered complete.
Physical disk layout Power
AIX 4.3.3 and AIX 5 IDs
This section explores the physical disk layout on the Power platform.
There are three identifiers commonly used within LVM: the Physical Volume
Identifier (PVID), the Volume Group Identifier (VGID), and the Logical
Volume Identifier (LVID). The last two, VGID and LVID, are closely tied.
The LVID is simply a dot "." and a minor number appended to the end of the
VGID. The VGID is a combination of the machine's unique processor serial
number (uname -a) and the date the volume group was created.
The implementation of LVM has always assumed that the VGID of a system was
made up of two 32-bit words. Throughout the code, however, the VGID/LVID is
represented with the system data type struct unique_id, which is made up of
four 32-bit words. The LVM library, driver, and commands have always
assumed or enforced the notion that the last two words, words 3 and 4 of
this structure, are zeroes.
AIX 5 is now changed such that all four 32-bit words are used, for a total
of 128 bits or 32 hex digits. The most significant 32 bits are copied from
the processor ID and the remaining 96 bits are the millisecond time stamp
at creation time.
AIX 4.3.3 ID layout:
  PVID: bytes 1-8 (8 bytes)
  VGID: bytes 1-8 (8 bytes); the leading bytes hold the processor ID
        (0 0 0 9 0 2 7 7 in the example below)
  LVID: bytes 1-9; the 8-byte VGID followed by "." and the minor number
AIX 5 ID layout:
  PVID: bytes 1-16 (16 bytes)
  VGID: bytes 1-16 (16 bytes)
  LVID: bytes 1-17; the 16-byte VGID followed by "." and the minor number
Example IDs from AIX 4 and AIX 5L systems, showing how IDs are constructed
from the processor ID
The processor ID is 64 bits in AIX 5; the uname function cuts out bits 33
to 47, so that the result is the first word plus the last 16 bits of the
last word. The LVID and VGID combine the 64-bit processor ID and a 64-bit
time stamp to form an ID. PVIDs are made of the 32-bit processor ID and
bits from the time stamp.
Example from an AIX 5 Power system
PVID hdisk0:  00071483229d06620000000000000000
PVID hdisk1:  00071483b50bbaee0000000000000000
LVID hd1:     0007148300004c00000000e19f7c5aa3.8
LVID hd2:     0007148300004c00000000e19f7c5aa3.5
LVID hd3:     0007148300004c00000000e19f7c5aa3.7
LVID hd4:     0007148300004c00000000e19f7c5aa3.4
VGID rootvg:  0007148300004c00000000e19f7c5aa3
VGID testvg:  0007148300004c00000000e1b50bc8ec
uname -a:     000714834C00
In an AIX 4 system, all the IDs are made of the most significant 32 bits of
the processor ID and a 32-bit time stamp.
Example from an AIX 4.3.3 Power system
PVID hdisk0:  0009027724fdbd9f
PVID hdisk1:  0009027779fe61c6
LVID hd1:     0009027724fdc36d.8
LVID hd2:     0009027724fdc36d.5
LVID hd3:     0009027724fdc36d.7
LVID hd4:     0009027724fdc36d.4
VGID rootvg:  0009027724fdc36d
VGID datavg:  000902771db64c28
uname -a:     000902774C00
Physical volume with a logical volume testlv defined
The following example shows a disk dump from sector 0 on a Power system.
"Uninitialized" marks data not written by the LVM; sections holding 00's or
uninitialized data are cut out for clarity. The IDs are those listed in the
previous section.
000000 ¦ C9 C2 D4 C1 00 00 00 00 00 00 00 00 00 00 00 00
000010 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000070 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000080 ¦ 00 07 14 83 B5 0B BA EE 00 00 00 00 00 00 00 00 - PVID hdisk1
000090 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0001F0 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000200 ¦ -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
Uninitialized
000400 ¦ 39 C7 F2 9F 14 87 93 46 00 00 00 00 00 00 00 00
000410 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0005E0 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[Remainder of the dump omitted; its annotations identify the bad block
("DEFECT") directories, the struct lvm_rec (defined in lvmrec.h) beginning
with the "_LVM" identifier and holding the VGID of testvg, and the VGSA and
VGDA copies with their begin and end time stamps.]
[Dump continued: the second copies of the VGSA and VGDA, with matching time
stamps and again the VGID of testvg.]
21A5F0 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ¦......
21A600 ¦ 74 65 73 74 6C 76 00 00 00 00 00 00 00 00 00 00 ¦testlv
21A610 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ¦......
[Dump continued: the logical volume control block (LVCB) of testlv, holding
the string "AIX LVCB", the file system type (jfs), the logical volume name
testlv, and creation time stamps (Tue Sep ...).]
lvm_rec structure from the file /usr/include/lvmrec.h
The structure lvm_rec is used by the LVM routines to define the disk layout:
struct lvm_rec {                    /* describes the physical volume LVM record */
        __long32_t lvm_id;          /* LVM id field which identifies whether the
                                       PV is a member of a volume group */
#define LVM_LVMID 0x5F4C564D        /* LVM id field of ASCII "_LVM" */
        struct unique_id vg_id;     /* id of the volume group to which this
                                       physical volume belongs */
        __long32_t lvmarea_len;     /* length of the LVM reserved area */
        __long32_t vgda_len;        /* length of the volume group descriptor area */
        daddr32_t  vgda_psn[2];     /* physical sector numbers of the beginning
                                       of the VGDA copies on this disk */
        daddr32_t  reloc_psn;       /* physical sector number of the beginning of
                                       a pool of blocks (located at the end of the
                                       PV) reserved for the relocation of bad blocks */
        __long32_t reloc_len;       /* length in sectors of the bad block
                                       relocation pool */
        short int  pv_num;          /* physical volume number within the volume
                                       group of this physical volume */
        short int  pp_size;         /* partition size in bytes expressed as a power
                                       of 2 (the partition size is 2 to the power
                                       pp_size) */
        __long32_t vgsa_len;        /* length of the volume group status area */
        daddr32_t  vgsa_psn[2];     /* physical sector numbers of the beginning
                                       of the VGSA copies on this disk */
        short int  version;         /* version number of this volume group
                                       descriptor and status area */
        short int  vg_type;
        int        ltg_shift;
        char       res1[444];       /* reserved area */
};
Using the string "_LVM" we can locate the above structure in the previous
disk dump and assign values to the variables of struct lvm_rec:
Variable                          Value
#define LVM_LVMID                 0x5F4C564D
struct unique_id vg_id            0007148300004C00000000E1B50BC8EC
__long32_t lvmarea_len            00001074
__long32_t vgda_len               00000832
daddr32_t vgda_psn[2]             00000088  000008C2
daddr32_t reloc_psn               00867C2D
__long32_t reloc_len              00000100
short int pv_num                  0001
short int pp_size                 0018
__long32_t vgsa_len               00000008
daddr32_t vgsa_psn[2]             00000080  000008BA
int ltg_shift                     0001
char res1[444]                    Uninitialized
VGSA structure
struct vgsa_area {
#ifdef _KERNEL
        struct timestruc32_t b_tmstamp;   /* Beginning time stamp */
#else
        struct timestruc_t   b_tmstamp;
#endif
        uint   pv_missing[(MAXPVS + (NBPI - 1)) / NBPI];   /* Bit per PV */
        uchar  stalepp[MAXPVS][VGSA_BT_PV];                /* Stale PP bits */
        short  factor;                    /* for pvs with > 1016 pps */
        char   pad2[10];                  /* Padding */
#ifdef _KERNEL
        struct timestruc32_t e_tmstamp;   /* Ending time stamp */
#else
        struct timestruc_t   e_tmstamp;
#endif
};

struct big_vgsa_area {
#ifdef _KERNEL
        struct timestruc32_t b_tmstamp;   /* Beginning time stamp */
#else
        struct timestruc_t   b_tmstamp;
#endif
        char   b_tmbuf64bit[24];
        uint   pv_missing[(MAX_EVER_PV + (NBPI - 1)) / NBPI];   /* Bit per PV */
        uchar  stalepp[MAX_EVER_PV][VGSA_BT_PV];                /* Stale PP bits */
        short  factor;                    /* for pvs with > 1016 pps */
        short  version;                   /* vgsa version */
        char   valid[4];                  /* Validity string "LVM" */
        char   pad2[824];                 /* Padding */
        char   e_tmbuf64bit[24];
#ifdef _KERNEL
        struct timestruc32_t e_tmstamp;   /* Ending time stamp */
#else
        struct timestruc_t   e_tmstamp;
#endif
};
Physical disk layout IA-64
Introduction to AIX 5L on IA-64 and EFI partitioned disks
IA-64 systems have a different design than Power systems; some, if not all,
IA-64 systems will use the Extensible Firmware Interface (EFI). EFI defines
a new disk partitioning scheme to replace the legacy DOS partitioning
support.
When booting from a disk device, the EFI firmware utilizes one or more
system partitions containing an EFI file system (FAT32) to locate EFI
applications and drivers, including the OS boot loader. These applications
and drivers provide ways to extend firmware or provide the operating
system with assistance during boot time or runtime. In addition, it is
expected that operating systems will define partitions unique to the
operating system. EFI applications will also have the capability to display
and potentially create additional partitions before the OS is booted.
AIX traditionally has not supported partitioned disks, because AIX was the
only OS running on RS/6000 systems. Therefore the entire disk is defined by
an hdisk ODM object and a /dev/hdiskn special file, with a single major and
minor number assigned to the physical disk. In AIX 4.3.3, when a disk
becomes a physical volume (is given a PVID), an old-style MBR (master boot
record), renamed the IPL control block, which contains the PVID, is written
into the first sector of the disk.
The overall design for disk partitioning on AIX 5L on IA-64 is to introduce
disk partitioning at the disk driver level. An hdisk ODM object will still refer
to the physical disk, however multiple special files will be created and
associated with the partitions on the disk. Besides the EFI system
partitions, AIX 5L on IA-64 disk configure method will recognize IA-64
physical volume partitions.
AIX 5L on IA-64 supports a maximum of 4 partitions; of these, one partition
can be a physical volume partition, and the other partitions are EFI system
partitions. Therefore only one AIX PV, and hence one volume group, can be
defined per physical disk.
A new command, efdisk, acts as a partition manager.
Special files will be created for the following partition types:
• Entire physical disk n access (used by efdisk): /dev/hdiskn_all
• System partition index y on physical disk n: /dev/hdiskn_sy
• Physical volume partition on physical disk n: /dev/hdiskn
• Unknown partition index x on physical disk n: /dev/hdiskn_px
Creating new partitions on an IA-64 system
AIX 5L on IA-64 will partition disks under the following circumstances:
• Under the direction of the user/administrator via the efdisk command
• During BOS install, after the designation of a "boot" disk (install targets)
• When adding a disk that is not yet a physical volume to a VG
• Under the direction of the "chdev -l hdiskx -a pv=yes" command
The disk system after a default installation
After installing AIX 5L on a system with one disk, the physical drive and the
/dev special files can be listed.
lsdev -Cc disk
hdisk0 Available 00-19-10 Other IDE Disk Drive
/dev/hdisk0      - hdisk0, AIX 5L PV
/dev/hdisk0_all  - the entire disk starting at block 0
/dev/hdisk0_s0   - EFI system partition 0 on disk 0
The EFI system partition holds hardware information and EFI firmware data.
The partition is DOS formatted and can be accessed through DOS utilities,
as in the example.
5L-IA64:/tmp> dosdir -D/dev/hdisk0_s0
A.OUT
BOOT.EFI
Free space: 33155072 bytes
Creating partitions with efdisk
After creating four partitions we can list the start block number and length
with the efdisk command.
------------------------------------------------------
Partition Index:    0
Partition Type:     Physical Volume
StartingLBA:        1        (0x1)
Number of blocks:   819200   (0xc8000)

Partition Index:    1
Partition Type:     System Partition
StartingLBA:        819201   (0xc8001)
Number of blocks:   409600   (0x64000)

Partition Index:    2
Partition Type:     System Partition
StartingLBA:        1228801  (0x12c001)
Number of blocks:   614400   (0x96000)

Partition Index:    3
Partition Type:     System Partition
StartingLBA:        1843201  (0x1c2001)
Number of blocks:   614400   (0x96000)
Disk layout on IA-64 systems
The following disk dump lists the data in hex format; the six leftmost
digits are the byte offset from the physical start of the disk, and each
line lists 16 bytes. The data was read on an IBM Power system with the same
utility as the previous examples; where byte swapping is mentioned, it is
relative to what the data would have been on a disk connected to an AIX
Power system.
000000 ¦ C1 D4 C2 C9 00 00 00 00 00 00 00 00 00 00 00 00 - AMBI in ebcdic = IBMA byte swapped
000010 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0001C0 ¦ FF FF 09 FF FF FF 01 00 00 00 00 80 0C 00 00 FF - start LBA = 0x1,      length = 0xc8000
0001D0 ¦ FF FF EF FF FF FF 01 80 0C 00 00 40 06 00 00 FF - start LBA = 0x0c8001, length = 0x064000
0001E0 ¦ FF FF EF FF FF FF 01 C0 12 00 00 60 09 00 00 FF - start LBA = 0x12c001, length = 0x096000
0001F0 ¦ FF FF EF FF FF FF 01 20 1C 00 00 60 09 00 55 AA - start LBA = 0x1c2001, length = 0x096000
000200 ¦ C1 D4 C2 C9 00 00 00 00 00 00 00 00 00 00 00 00 - AMBI in ebcdic = IBMA byte swapped
000210 ¦ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 - 0x200 = start LBA 1 = first part.
[Dump continued: "MVL_" = "_LVM" byte swapped - the lvm_rec struct, offset
by 0x200 compared to the lvm struct on Power, since the data in the
partition is placed as on a PV.]
[Dump continued: the "DEFECT" defect lists (offset by 0x200 compared to
Power); the VGSA begin and end time stamps; the VGDA start time stamp; and
the VGID of the IA-64 volume group.]
[For reference information, the PVID, LVID, and VGID of the IA-64 example
system were listed here: PVID hdisk..., LVID lv..., LVID lv..., VGID ...vg;
the hex values are not recoverable in this draft.]
LVM Passive Mirror Write Consistency
AIX 5L Passive Mirror Write Consistency
The previous Mirror Write Consistency Check (MWCC) algorithm has been
in place since AIX 3.1. This original design has served the Logical Volume
Manager (LVM) well, but has always slowed the performance of mirrored
logical volumes that performed massive and varied writes. A new design is
implemented in AIX 5 to supplement the original MWCC design.
AIX 4 MWCC algorithm
The AIX 4 MWCC method uses a table called the mwc table. This table is
kept in memory as well as on the disk platter. The table has 62 entries,
tracking the last 62 distinct Logical Track Group (LTG) writes. An LTG is
128 kilobytes. The mwc table is only concerned with writes, not reads. The
algorithm can be expressed in pseudo-code:
if (action is a write)
{
    if (LTG to be written is already in the mwc table array in memory)
    {
        proceed and issue the write to the mirrors
        wait until all mirrored writes complete
        return to calling process
    }
    else
    {
        update the mwc table with this latest LTG number, overwriting the
        oldest LTG entry in the mwc table (in memory); write the memory
        mwc table to the edge of the platter of all disks in the volume group
        wait for the mwc table writes to complete - when the mwc table write
        of the disk that holds the LTG in question returns, the mwc table
        write is considered complete
        issue the parallel mirror writes to all the mirrors
        wait until all mirrored writes complete and return to calling process
    }
}
else
    process the read
MWCC usage for recovery
The reason for having mwcc is recovery from a crash while i/o is proceeding
on a mirrored logical volume. By implication, this means that mwcc is
ignored for non-mirrored logical volumes. A key phrase is data "in flight",
which means that a write has been issued to a disk and the write order has
not come back from the disk with a confirmation that the action is
complete. Thus, there is no certainty that the data did in fact get written
to the disk. mwcc tracks the last 62 write orders so that upon reboot, this
table can be used to rewrite the last 62 mirror writes. It is more than
likely that all the writes finished before the system crash; however, LVM
goes ahead and goes to each of the 62 distinct LTGs, reads one copy of the
mirror, and writes it to the other mirror(s) that exist. Note that mwcc
does not guarantee that the absolute latest write is made available to the
user. mwcc just guarantees that the images on the mirrors are consistent
(identical).
AIX 4 MWCC performance implications
The current mwcc algorithm has a penalty for heavily random writes. There
is a performance sag associated with doing an extra write for each write
you perform. A good example, taken from a customer, is a mail server that
had mirrored accounts. Thousands of users were constantly writing or
deleting files from their mail accounts. Thus, the LTG counter was
constantly being changed and written to disk. In addition to that overhead,
if the mwcc table has been dispatched to be written, new requests that
come into the LVM work queue are held until the mwcc table write returns
so that it can be updated and once more sent down to the disk platters to
be updated.
Current AIX 4 MWCC workaround
Currently, the only way customers can work around the performance penalty
associated with mwcc is to turn the functionality off. But in order to
ensure data consistency, they must do a syncvg -f <vgname> immediately
after a system crash and reboot to synchronize the data.
Since there is no mwcc table on the platter, there is no way to determine
which LTGs need resyncing; thus a forced resync of ALL partitions is
required. Omitting this synchronization may leave inconsistent data.
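The workaround expressed as commands (a sketch; LV and VG names assumed):
    chlv -w n datalv       # turn MWCC off for the mirrored logical volume
    syncvg -f -v datavg    # after every crash and reboot: forced resync of all partitions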
AIX 5 LVM Passive Mirror Write Consistency Check
The MWCC implementation in AIX 5 provides a new passive algorithm, but
only for big VGs. The reason for this is that space is needed for a dirty
flag for each logical volume, and only the VGSA of a big VG provides this
space.
AIX 5 Passive MWCC algorithm
The new MWCC algorithm sets a flag when the mirrored LV is opened in
read-write mode, and the flag is not cleared until the last close of the
device. The flag is then examined during subsequent boots. The algorithm
implemented is:
1. The user opens a mirrored logical volume.
2. The lvm driver marks a bit in the VGSA which states that, for purposes
   of passive mwcc, the lv is "dirty".
3. Reads and writes occur to the mirrored lv with no (traditional) mwcc
   table writes.
4. The machine crashes.
5. Upon reboot, the volume group automatically varies on. As part of this
   varyonvg, checks are made to see if dirty bits exist for each lv.
6. For each logical volume that is dirty, a "syncvg -f -l <lvname>" is
   performed, regardless of whether or not the user wants to do this.
Advantage:
The behavior of a mirrored write will be the same as that of a mirrored
logical volume with no mwcc. Since crashes are very rare, the need for an
mwcc resync is negligible. Thus, a mostly unnecessary write (the mwc table
update) is avoided.
Disadvantage:
After a crash, the entire logical volume is considered dirty, although only
a few blocks may have changed. Until all the partitions have been resynced,
the logical volume will always be considered dirty while it is open.
Additionally, reads will be a bit slower, as a read-then-sync operation
must be performed.
Commands affected by the Passive MWCC algorithm
The varyonvg command will inform the user that a background forced sync
may be occurring with passive MWCC recovery.
The syncvg command will inform the user that a non-forced sync on a logical
volume with passive MWCC will result in a forced background sync.
The lslv command has been altered so that the output shows whether passive
MWCC is set and active.
To set passive sync
• mklv -w p = Use Passive MWCC algorithm
• chlv -w p = Use Passive MWCC algorithm
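A sketch of creating and inspecting such a logical volume (names assumed; the VG must be a big VG):
    mklv -y passlv -c 2 -w p bigvg 10    # mirrored LV using the passive MWCC algorithm
    lslv passlv                          # the MIRROR WRITE CONSISTENCY field reports the policy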
Changes in kernel extensions due to Passive MWCC
Three functions are changed hd_open, hd_close, and hd_ioctl:
hd_open: if the logical volume being opened is part of a big VG, is being opened for write, is mirrored, and the mwcc policy is passive, the lv_dirty_bit representing the logical volume minor number is marked as dirty. This bit may be set multiple times, as multiple opens result in multiple visits to hd_open.
hd_close: this function is called only when a logical volume is being closed for the last time. When this occurs, the function checks that the logical volume is part of a big VG, has more than one copy, has the mwcc policy set to passive, and does not have its passive_mwcc_recover flag set. If all these conditions are true, the lv_dirty_bit of the logical volume is cleared and the logical volume mirrors are considered 100% consistent with each other.
hd_ioctl: this will return additional status and tell the user whether the logical volume is currently marked as needing to undergo, or is actually undergoing, passive mwcc recovery (all reads result in a resync of the mirrors).
The function hd_mirread is called upon the completion, successful or otherwise, of a read of a mirrored logical volume. On entering this function, if the passive_mwcc_recover flag is set, the function will find the other viable mirrors that were not read and copy the contents of the just-read mirror into them, by first setting the mirrors to avoid with the pb_mirbad variable and then calling the function hd_fixup.
The function hd_kdeflvs, which is called at varyonvg time, looks to see whether the volume group is mirrored, has the mwcc policy set to passive, and is a big volume group. If it is, it checks the lv_dirty_bit of each logical volume in the VGSA. If the bit is set, the driver notes that it is going to be in passive mwcc recovery state by setting the passive_mwcc_recover flag to true.
hd_kextend has been changed to work properly with the new LV_ACTIVE_MWC definition.
Changes in hdpin.exp
The call hd_sa_update is exported so that hd_top can update the VGSA with the modified lv_dirty_bit as a result of hd_open or hd_close.
AIX 5 LVM Hot Spare Disk in a Volume Group
AIX 5 Hot Spare Disk function
• Automatic migration of failed disks for mirrored LVs
• Ability to create a spare disk pool for a VG
The hot spare function applies only to mirrored LVs; non-mirrored LVs on a failing disk cannot be recovered, so no attempt is made.
AIX 5 Hot Spare disk chpv command
chpv [-h Hotspare] ... existing flags ... PhysicalVolume
-h Hotspare
Sets the sparing characteristics of the physical volume specified by the PhysicalVolume parameter, controlling whether it can be used as a hot spare and whether physical partitions may be allocated on it. This flag has no meaning for non-mirrored logical volumes. The Hotspare variable can be either:
• y - Marks the disk as a hot spare disk within the VG it belongs to and prohibits the allocation of physical partitions on the physical volume. The disk must not have any partitions allocated to logical volumes to be successfully marked as a hot spare disk.
• n - Removes the disk from the hot spare pool for the volume group in which it resides and allows allocation of physical partitions on the physical volume.
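As an illustration (the disk name is a placeholder), an empty disk is added to and later removed from the hot spare pool with:
chpv -h y hdisk3
chpv -h n hdisk3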
AIX 5 Hot Spare disk chvg command
chvg [-s Sync] [-h Hotspare] ... existing flags ... VolumeGroup
-h Hotspare
Sets the sparing characteristics for the volume group specified by the VolumeGroup parameter: it either allows (y or Y) or prohibits (n) the automatic migration of failed disks. This flag has no meaning for non-mirrored logical volumes.
• y - Allows the automatic migration of failed disks, using one-for-one migration of partitions from one failed disk to one spare disk. The smallest disk in the volume group spare pool that is big enough for one-to-one migration will be used.
• Y - Allows the automatic migration of failed disks, potentially using the entire pool of spare disks as migration targets, as opposed to a one-for-one migration of partitions to a single spare.
• n - Prohibits the automatic migration of failed disks. This is the default value for a volume group.
• r - Removes all disks from the hot spare pool for the volume group.
-s Sync
Sets the synchronization characteristics for the volume group specified by the VolumeGroup parameter: it either allows (y) or prohibits (n) the automatic synchronization of stale partitions. This flag has no meaning for non-mirrored logical volumes.
• y - Attempts to automatically synchronize stale partitions.
• n - Prohibits automatic synchronization of stale partitions. This is the default for a volume group.
• lsvg -p will show the status of all physical volumes in the VG.
• lsvg will show the current state of sparing and synchronization.
• lspv will show whether a disk is a spare.
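For example (the volume group name is illustrative), one-for-one sparing together with automatic synchronization of migrated partitions would be enabled with:
chvg -h y -s y datavg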
LVM Hot spot management
AIX 5 LVM Hot Spot Management
Provides tools to determine which logical partitions have high I/O traffic and to allow the migration of those logical partitions to other disks. The benefits of this facility are:
• Improved performance, by eliminating hot spots.
• The ability to migrate selected logical partitions for maintenance.
LVM Hot spot data collection
lvmstat { -l | -v } Name [ -e | -d ] [ -F ] [ -C ] [ -c Count ] [ -s ] [ Interval [ Iterations ] ]
The lvmstat command generates reports that can be used to change the logical volume configuration to better balance the input/output load between physical disks. By default, statistics collection is not enabled in the system. You must use the -e flag to enable this feature for the logical volume or volume group in question. Enabling statistics collection for a volume group enables it for all the logical volumes in that volume group.
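For example (names illustrative), collection is switched on for every logical volume in rootvg, or for a single logical volume, with:
lvmstat -v rootvg -e
lvmstat -l hd3 -e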
The first report generated by lvmstat provides statistics concerning the
time since the system was booted. Each subsequent report covers the
time since the previous report. All statistics are reported each time lvmstat
runs. The report consists of a header row followed by a line of statistics for
each logical partition or logical volume depending on the flags specified.
Flags
• -c Count - Prints only the specified number of lines of statistics.
• -C - Causes the counters that keep track of the iocnt, Kb_read and Kb_wrtn to be cleared for the specified logical volume/volume group.
• -d - Specifies that statistics collection should be disabled for the logical volume/volume group in question.
• -e - Specifies that statistics collection should be enabled for the logical volume/volume group in question.
• -F - Causes the statistics to be printed colon-separated.
• -l - Specifies that the Name given is the name of a logical volume.
• -s - Suppresses the header from the subsequent reports when Interval is used.
• -v - Specifies that the Name given is the name of a volume group.
LVM Hot Spot lists
The lvmstat command is useful for determining whether a physical volume is becoming a performance bottleneck, by identifying the busiest physical partitions of a logical volume.
The lvmstat command generates two types of reports: per-logical-partition statistics within a logical volume, and per-logical-volume statistics within a volume group. The reports have the following format:
# lvmstat -l hd3
Log_part  mirror#  iocnt  Kb_read  Kb_wrtn  Kbps
1         1        0      0        0        0.00
2         1        0      0        0        0.00
3         1        0      0        0        0.00

# lvmstat -v rootvg
Logical Volume  iocnt  Kb_read  Kb_wrtn  Kbps
hd2             1592   5620     880      0.05
hd9var          71     32       28       0.00
hd8             71     0        284      0.00
hd4             13     8        60       0.00
hd1             11     1        21       0.00
Migrating Hot Spots
migratelp LVname/LPartnumber[ /Copynumber ] DestPV[/PPartNumber]
The migratelp command moves the specified logical partition LPartnumber of the logical volume LVname to the DestPV physical volume. If the destination physical partition PPartNumber is specified, it is used; otherwise a destination partition is selected using the intra-region policy of the logical volume. By default, the first mirror copy of the logical partition in question is migrated. A value of 1, 2 or 3 can be specified for Copynumber to migrate a particular mirror copy.
The migratelp command cannot migrate partitions of striped logical volumes.
Examples
To move the first logical partition of logical volume lv00 to hdisk1, type:
migratelp lv00/1 hdisk1
To move the second mirror copy of the third logical partition of logical volume hd2 to hdisk5, type:
migratelp hd2/3/2 hdisk5
LVM split mirror in AIX 4.3.3
Splitting and reintegrating a mirror
The ability to make online backups has long been desired. Especially in installations with mirrored volumes, it has been a requested feature to be able to split the mirror and use one side of it for online backups. A manual split and later reintegration has always been possible, but it was rather complicated and therefore unsafe. In AIX 4.3.3 this feature has been made available with an easy command interface.
A mirrored LV can be divided with the chfs command. The following example splits the LV mounted on /testfs; copy number 3 will be mounted at /backup:
chfs -a splitcopy=/backup -a copy=3 /testfs
The LV is reintegrated in two steps:
# umount /backup
# rmfs /backup
LVM Variable logical track group (LTG)
AIX 5 introduces Variable LTG size to improve disk performance
Today the Logical Volume Manager (LVM) shipped with all versions of AIX has a constant maximum transfer size of 128K, also known within LVM as the Logical Track Group (LTG). All I/O within LVM must be on a Logical Track Group boundary. When AIX was first released, all disks supported 128K. Today many disks go beyond 128K, and the efficiency of many disks, such as RAID arrays, is impacted if the I/O is not a multiple of the stripe size, which is normally larger than 128K.
The enhancements in AIX 5 allow a VG LTG size to be specified at VG creation time, and allow the VG LTG to be changed while the volume group is active, as long as no logical volumes are open. The default LTG size is still 128K; other sizes must be requested by the user. mkvg and chvg will fail if the specified LTG is larger than the max_transfer size of the target disk(s), and extendvg will fail under the same condition. Changing the LTG size is not allowed for disks active in concurrent mode.
Variable LTG size and commands
The LTG now supports the following sizes:
• 128K - default value
• 256K
• 512K
• 1024K
Variable LTG commands:
• mkvg -L <size> - create a new volume group with LTG size = <size>
• chvg -L <size> - change a volume group to LTG size = <size>
• lsvg <volume group> - displays the LTG size
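For example (names illustrative, and assuming the target disk reports a max_transfer of at least 256K), a volume group with a 256K LTG could be created, changed, and verified with:
mkvg -y bigvg -L 256 hdisk2
chvg -L 512 bigvg
lsvg bigvg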
LVM command overview
High level commands
• varyonvg executable
• extendvg shell script
• extendlv shell script
• mkvg shell script
• mklv shell script
• lsvg executable
• lspv executable
• lslv executable
Internal commands
• getlvodm executable
• getvgname executable
• putlvodm executable
• synclvodm executable
• allocp executable
• mapread
• map_alloc
• migfix executable
Low level commands
• lcreatevg executable
• lmigratelv executable
• lquerypv executable
• lqueryvg executable
• lextendlv executable
• lreducelv executable
• lquerylv executable
• lqueryvgs executable
LVM Problem Determination
LVM Problem Determination
The purpose of this section is to help answer the following questions:
• What is the root cause of the error?
• Can this problem be distilled to the simplest case?
• What has been lost and what is left behind?
• Is this situation repairable?
Because, in most cases, each LVM problem is specific to a user and their environment, this section isn't a "how-to" section. Instead, it is mostly a checklist that will help the user gather the information necessary to rationally determine the root cause of the problem and whether it can be fixed in the field, rather than being sent to Level 3 software support. And if the problem must be sent to Level 3, this will suggest information that would speed up Level 3's problem determination and solution.
Find out: what is the root cause of the error?
The first question to ask is whether this problem is really in the LVM layer. The sections that detail how an I/O request is handed down from layer to layer might help clarify all the layers that must be considered. The most important initial determination is whether the problem is above the LVM layer, in the LVM layer, or below the LVM layer. For instance, an application program such as Oracle or HACMP/6000 that accesses the LVM directly might have a problem. If you can determine what actions these failing programs are attempting against the LVM, then try to recreate the action by hand using a method that is not based on those application programs. If your attempt by hand works, then the focus of the problem shifts "up" to the application program. Obviously, if it fails, then you have isolated the problem to the LVM layer (or below). Or the problem could simply be corruption of the data needed by LVM: the programs are behaving correctly, but the data needed by LVM is corrupted, which causes LVM to behave strangely. An additional bonus for the field investigator is the fact that most high-level commands are shell scripts. Thus, investigators familiar with shell programming may turn on shell output and watch the execution of the shell commands to observe the failure point, as sketched below. This might add helpful information to the problem record. Finally, if there is corruption or loss of data required by LVM (such as a disk accidentally erased from a volume group), it helps to find the exact steps performed (or not performed) by the user, so that the investigator can deduce the state of the system and what useful LVM information is left behind.
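As a sketch of this technique (the command, its arguments, and the output file are illustrative; any of the high-level shell-script commands listed in the command overview below could be substituted):
sh -x /usr/sbin/extendvg datavg hdisk4 2> /tmp/extendvg.xtrace
The resulting trace shows each shell statement as it executes, so the failing low-level command can be identified.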
Can this problem be distilled to the simplest case?
Many problem reports from the field to Level 3 concerning LVM are difficult to investigate because clarification is required to determine the root cause, or because the problem is described in terms of a complex user configuration. If possible, the most basic LVM action involved should be the one investigated. This is not always possible, as some problems may only be exposed when running in a complex environment. However, whenever possible, one should try to distill the case down to how an action on a logical volume is causing the system to misbehave. In the course of that clarification, a non-LVM root cause may be discovered instead.
What has been lost and what has been left behind?
This type of question is typically asked of the system when some sort of accident has resulted in data corruption or loss of LVM-required information. Given the state of the system before the corruption, the steps that most likely caused the corruption, and the current state of the machine, one can deduce what is left to work with. Sometimes one will receive conflicting information; this happens when part of the ODM disagrees with part of the VGDA. Of the two, the ODM is the one that is easily altered (compared to the VGDA).
Is this situation repairable?
Sometimes you have enough information to know what is missing and what should be done to repair the system, and yet the design of the ODM, the system configurator, and LVM prevents the repair: by fixing one problem, another is spawned, and one is caught in a deadlock that cannot be broken unless very specific kernel code were written to repair the internal structures of the LVM (most likely the VGDA). This is not a trivial solution, but it is possible. Only through experience can a judgement be made as to whether recovery can be attempted.
Problem Recovery
• Warn the user of the consequences
• Gather all possible data
• Save off what can be saved
• Each case is different, so must be the solution
Although warning the user might seem a trivial step, when you attempt problem recovery you must usually alter or destroy an important internal structure within the LVM (such as the VGDA). Once this is done, if the recovery attempt didn't work, the user's system is usually in worse shape than before the attempt. Many users will decline the recovery attempt once this warning is given. However, it is better to warn them ahead of time!
Gather all possible data
While the volume group is still partially accessible, gather all possible data about the current volume group. The VGDA will provide information about missing logical volumes, which will be important. Once the recovery procedure starts, important reference information such as that gathered from the VGDA will be lost for good, and if your information is incomplete you may be stuck with nowhere to go.
Save off what can be saved
Before starting the recovery, make a copy of any files that could be restored if something goes wrong. A good example would be the ODM database files that reside in /etc/objrepos. Sometimes the recovery steps involve deleting information from those databases, and once it is deleted, if one is unsure of its form, one cannot recreate some of the structures or values.
Each case is different, so must each solution be
Since each LVM problem is most likely unique to that system, these notes cannot provide a list of steps one would take in a repair. Once again, the recovery steps must be based on individual experience with LVM. The LVM lab exercise on recovery provides a glimpse of the complexity and information required to repair a system. However, this lab is just an example, not a template for how all fixes should be attempted.
Trace LVM commands with the trace command
Trace hook 105
Trace HOOK 105 : HKWD KERN LVM
This event is recorded by the Logical Volume Manager for selected events.
LVM relocingblk bp=value pblock=value relblock=value
• Encountered relocated block
• bp=value, Buffer pointer
• pblock=value, Physical block number
• relblock=value, Relocated block number.
LVM oldbadblk bp=value pblock=value state=value bflags
• Bad block waiting to be relocated
• bp=value, Buffer pointer
• pblock=value, Physical block number
• state=value, State of the physical volume
• bflags, Buffer flags are defined in the sys/buf.h file.
LVM badblkdone bp=value
• Block relocation complete
• bp=value, Buffer pointer.
LVM newbadblk bp=value badblock=value error=value bflags
• New bad block found
• bp=value, Buffer pointer
• badblock=value, Block number of bad block
• error=value, System error number (the errno global variable)
• bflags, Buffer flags are defined in the sys/buf.h file.
LVM swreloc bp=value status=value error=value retry=value
• Software relocating bad block
• bp=value, Buffer pointer
• status=value, Bad block directory entry status
• error=value, System error number (the errno global variable)
• retry=value, Relocation entry count.
LVM resyncpp bp=value bflags
• Resyncing Logical Partition mirrors
• bp=value, Buffer pointer
• bflags, Buffer flags are defined in the sys/buf.h file.
LVM open device name flags=value
• Open device
• device name, Name of the device
• flags=value, Open file mode.
LVM close device name
• Close device
• device name, Name of the device.
LVM read device name ext=value
• Read device
• device name, Name of the device
• ext=value, Extension parameters.
LVM write device name ext=value
• Write device
• device name, Name of the device
• ext=value, Extension parameters.
LVM ioctl device name cmd=value arg=value
• ioctl device
• device name, Name of the device
• cmd=value, ioctl command
• arg=value, ioctl arguments.
Example of a trace -a -j105:
ID   ELAPSED_SEC    DELTA_MSEC  APPL SYSCALL KERNEL INTERRUPT
001  0.000000000    0.000000    TRACE ON channel 0  Mon Sep 18 21:52:50 2000
105  20.598330739   6.109275    LVM close: rloglv00
105  20.598415445   0.084706    LVM close: rlv00
Trace hook 10B
Trace HOOK 10B : HKWD KERN LVMSIMP
This event is recorded by the Logical Volume Manager for selected events. The recorded events are:
LVM rblocked: bp=value
• Request blocked by conflict resolution
• bp=value, Buffer pointer.
LVM pend: bp=value resid=value error=value bflags
• End of physical operation
• bp=value, Buffer pointer
• resid=value, Residual byte count
• error=value, System error number (the errno global variable)
• bflags, Buffer flags are defined in the sys/buf.h file.
LVM lstart: device name bp=value lblock=value bcount=value bflags opts: value
• Start of logical operation
• device name, Device name
• bp=value, Buffer pointer
• lblock=value, Logical block number
• bcount=value, Byte count
• bflags, Buffer flags are defined in the sys/buf.h file
• opts: value, Possible values: WRITEV, HWRELOC, UNSAFEREL, RORELOC, NO_MNC, MWC_RCV_OP, RESYNC_OP, AVOID_C1, AVOID_C2, AVOID_C3
Example of a trace -a -j10b:
ID   ELAPSED_SEC   DELTA_MSEC   APPL SYSCALL KERNEL INTERRUPT
001  0.000000000   0.000000     TRACE ON channel 0  Mon Sep 18 21:52:50 2000
10B  0.007512611   7.512611     LVM pend: pbp=F10000971615E580 resid=0000 error=0000 B_WRITE
10B  0.007523970   0.011359     LVM lend: rhd9var lbp=F10000971E17E1A0 resid=0000 error=0000 B_WRITE
10B  8.968758818   8961.234848  LVM lstart: rhd4 lbp=F100009
LVM Library calls
List of Logical Volume Subroutines
The library of LVM subroutines is a main component of the Logical Volume
Manager.
LVM subroutines define and maintain the logical and physical volumes of a
volume group. They are used by the system management commands to
perform system management for the logical and physical volumes of a
system. The programming interface for the library of LVM subroutines is
available to anyone who wishes to provide alternatives to or expand the
function of the system management commands for logical volumes.
Note: The LVM subroutines use the sysconfig system call, which requires
root user authority, to query and update kernel data structures describing a
volume group. You must have root user authority to use the services of the
LVM subroutine library.
The following services are available:
• lvm_querylv - Queries a logical volume and returns all pertinent information.
• lvm_querypv - Queries a physical volume and returns all pertinent information.
• lvm_queryvg - Queries a volume group and returns pertinent information.
• lvm_queryvgs - Queries the volume groups of the system and returns information for groups that are varied on-line.
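A minimal sketch of calling one of these services is shown below. It is not from the course material: the queryvgs structure and its field names (num_vgs, vgs[].major_num) are assumed to match the declarations in <lvm.h>, the subroutine is assumed to allocate the result buffer itself, and the kernel module ID argument of 0 is an assumption. The program must run with root authority, since the library uses sysconfig internally.

#include <stdio.h>
#include <stdlib.h>
#include <lvm.h>

int main(void)
{
    struct queryvgs *qv = NULL;  /* assumed to be allocated by the call */
    int rc;
    long i;

    rc = lvm_queryvgs(&qv, (mid_t)0);  /* Kmid of 0 is an assumption */
    if (rc != 0) {
        fprintf(stderr, "lvm_queryvgs failed, rc=%d\n", rc);
        return 1;
    }
    printf("%ld volume group(s) varied on\n", qv->num_vgs);
    for (i = 0; i < qv->num_vgs; i++)
        printf("  VG major number %ld\n", qv->vgs[i].major_num);
    free(qv);
    return 0;
}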
Logical volume device driver (LVDD)
LVM logical volume device driver
The Logical Volume Device Driver (LVDD) is a pseudo-device driver that
operates on logical volumes through the /dev/lvn special file. Like the
physical disk device driver, this pseudo-device driver provides character
and block entry points with compatible arguments. Each volume group has
an entry in the kernel device switch table. Each entry contains entry points
for the device driver and a pointer to the volume group data structure. The
logical volumes of a volume group are distinguished by their minor device
numbers.
• Attention: Each logical volume has a control block located in the first
512 bytes. Data begins in the second 512-byte block. Care must be
taken when reading and writing directly to the logical volume, because
the control block is not protected from writes. If the control block is
overwritten, commands that use it can no longer be used.
Character I/O requests are performed by issuing a read or write request on
a /dev/rlvn character special file for a logical volume. The read or write is
processed by the file system SVC handler, which calls the LVDD ddread or
ddwrite entry point. The ddread or ddwrite entry point transforms the
character request into a block request. This is done by building a buffer for
the request and calling the LVDD ddstrategy entry point.
Block I/O requests are performed by issuing a read or write on a block
special file /dev/lvn for a logical volume. These requests go through the
SVC handler to the bread or bwrite block I/O kernel services. These
services build buffers for the request and call the LVDD ddstrategy entry
point. The LVDD ddstrategy entry point then translates the logical address
to a physical address (handling bad block relocation and mirroring) and
calls the appropriate physical disk device driver.
On completion of the I/O, the physical disk device driver calls the iodone
kernel service on the device interrupt level. This service then calls the
LVDD I/O completion-handling routine. Once this is completed, the LVDD
calls the iodone service again to notify the requester that the I/O is
completed.
The LVDD is logically split into top and bottom halves. The top half
contains the ddopen, ddclose, ddread, ddwrite, ddioctl, and ddconfig entry
points. The bottom half contains the ddstrategy entry point, which contains
block read and write code. This is done to isolate the code that must run
fully pinned and has no access to user process context. The bottom half of
the device driver runs on interrupt levels and is not permitted to page fault.
The top half runs in the context of a process address space and can page fault.
Disk Device Calls
scsidisk, SCSI Disk Device Driver
This driver supports the small computer system interface (SCSI) and the
Fibre Channel Protocol for SCSI (FCP) fixed disk, CD-ROM (compact disk
read only memory), and read/write optical (optical memory) devices.
Syntax
#include <sys/devinfo.h>
#include <sys/scsi.h>
#include <sys/scdisk.h>
Device-Dependent Subroutines
Typical fixed disk, CD-ROM, and read/write optical drive operations are
implemented using the open, close, read, write, and ioctl subroutines.
open and close Subroutines:
The openx subroutine is intended primarily for use by the diagnostic
commands and utilities. Appropriate authority is required for execution.
The ext parameter passed to the openx subroutine selects the operation to
be used for the target device. The /usr/include/sys/scsi.h file defines
possible values for the ext parameter.
rhdisk Special File
Provides raw I/O access to the physical volume (fixed-disk) device driver.
The rhdisk special file provides raw I/O access and control functions to physical-disk device drivers for physical disks. Raw I/O access is provided through the /dev/rhdisk0, /dev/rhdisk1, ..., character special files.
Direct access to physical disks through block special files should be avoided. Such access can impair performance and also cause data consistency problems between data in the block I/O buffer cache and data in system pages. The /dev/hdisk block special files are reserved for system use in managing file systems, paging devices and logical volumes.
The r prefix on the special file name indicates that the drive is to be accessed as a raw device rather than a block device.
Disk low level Device Calls such as SCSI calls
SCSI Adapter Device Driver
The SCSI device driver has access to the physical disk (if it is a SCSI disk). The driver supports data transfers via read and write, and control commands via ioctl calls. The disk device driver uses the adapter device driver to access and control the physical storage device.
Supports the SCSI adapter.
Syntax
#include <sys/scsi.h>
#include <sys/devinfo.h>
Description
The /dev/scsin and /dev/vscsin special files provide interfaces to allow
SCSI device drivers to access SCSI devices. These files manage the
adapter resources so that multiple SCSI device drivers can access devices
on the same SCSI adapter simultaneously. The /dev/vscsin special file
provides the interface for the SCSI-2 Fast/Wide Adapter/A and SCSI-2
Differential Fast/Wide Adapter/A, while the /dev/scsin special file provides
the interface for the other SCSI adapters. SCSI adapters are accessed
through the special files /dev/scsi0, /dev/scsi1, .... and /dev/vscsi0, /dev/
vscsi1, ....
The /dev/scsin and /dev/vscsin special files provide interfaces for access
for both initiator and target mode device instances. The host adapter is an
initiator for access to devices such as disks, tapes, and CD-ROMs. The
adapter is a target when accessed from devices such as computer
systems, or other devices that can act as SCSI initiators.
For further information, see the Kernel and Subsystems Technical Reference, Volume 2, and the Files Reference manual.
Exercises
Examine the physical disk layout of a logical volume and a physical volume
Use a tool such as edhx, hexit, dd or similar to look at a physical volume. Identify the PVID, the VGID, and the LVM structures.
Hint: consider which device you should use to access these data. It may be easier to copy data from the drive to a file with the dd command:
dd if=/dev/xxx of=/tmp/Myfile bs=1024k count=<number of MB>
Then use another device to look at the logical volume. Does the data match what you saw on the physical device?
Examine the impact of LVM Passive Mirror Write Consistency
This exercise looks at the performance impact of enabling and disabling MWC. To do this we need a reproducible write load; one way to get one is to write a C program that creates the load. Remember that the file has to be really big, so that it exceeds the cache size, or you must force a sync to occur before terminating.
Sample C code to write a big file (completed here with the includes, a BLOCKS definition, and a main so that it compiles; adjust BLOCKS so the file exceeds the cache size):

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

#define BLOCKS (256 * 1024)  /* 256K writes of 512 bytes = 128MB */

void writetstfile()
{
    char buffer[512];
    char *filename = "/test/a_large_file";
    register int i;
    int fildes;

    if ((fildes = creat(filename, 0640)) < 0) {
        printf("cannot create file\n");
        exit(1);
    }
    close(fildes);
    if ((fildes = open(filename, 1)) < 0) {
        printf("cannot open file for write\n");
        exit(1);
    }
    for (i = 0; i < BLOCKS; i++)
        if (write(fildes, buffer, 512) < 0) {
            printf("error writing block %d\n", i);
            exit(1);
        }
    close(fildes);
}

int main()
{
    writetstfile();
    return 0;
}
Examine the function of the LVM LTG
The LTG is the LVM Logical Track Group: the amount of data read from or written to the disk in each operation. Try to monitor the amount of data and the number of disk transactions per second during I/O. Both can be monitored with the iostat command.
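For example (the disk name and intervals are illustrative), per-disk throughput and transactions per second can be watched with:
iostat -d hdisk0 1 10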
Test the split mirror facility
Test the "splitting and reintegrating" facility of a mirror. First create a mirrored LV and write data to it. Then split the mirror and access data from both sides. Change data on the "primary side" and then reintegrate the mirror. What happens?
How fast are the mirrors reintegrated?
Are they really synchronized?
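One possible command sequence for this exercise is sketched below; all names are placeholders, and the splitcopy interface is the chfs call shown in the split mirror section above:
mklv -y splitlv -c 2 testvg 4
crfs -v jfs -d splitlv -m /testfs
mount /testfs
cp /etc/hosts /testfs
chfs -a splitcopy=/backup -a copy=2 /testfs
diff /testfs/hosts /backup/hosts
umount /backup
rmfs /backup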
Exercise: Trace LVM system activity
In this exercise we use the trace command to monitor LVM activity. Start, stop, and list the results of an LVM trace with the commands:
trace -a -j105 -j10b
trcstop
trcrpt > <filename>
Try to unmount a filesystem, mount it again, create a file, and write data into the file, to generate some activity in the LVM trace file.
Unit 12. Enhanced Journaled File System
Objectives
After completing this unit, you should be able to
• List the difference between the terms aggregate and fileset.
• Identify the various data structures that make up the JFS-2
filesystem.
• Use the fsdb command to trace the various data structures that
make up the logical and virtual file system.
References
SCnn-nnnn
Title of Reference
http://www.yoururl.com
WEB Page Name
J2 - Enhanced Journaled File System
Introduction
The Enhanced Journaled File System (JFS2) is an extent-based journaled file system. It is the default filesystem on IA-64 systems and is available on Power-based systems. Currently the default on Power systems is the Journaled File System (JFS).
Numbers
The following table lists some general information about JFS2:

Function                        Value
Block Size                      512 - 4096 bytes, configurable
Architectural max. file size    4 Petabytes
Max. file size tested           1 Terabyte
Max. file system size           1 Terabyte
Number of Inodes                Dynamic, limited by disk space
Directory Organization          B-tree
Aggregate
Introduction
The term aggregate is defined in this section. The layout of a JFS2
aggregate is described.
Definitions
JFS2 separates the notion of a disk space allocation pool, called an aggregate, from the notion of a mountable file system sub-tree, called a fileset. The rules that define aggregates and filesets in JFS2 are:
• There is exactly one aggregate per logical volume.
• There may be multiple filesets per aggregate.
• In the first release of AIX 5L, only one fileset per aggregate is supported.
• The meta-data has been designed to support multiple filesets, and this feature may be introduced in a future release of AIX 5.
The terms aggregate and fileset in this document correspond to their DCE/DFS (Distributed Computing Environment Distributed File System) usage.
Aggregate block size
An aggregate has a fixed block size (number of bytes per block) that is defined at configuration time. The aggregate block size defines the smallest unit of space allocation supported on the aggregate. The block size cannot be altered, and must be no smaller than the physical block size (currently 512 bytes). Legal aggregate block sizes are:
• 512 bytes
• 1024 bytes
• 2048 bytes
• 4096 bytes
Do not confuse the aggregate block size with the logical volume block size, which defines the smallest unit of I/O.
Aggregate layout
The following table details the layout of the aggregate. (The diagram accompanying this section, not reproducible here, shows an example aggregate with a 1KB aggregate block size: the reserved area at the front, the primary aggregate superblock, the Aggregate Inode Table of 32 inodes placed 4K after the primary superblock, the first extent of the Aggregate Inode Allocation Map with its IAG control section, working and persistent maps, the secondary aggregate superblock, and sample aggregate inodes #1 ("self"), #2 (block map) and #16 (fileset 0) with their xad entries.)

Part and Function:
• Reserved area - A 32K area at the front of the aggregate not used by JFS2. The first block is used by the LVM.
• Primary aggregate superblock - The primary aggregate superblock (defined as a struct superblock) contains aggregate-wide information such as the size of the aggregate, the size of allocation groups, and the aggregate block size. The superblocks are at fixed locations, which allows JFS2 to always find them without depending on any other information.
• Secondary aggregate superblock - A direct copy of the primary aggregate superblock. The secondary aggregate superblock is used if the primary aggregate superblock is corrupted.
• Aggregate inode table - Contains inodes that describe the aggregate-wide control structures. These inodes are described below.
• Secondary aggregate inode table - Contains replicated inodes from the Aggregate Inode Table. Since the inodes in the Aggregate Inode Table are critical for finding any file system information, each is replicated in the Secondary Aggregate Inode Table. The actual data for the inodes is not repeated, just the addressing structures used to find the data and the inode itself.
• Aggregate inode allocation map - Describes the Aggregate Inode Table. It contains allocation state information on the aggregate inodes as well as their on-disk location.
• Secondary aggregate inode allocation map - Describes the Secondary Aggregate Inode Table.
• Block allocation map - Describes the control structures for allocating and freeing aggregate disk blocks within the aggregate. The Block Allocation Map maps one-to-one with the aggregate disk blocks.
• fsck working space - Provides space for fsck to track the aggregate block allocations. This space is necessary because, for a very large aggregate, there might not be enough memory to track this information in memory when fsck is run. The space is described by the superblock; one bit is needed for every aggregate block. The fsck working space always exists at the end of the aggregate.
• In-line log - Provides space for logging the meta-data changes of the aggregate. The space is described by the superblock. The in-line log always exists following the fsck working space.
Aggregate Inodes
When the aggregate is initially created, the first inode extent is allocated; additional inode extents are allocated and de-allocated dynamically as needed. These aggregate inodes each describe certain aspects of the aggregate itself, as follows:
Inode # and Description:
• 0 - Reserved.
• 1 - Called the "self" inode, this inode describes the aggregate disk blocks comprising the aggregate inode map. This is a circular representation, in that aggregate inode one is itself in the file that it describes. The obvious circular representation problem is handled by forcing at least the first aggregate inode extent to appear at a well-known location, namely 4K after the Primary Aggregate Superblock. Therefore, JFS2 can easily find aggregate inode one, and from there it can find the rest of the Aggregate Inode Table by following the B+–tree in inode one.
• 2 - Describes the Block Allocation Map.
• 3 - Describes the In-line Log when mounted. This inode is allocated, but no data is saved to disk.
• 4 - 15 - Reserved for future extensions.
• 16 and up - Starting at aggregate inode 16 there is one inode per fileset, the Fileset Allocation Map Inode. These inodes describe the control structures that represent each fileset. As additional filesets are added to the aggregate, the aggregate inode table itself may have to grow to accommodate additional fileset inodes.
Allocation Groups
Introduction
Allocation Groups (AG) divide the space on an aggregate into chunks, and
allow JFS2 resource allocation policies to use well known methods for
achieving good JFS2 I/O performance.
Allocation policies
When locating data on the disk, JFS2 attempts to:
• group disk blocks for related data and inodes close together.
• distribute unrelated data throughout the aggregate.
Allocation Group Sizes
Allocation group sizes must be selected that yield allocation groups sufficiently large to provide for contiguous resource allocation over time. The allocation group size is stored in the aggregate superblock. The rules for setting the allocation group size are:
• The maximum number of allocation groups per aggregate is 128.
• The minimum size of an allocation group is 8192 aggregate blocks (for example, 32MB with a 4096-byte aggregate block size).
• The allocation group size must always be a power-of-2 multiple of the number of blocks described by one dmap page (i.e. 1, 2, 4, 8, ... dmap pages).
Partial Allocation Group
An aggregate whose size is not a multiple of the allocation group size contains a partial allocation group: the last group is not fully covered by disk blocks. This partial allocation group is treated as a complete allocation group, except that the non-existent disk blocks are marked as allocated in the Block Allocation Map.
Filesets
Introduction
A fileset is a set of files and directories that form an independently
mountable sub-tree, equivalent to a Unix file system file hierarchy. A fileset
is completely contained within a single aggregate.
Layout
The following table details the layout of a fileset. (The illustration accompanying this section, not reproducible here, shows an example Fileset Inode Table with its control page, the first and second extents of the Fileset Inode Allocation Map with their IAG control sections, working and persistent maps, the per-allocation-group free inode lists, the IAG free list, and fileset inode #2, the root directory.)

Part and Function:
• Fileset Inode table - Contains inodes describing the fileset-wide control structures. The Fileset Inode Table logically contains an array of inodes.
• Fileset Inode allocation map - Describes the Fileset Inode Table. The Fileset Inode Allocation Map contains allocation state information on the fileset inodes as well as their on-disk location.
• Inodes - Every JFS2 object is represented by an inode, which contains the expected object-specific information such as time stamps and file type (regular vs. directory, etc.). Inodes also "contain" a B+–tree to record the allocation of extents. Note specifically that all JFS2 meta-data structures (except for the superblock) are represented as "files." By reusing the inode structure for this data, the data format (on-disk layout) becomes inherently extensible.
Super Inode
Super inodes, found in the Aggregate Inode Table (inode 16 and greater), describe the Fileset Inode Allocation Map and other fileset information. Since the Aggregate Inode Table is replicated, there is also a secondary version of each of these inodes, pointing to the same data.
Inodes
When the fileset is initially created, the first inode extent is allocated; additional inode extents are allocated and de-allocated dynamically as needed. The inodes in a fileset are allocated as follows:
Fileset Inode # and Description:
• 0 - Reserved.
• 1 - Additional fileset information that would not fit in the Fileset Allocation Map Inode in the Aggregate Inode Table.
• 2 - The root directory inode for the fileset.
• 3 - The ACL file for the fileset.
• 4 and up - Fileset inodes from four onwards are used by ordinary fileset objects: user files, directories, and symbolic links.
Extents
Introduction
Disk space in a JFS2 filesystem is allocated in a sequence of contiguous
aggregate blocks called an extent.
Extent rules
An extent is:
• made up of a series of contiguous aggregate blocks.
• variable in size, ranging from 1 to 2^24 - 1 aggregate blocks.
• wholly contained within a single aggregate; large extents may span multiple allocation groups.
• indexed in a B+–tree.
Extent Allocation Descriptor
Extents are described by an xad structure. The two main values describing an extent are its length and its address; in an xad, both are expressed in units of the aggregate block size. Details of the xad data structure are shown below.
struct xad {
    uint8   xad_flag;
    uint16  xad_reserved;
    uint40  xad_offset;
    uint24  xad_length;
    uint40  xad_address;
};

Member and Description:
• xad_flag - Flags set on this extent. See /usr/include/j2/j2_xtree.h for a list of flags.
• xad_reserved - Reserved for future use.
• xad_offset - Extents are generally grouped together to form a larger group of disk blocks. The xad_offset describes the logical byte address this extent represents in the larger group.
• xad_length - A 24-bit field containing the length of the extent in aggregate blocks. An extent can range in size from 1 to 2^24 - 1 aggregate blocks.
• xad_address - A 40-bit field containing the address of the first block of the extent. The address is in units of aggregate blocks and is the block offset from the beginning of the aggregate.
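C has no 40- or 24-bit integer types, so the uint40 and uint24 members above are really packed fields inside the 16-byte on-disk record. The sketch below is illustrative only: the field order follows the table above, but the byte-level packing and byte order are assumptions, not the verified layout from j2_xtree.h.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical unpacked view of the 16-byte on-disk xad; the real
 * declaration lives in /usr/include/j2/j2_xtree.h. */
struct xad_unpacked {
    uint8_t  flag;     /*  8-bit flags */
    uint64_t offset;   /* 40-bit logical offset */
    uint32_t length;   /* 24-bit length in aggregate blocks */
    uint64_t address;  /* 40-bit address in aggregate blocks */
};

/* Decode assuming: 8-bit flag, 16 reserved bits, 40-bit offset,
 * 24-bit length, 40-bit address, packed big-endian into 16 bytes
 * (8+16+40+24+40 = 128 bits). The byte order is an assumption. */
static void xad_decode(const uint8_t raw[16], struct xad_unpacked *x)
{
    x->flag    = raw[0];                              /* raw[1..2] reserved */
    x->offset  = ((uint64_t)raw[3] << 32) | ((uint64_t)raw[4] << 24) |
                 ((uint64_t)raw[5] << 16) | ((uint64_t)raw[6] << 8)  |
                  (uint64_t)raw[7];
    x->length  = ((uint32_t)raw[8] << 16) | ((uint32_t)raw[9] << 8) |
                  (uint32_t)raw[10];
    x->address = ((uint64_t)raw[11] << 32) | ((uint64_t)raw[12] << 24) |
                 ((uint64_t)raw[13] << 16) | ((uint64_t)raw[14] << 8) |
                  (uint64_t)raw[15];
}

int main(void)
{
    uint8_t raw[16] = {0};
    struct xad_unpacked x;

    raw[10] = 8;   /* length = 8 blocks (echoing the layout diagram) */
    raw[15] = 36;  /* address = block 36 */
    xad_decode(raw, &x);
    printf("len=%u addr=%llu\n", x.length, (unsigned long long)x.address);
    return 0;
}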
Allocation Policy
In general, the allocation policy for JFS2 tries to maximize contiguous allocation by allocating a minimum number of extents, with each extent as large and contiguous as possible. This allows for larger I/O transfers, resulting in improved performance. However, in special cases this is not always possible. For example, a copy-on-write clone of a segment will cause a contiguous extent to be partitioned into a sequence of smaller contiguous extents. Another case is restriction of the extent size: for example, the extent size is restricted for compressed files, since the entire extent must be read into memory and decompressed. Because only a limited amount of memory is available, there must be enough room for the decompressed extent.
Fragmentation
An extent-based file system combined with a user-specified aggregate block size allows JFS2 to avoid separate support for internal fragmentation. The user can configure the aggregate with a small aggregate block size (e.g., 512 bytes) to minimize internal fragmentation for aggregates with large numbers of small files.
A defragmentation utility will be provided to reduce the external fragmentation that results from dynamic allocation and de-allocation of variable-size extents, which can leave disconnected free extents of variable size all over the aggregate. The defragmentation utility will coalesce multiple small free extents into single larger extents.
Binary Trees of Extents
Introduction
Objects in JFS2 are stored in groups of extents arranged in binary trees. The concepts behind these trees are introduced in this section.
Trees
Binary trees consist of nodes arranged in a tree structure. Each node contains a header describing the node. A flag in the node header identifies the role of the node in the tree.
(The figure accompanying this section, not reproducible here, shows a root node with flags=BT_ROOT whose array of extent descriptors (xads) points to an internal node with flags=BT_INTERNAL and to leaf nodes with flags=BT_LEAF; each node consists of a header followed by an array of xad extent descriptors.)
Header flags
This table describes the binary tree header flags.
Flag and Description:
• BT_ROOT - The root or top of the tree.
• BT_LEAF - The bottom of a branch of a tree. Leaf nodes point to the extents containing the object's data.
• BT_INTERNAL - An internal node points to two or more leaf nodes or other internal nodes.
Why B+-tree
B+–trees are used in JFS2 to help performance by providing:
• fast reading and writing of extents - the most common operations.
• fast search for reading a particular extent of a file.
• efficient append or insert of an extent in a file.
• efficient traversal of an entire B+–tree.
B+-tree index
There is one generic B+–tree index structure for all index objects in JFS2
except for directories. The data being indexed depends upon the object.
The B+–tree is keyed by offset of the xad structure of the data being
described by the tree. The entries are sorted by the offsets of the xad
structures, each of which is an entry in a node of a B+–tree.
Root node header
The file j2_xtree.h describes the header for the root of the B+–tree in struct xtpage_t.
#define XTPAGEMAXSLOT 256

typedef union {
    struct xtheader {
        int64   next;       /* 8: */
        int64   prev;       /* 8: */
        uint8   flag;       /* 1: */
        uint8   rsrvd1;     /* 1: */
        int16   nextindex;  /* 2: next index = # of entries */
        int16   maxentry;   /* 2: max number of entries */
        int16   rsrvd2;     /* 2: */
        pxd_t   self;       /* 8: self */
    } header;               /* (32) */
    xad_t xad[XTPAGEMAXSLOT]; /* 16 * maxentry: xad array */
} xtpage_t;
Leaf node
header
-- continued
The file j2_btree.h describes the header for an internal node or a leaf node
in struct btpage_t.
typedef struct {
    int64  next;         /* 8: right sibling bn */
    int64  prev;         /* 8: left sibling bn */
    uint8  flag;         /* 1: */
    uint8  rsrvd[7];     /* 7: type specific */
    int64  self;         /* 8: self address */
    uint8  entry[4064];  /* 4064: */
} btpage_t;
Inodes
Overview
Every file on a JFS2 filesystem is described by an on-disk inode. The inode
holds the root header for the extent binary tree. File attribute data and
block allocation maps are also kept in the inode.
Inode Layout
The inode is a 512 byte structure, split into four 128 byte sections
described here.
[Figure: the inode layout - a 512-byte inode divided into four 128-byte sections.]

Section  Description
1        The POSIX attributes of the JFS2 object, including the inode
         and fileset number, object type, object size, user id, group id,
         access time, modified time, created time and more.
2        Headers describing the inode data. This section contains
         several parts:
         • descriptors for extended attributes
         • block allocation maps
         • inode allocation maps
         • a header pointing to the data (B+-tree root, directory,
           in-line data)
3        This section can contain one of the following:
         • in-line file data, for very small files (up to 128 bytes)
         • the first 8 xad structures describing the extents for this file
4        This section extends section 3 by providing additional storage
         for more extended attributes, xad structures or in-line data.
Structure
The current definition of the on-disk inode structure is
struct dinode {
    /* I. base area (128 bytes)
     * define generic/POSIX attributes */
    ino64_t  di_number;   /* 8: inode number, aka file serial number */
    uint32   di_gen;      /* 4: inode generation number */
    uint32   di_fileset;  /* 4: fileset #, inode # of inode map file */
    uint32   di_inostamp; /* 4: stamp to show inode belongs to fileset */
    uint32   di_rsv1;     /* 4: */
    pxd_t    di_ixpxd;    /* 8: inode extent descriptor */
    int64    di_size;     /* 8: size */
    int64    di_nblocks;  /* 8: number of blocks allocated */
    uint32   di_uid;      /* 4: uid_t user id of owner */
    uint32   di_gid;      /* 4: gid_t group id of owner */
    int32    di_nlink;    /* 4: number of links to the object */
    uint32   di_mode;     /* 4: mode_t attribute, format and permission */
    j2time_t di_atime;    /* 16: time last data accessed */
    j2time_t di_ctime;    /* 16: time last status changed */
    j2time_t di_mtime;    /* 16: time last data modified */
    j2time_t di_otime;    /* 16: time created */

    /* II. extension area (128 bytes)
     * extended attributes for file system (96); */
    ead_t    di_ea;       /* 16: ea descriptor */
    union {
        uint8 _data[80];
        /* block allocation map */
        struct {
            struct bmap *__bmap;        /* incore bmap descriptor */
        } _bmap;
        /* inode allocation map (fileset inode 1st half) */
        struct {
            uint32        _gengen;      /* di_gen generator */
            struct inode  *__ipimap2;   /* replica */
            struct inomap *__imap;      /* incore imap control */
        } _imap;
    } _data2;

    /* B+-tree root header (32)
     * B+-tree root node header, or dtroot_t for directory,
     * or data extent descriptor for inline data; */
    union {
        struct {
            int32 _di_rsrvd[4];  /* 16: */
            dxd_t _di_dxd;       /* 16: data extent descriptor */
        } _xd;
        int32   _di_btroot[8];   /* 32: xtpage_t or dtroot_t */
        ino64_t _di_parent;      /* 8: idotdot in dtroot_t */
    } _data2r;

    /* III. type-dependent area (128 bytes)
     * B+-tree root node xad array or inline data */
    union {
        uint8 _data[128];        /* B+-tree root node/inline data area */
        struct {
            uint8 _xad[128];
        } _file;
        /* device special file */
        struct {
            dev64_t _rdev;       /* 8: dev_t device major and minor */
        } _specfile;
        /* symbolic link.
         * link is stored in inode if its length is less than
         * IDATASIZE. Otherwise stored like a regular file. */
        struct {
            uint8 _fastsymlink[128];
        } _symlink;
    } _data3;

    /* IV. type-dependent extension area (128 bytes)
     * user-defined attribute, or
     * inline data continuation, or
     * B+-tree root node continuation */
    union {
        uint8 _data[128];
    } _data4;
};
Allocation Policy
JFS2 allocates inodes dynamically, which provides the following
advantages:
• Inode disk blocks can be placed at any disk address, which decouples
  the inode number from the location. This decoupling simplifies
  supporting aggregate and fileset reorganization to enable shrinking
  the aggregate: inodes can be moved and still retain the same number,
  so there is no need to search the directory structure to update inode
  numbers.
• There is no need to allocate "ten times as many inodes as you will ever
  need", as with filesystems that contain a fixed number of inodes, so
  filesystem space utilization is optimized. This is especially important
  with the larger inode size of 512 bytes in JFS2.
• File allocation for large files can consume multiple allocation groups
  and still be contiguous. Static allocation forces a gap containing the
  initially allocated inodes in each allocation group; with dynamic
  allocation, all the blocks in an allocation group can be used for data.
Dynamic inode allocation also causes a number of problems, including:
• With static allocation the geometry of the file system implicitly
  describes the layout of inodes on disk; with dynamic allocation,
  separate mapping structures are required.
• The inode mapping structures are critical to JFS2 integrity. Due to the
  overhead involved in replicating these structures, we accept the risk of
  losing these maps; however, replicating the B+-tree structures allows
  us to find the maps again.
Inode extents
Inodes are allocated dynamically by allocating inode extents, which are
simply contiguous chunks of inodes on the disk. By definition, a JFS2
inode extent contains 32 inodes. With a 512 byte inode size, an inode
extent therefore occupies 16KB on the disk.
Inode initialization
When a new inode extent is allocated, the extent is not initialized.
However, for fsck to be able to check whether an inode is in use, JFS2
needs some information in it: once an inode in an extent is marked
in-use, its fileset number, inode number, inode stamp, and inode
allocation group block address are initialized. Thereafter, the link field
is sufficient to determine whether the inode is currently in use.
Inode Allocation Map
Dynamic inode allocation implies that there is no direct relationship
between an inode number and the disk address of the inode. Therefore we
must have a means of finding the inodes on disk. The Inode Allocation
Map provides this function.
Inode Generation Numbers
Inode generation numbers are simply counters that will increment each
time an inode is reused. Network file system protocols such as NFS
(implicitly) require them; they form part of the file identifier manipulated by
VNOP_FID() and VFS_VGET().
The static-inode-allocation practice of storing a per-inode generation
counter doesn’t work with dynamic inode allocation, because when an
inode becomes free its disk space may literally be reused for something
other than an inode (e.g., the space may be reclaimed for ordinary file data
storage). Therefore, in JFS2 there is simply one inode generation counter
that is incremented on every inode allocation (rather than one counter per
inode that would be incremented when that inode is reused).
Although a fileset-wide generation counter will recycle faster than a
per-inode generation counter, a simple calculation shows that the 32-bit
value is still sufficient to meet NFS or DFS requirements.
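To make the calculation concrete (an illustrative estimate, not from the course text): a 32-bit counter wraps only after 2^32, roughly 4.3 billion, allocations. Even at a sustained 1,000 inode allocations per second the counter takes about 50 days to wrap, and a stale NFS file handle is honored incorrectly only if the same inode number happens to be reallocated at exactly the same counter value, so an accidental match is very unlikely.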
File Data Storage
Overview
This section introduces the data structures used to describe where a file’s
data is stored.
In-line data
If a file contains only a small amount of data, the data may be stored in
the inode itself. This is called in-line storage. The header found in the
second section of the inode points to the data, which is stored in the
third and fourth sections of the inode.
[Figure: an inode holding in-line data - the header for in-line data points to the data stored within the inode.]
Binary trees
When more storage is needed than can be provided in-line, the data must
be placed in extents. The header in the inode now becomes the binary
tree root header. If there are 8 or fewer extents for the file, the xad
structures describing the extents are contained in the inode. An inode
containing 8 or fewer xad structures would look like:
[Figure: an inode whose B+-tree header and 8-entry xad array are held in-line. Three entries are in use: (offset 0, addr 68, length 4) describing 16KB of data; (offset 4096, addr 84, length 12) describing 48KB of data; and (offset 26624, addr 256, length 2) describing 8KB of data.]
INLINEEA bit
Once the 8 xad structures in the inode are filled, an attempt is made to use
the last quadrant of the inode for more xad structures. If the INLINEEA bit
is set in the di_mode field of the inode, then the last quadrant of the inode
is available for 8 more xad structures.
More extents
Once all of the available xad structures in the inode are used, the B+–tree
must be split. 4K of disk space is allocated for a leaf node of the B+–tree,
which is logically an array of xad entries with a header. The 8 xad entries
are moved from the inode to the leaf node, and the header is initialized to
point to the 9th entry as the first free entry. The first xad structure in the
inode is updated to point to the newly allocated leaf node, and the inode
header is updated to indicate that only one xad structure is now being
used, and that it contains the pure root of a B+-tree. The offset for this new
xad structure contains the offset of the first entry in the leaf node.
The organization of the inode now looks like:
[Figure: the inode after the first split. The inode's B+-tree header and first xad entry (offset 0, addr 412, length 4) now point to a 4K leaf node holding up to 254 xad entries; the leaf node's entries describe the file's 16KB, 48KB and 8KB data extents.]
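The split just described can be summarized in code. The following is a self-contained, simplified sketch of the root split - the types, field names and helper logic are stand-ins invented for illustration, not the actual JFS2 structures or routines:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BT_ROOT     0x01
#define BT_LEAF     0x02
#define BT_INTERNAL 0x04
#define NXAD_INLINE 8          /* xad slots held in the inode */
#define NXAD_LEAF   254        /* xad slots in a 4K leaf node */

typedef struct { uint64_t off, addr; uint32_t len; } xad_t;

typedef struct {               /* simplified 4K leaf node */
    uint8_t flag;
    int     nextindex;         /* first free entry */
    xad_t   xad[NXAD_LEAF];
} node_t;

typedef struct {               /* stand-in for the inode's tree root */
    uint8_t flag;
    int     nextindex;
    xad_t   xad[NXAD_INLINE];
} iroot_t;

static node_t *split_root(iroot_t *root)
{
    node_t *leaf = calloc(1, sizeof *leaf);  /* "allocate 4K on disk" */

    /* move the 8 in-inode entries to the leaf; 9th slot is first free */
    memcpy(leaf->xad, root->xad, NXAD_INLINE * sizeof(xad_t));
    leaf->flag = BT_LEAF;
    leaf->nextindex = NXAD_INLINE;

    /* the inode's first xad now routes to the leaf; its offset (the
     * B+-tree key) is the offset of the first entry in the leaf node */
    root->xad[0] = (xad_t){ .off = leaf->xad[0].off,
                            .addr = 412 /* pretend disk address */,
                            .len = 1 };
    root->flag = BT_ROOT | BT_INTERNAL;      /* pure root */
    root->nextindex = 1;                     /* only one entry in use */
    return leaf;
}

int main(void)
{
    iroot_t root = { .flag = BT_ROOT | BT_LEAF, .nextindex = NXAD_INLINE };
    for (int i = 0; i < NXAD_INLINE; i++)
        root.xad[i] = (xad_t){ .off = i * 4, .addr = 68 + i * 4, .len = 4 };

    node_t *leaf = split_root(&root);
    printf("root entries: %d, leaf entries: %d\n",
           root.nextindex, leaf->nextindex);
    free(leaf);
    return 0;
}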
Continuing to add extents
As new extents are added to the file, they continue to be added to the leaf
node in the necessary order, until the node fills. Once the node fills a new
4K of disk space is allocated for another leaf node of the B+–tree, and the
second xad structure from the inode is set to point to this newly allocated
node. The node now looks like:
[Figure: a second leaf node. The inode's first two xad entries (offset 0, addr 412, length 4 and offset 750, addr 560, length 4) now point to two 4K leaf nodes, each holding up to 254 xad entries describing the file's data extents.]
Another split
As extents are added to the file, this behavior continues until all 8 xad
structures in the inode contain leaf node xad structures, at which time
another split of the B+-tree occurs. This split creates an internal node of
the B+-tree, which is used purely to route searches of the tree. An
internal node looks exactly like a leaf node. 4K of disk space is allocated
for the internal node of the B+-tree, the 8 xad entries for the leaf nodes
are moved from the inode to the newly created internal node, and the
internal node header is initialized to point to the 9th entry as the first
free entry. The root of the B+-tree is then updated by making the inode's
first xad structure point to the newly allocated internal node, and the
header in the inode is updated to indicate that now only 1 xad structure
is being used for the B+-tree.
As extents continue to be added, additional leaf nodes are created to
contain the xad structures for the extents, and these leaf nodes are added
to the internal node.
Once the first internal node is filled, a second internal node is allocated,
and the inode's second xad structure is updated to point to the new
internal node.
This behavior continues until all 8 of the inode's xad structures contain
internal nodes.
[Figure: the tree after the second split. The inode's first two xad entries (offset 0, addr 380, length 4 and offset 8340, addr 212, length 4) point to internal nodes, each holding up to 254 xad entries; the internal nodes route to leaf nodes (e.g. at addr 412 and addr 560), whose xad entries describe the file's data extents.]
fsdb Utility
Introduction
The fsdb command enables you to examine, alter, and debug a file
system.
Starting fsdb
It is best to run fsdb against an unmounted filesystem. Use the following
syntax to start fsdb:
fsdb <path to logical volume>
For example:
# fsdb /dev/lv00
Aggregate Block Size: 512
>
Supported filesystems
fsdb supports both the JFS and JFS2 file systems. The commands
available in fsdb differ depending on which filesystem type it is
running against. The following explains how to use fsdb with a JFS2 file
system.
Commands
The commands available in fsdb can be viewed with the help command
as shown here.
> help
Xpeek Commands
a[lter] <block> <offset> <hex string>
b[tree] <block> [<offset>]
dir[ectory] <inode number> [<fileset>]
d[isplay] [<block> [<offset> [<format> [<count>]]]]
dm[ap] [<block number>]
dt[ree] <inode number> [<fileset>]
h[elp] [<command>]
ia[g] [<IAG number>] [a | <fileset>]
i[node] [<inode number>] [a | <fileset>]
q[uit]
su[perblock] [p | s]
Exercise 1 - fsdb
Introduction
In this lab you will run the fsdb utility against a JFS2 filesystem that was
created for you. The filesystem should not be mounted when running
fsdb. The filesystem may be mounted to examine the files; just be sure to
unmount it before running fsdb.
Lab steps
Follow the steps in this table:

Step  Action
1     Start fsdb on the logical volume /dev/lv00:
      # fsdb /dev/lv00
      What is the aggregate block size used in this filesystem?
2     Type help to view the fsdb sub-commands. The commands you
      will be using in this lab are: inode, directory and display.
      What inode number represents the fileset root directory inode?
      Display the root inode for the fileset. What command did you use?
      Note: If you want to display the aggregate inodes instead of the
      fileset inodes, append an "a" to the command, i.e.: inode 2 a.
3     Find the inode number of each file in the fileset using the
      directory command followed by the inode number of the root
      directory inode of the fileset. For example:
      > dir 2
      idotdot = 2
      4  fileA
      5  fileB
      6  fileC
      3  lost+found
Using fsdb
In the next few steps you will locate and display fileA's data.

Step  Action
4     Display the inode of fileA. What command did you use?
      Use the inode you displayed to answer the following questions:
      What is the file size of fileA?
      How many disk blocks is fileA's data using?
5     After the inode is displayed a sub-menu of commands is shown.
      Type t to display the root binary tree header. Examine the flags
      in the header; what flags are set?
6     Type <enter> to walk down the xad structures in this node.
      How many xad structures are used for this file?
7     The address field in the xad shows the aggregate block number
      of the first data block of fileA. Use the display command to
      display this block:
      > d 12345
      Did you find fileA's data?
FileB and fileC
Use the commands and techniques you learned in the last section to
examine fileB, fileC and fileD. Answer the following questions about these
files:
1. What inode numbers are used for fileB, fileC and fileD?
2. How many xad structures are used to describe fileB's data blocks?
3. How many xad structures are used to describe fileC's data blocks?
4. Examine the inode for fileD. How big is this file (as shown in di_size)?
   How many aggregate blocks are being used by fileD?
   Are enough aggregate blocks allocated to store the entire file? Explain
   your answer.
Directory
Introduction
In addition to files, an inode can represent a directory. A directory is a
journaled meta-data file in JFS2, composed of directory entries which
indicate the files and sub-directories contained in the directory.
Directory entry
Directory entries are stored in an array; each entry links the name of an
object in the directory to an inode number. A directory entry has the
following members:

Member    Description
inumber   Inode number.
namelen   Length of the name.
name[30]  File name, up to 30 characters.
next      If more than 30 characters are needed, additional
          entries are linked using the next pointer.
Root Header
To improve the performance of locating a specific directory entry, a
binary tree sorted by name is used. As with files, the header section of a
directory inode contains the binary tree root header. Each header
describes an 8-element array of directory entries. The root header is
defined by a dtroot_t structure contained in /usr/include/j2/j2_dtree.h:
typedef union {
    struct {
        ino64_t idotdot;   /* 8: parent inode number */
        int64   rsrvd1;    /* 8: */
        uint8   flag;      /* 1: */
        int8    nextindex; /* 1: next free entry in stbl */
        int8    freecnt;   /* 1: free count */
        int8    freelist;  /* 1: freelist header */
        int32   rsrvd2;    /* 4: */
        int8    stbl[8];   /* 8: sorted entry index table */
    } header;              /* (32) */
    dtslot_t slot[9];
} dtroot_t;

Member     Description
idotdot    Inode number of the parent directory.
flag       Indicates whether the node is an internal or leaf node,
           and whether it is the root of the binary tree.
nextindex  Last used slot in the directory entry slot array.
freecnt    Number of free slots in the directory entry array.
freelist   Slot number of the head of the free list.
stbl[8]    Indices to the directory entry slots that are currently in
           use. The entries are sorted alphabetically by name.
slot[9]    Array of directory entries (8 entries; the header is
           stored in the first slot).

Leaf and internal node header
When more than 8 directory entries are needed, a leaf or internal node is
added. The directory internal and leaf node headers are similar to the
root node header, except that they describe up to 128 directory entries.
The page header is defined by a dpage_t structure contained in
/usr/include/j2/j2_dtree.h.
Directory slot array
The directory slot array (stbl[]) is a sorted array of indices to the directory
slots that are currently in use. The entries are sorted alphabetically by
name. This limits the amount of shifting necessary when directory entries
are added or deleted, since the array is much smaller than the entries
themselves. A binary search can be used on this array to search for
particular directory entries.
In this example the directory entry table contains four files. The stbl table
contains the slot numbers of the entries, ordering the entries alphabetically.
[Figure: a directory entry table with four files - slot 1 "def", slot 2 "abc", slot 3 "xyz", slot 4 "hij" - and the slot array stbl[8] = {2,1,4,3,0,0,0,0}, which lists the slots in alphabetical order (abc, def, hij, xyz).]
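The shifting-indices idea is easy to demonstrate in user space. The sketch below is hypothetical demo code, not the JFS2 implementation: it keeps entries in their slots and maintains a small sorted index table, so an insert shifts only small indices (never whole entries) and a lookup binary-searches the table.

#include <stdio.h>
#include <string.h>

#define NSLOT 8

struct dentry { int inumber; char name[30]; };

static struct dentry slot[NSLOT];
static int stbl[NSLOT];     /* indices into slot[], kept sorted by name */
static int nused;

static void dir_add(int inumber, const char *name)
{
    slot[nused].inumber = inumber;          /* entry goes in the next slot */
    strncpy(slot[nused].name, name, sizeof slot[nused].name - 1);

    int pos = nused;                        /* find its sorted position */
    while (pos > 0 && strcmp(name, slot[stbl[pos - 1]].name) < 0) {
        stbl[pos] = stbl[pos - 1];          /* shift small indices only */
        pos--;
    }
    stbl[pos] = nused++;
}

static int dir_lookup(const char *name)    /* binary search over stbl */
{
    int lo = 0, hi = nused - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        int cmp = strcmp(name, slot[stbl[mid]].name);
        if (cmp == 0) return slot[stbl[mid]].inumber;
        if (cmp < 0) hi = mid - 1; else lo = mid + 1;
    }
    return -1;
}

int main(void)
{
    dir_add(5, "def"); dir_add(6, "abc"); dir_add(7, "xyz"); dir_add(8, "hij");
    printf("hij -> inode %d\n", dir_lookup("hij"));
    return 0;
}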
. and ..
A directory does not contain specific entries for self (“.”) and parent (“..”).
Instead these will be represented in the inode itself. Self is the directory’s
own inode number, and the parent inode number is held in the “idotdot”
field in the header.
Growing directory size
As the number of files in the directory grows, the directory tables must
increase in size. This table describes the steps used:

Step  Action
1     Initial directory entries are stored in the directory inode in-line
      data area.
2     When the in-line data area of the directory inode becomes full,
      JFS2 allocates a leaf node the same size as the aggregate block
      size.
3     When that initial leaf node becomes full and the leaf node is not
      yet 4K, double the current size. First attempt to double the
      extent in place; if there is not room to do this, a new extent must
      be allocated and the data from the old extent copied to it. The
      directory slot array will only have been big enough to reference
      the slots of the smaller page, so a new slot array must be
      created. Use the slots from the beginning of the newly allocated
      space for the larger array, copy the old array data to the new
      location, update the header to point to this array, and add the
      slots for the old array to the free list.
4     If the leaf node again becomes full and is still not 4K, repeat
      step 3. Once the leaf node reaches 4K, allocate a new leaf node.
      Every leaf node after the initial one is allocated as 4K to start.
5     When all entries in a leaf page are free, the page is removed
      from the B+-tree. When all the entries in the last leaf page are
      deleted, the directory shrinks back into the directory inode
      in-line data area.
Directory Examples
Introduction
This section demonstrates how the directory structures change over time.
Small Directories
Initial directory entries are stored in the directory inode in-line data area.
Examine this example of a small directory, in which all the directory
information fits into the in-line data area:
# ls -ai
69651 .
2 ..
69652 foobar1
69653 foobar12
69654 foobar3
69655 longnamedfilewithover22charsinitsname
flag: BT_ROOT BT_LEAF
nextindex: 4
freecnt: 3
freelist: 6
idotdot: 2
stbl: {1,2,3,4,0,0,0}
1
inumber: 69652
next: -1
namelen: 7
name: foobar1
2
inumber: 69653
next: -1
namelen: 8
name: foobar12
3
inumber: 69654
next: -1
namelen: 7
name: foobar2
4
inumber: 69655
next: 5
namelen: 37
name:longnamedfilewithover2
5
next: -1
cnt: 0
name: 2charsinitsname
Note: the file with a long name has its name split across two slots.
Adding a file
An additional file called "afile" is created. The details for this file are added
at the next free slot (slot 6). As this is now, alphabetically, the first file in
the directory, the search table array (stbl[]) is re-organized so that slot 6
is referenced by the first entry.
# ls -ai
69651 .
2 ..
69656 afile
69652 foobar1
69653 foobar2
69654 foobar3
69655 longnamedfilewithover22charsinitsname
flag: BT_ROOT BT_LEAF
nextindex: 5
freecnt: 2
freelist: 7
idotdot: 2
stbl: {6,1,2,3,4,0,0,0}
1
inumber: 69652
next: -1
namelen: 7
name: foobar1
2
inumber: 69653
next: -1
namelen: 8
name: foobar12
3
inumber: 69654
next: -1
namelen: 7
name: foobar2
4
inumber: 69655
next: 5
namelen: 37
name:longnamedfilewithover2
5
next: -1
cnt: 0
name: 2charsinitsname
6
inumber: 69656
next: -1
namelen: 5
name: afile
Adding a leaf node
When the directory grows to the point where there are more entries than
can be stored in the in-line data area of the inode, JFS2 allocates a leaf
node the same size as the aggregate block size. The in-line entries are
moved to the leaf node as illustrated.
[Figure: the in-line entries moved to a leaf node in block 52. The inode's root header now reads flag: BT_ROOT BT_INTERNAL, nextindex: 1, freecnt: 7, freelist: 2, idotdot: 2, stbl: {1,2,3,4,5,6,7,8}, and its first slot is a router entry (xd.len: 1, xd.addr2: 52, namelen: 0, name: file0) addressing the leaf node. The leaf node header reads flag: BT_LEAF, nextindex: 20, freecnt: 103, freelist: 25, maxslot: 128, stbl: {1,2,15, ... 8,13,14}, and its slots hold the 20 directory entries file0 through file19.]
Once the leaf is full, an internal node is added at the next free in-line data
slot in the inode, which will contain the address of the next leaf node.
Note: the internal node entry contains the name of the first file (in
alphabetical order) for that leaf node.
Adding an internal node
Once all the in-line slots have been filled by internal node entries, a
separate node block is allocated, the entries from the in-line data slots are
moved to this new node, and the first in-line data slot is updated with the
address of the new internal node.
[Figure: a two-level tree of internal nodes. The inode's root header (flag: BT_ROOT BT_INTERNAL, nextindex: 4) holds router entries addressing second-level internal node blocks (e.g. blocks 118, 1204, 1991 and 2609); each internal node (flag: BT_INTERNAL, maxslot: 128) holds router entries naming the alphabetically first file of each leaf node it references (e.g. file0 in block 52); the leaf nodes (flag: BT_LEAF) hold the directory entries themselves.]
After many extra files have been added to the directory, two layers of
internal nodes are required to reference all the files.
Note that the internal node entries in the inode now contain the name of
the alphabetically first entry referenced by each of the second-level
internal nodes, and each entry in those nodes references the name of the
alphabetically first entry in each leaf node.
Exercise 2 - Directories
Introduction
In this exercise you will use the fsdb utility to examine directory inodes in
a JFS2 filesystem.
Small directories
Run fsdb on the sample filesystem. Use the following steps to examine
the directory node for /mnt/small.
Step  Action
1     Find the inode for directory small:
      > dir 2
2     Display the inode found in the last step:
      > i <inode number>
3     Using the t sub-command, display the directory node root header.
      Is this header a root, internal or leaf header?
4     Type <enter> to display the directory entries. Repeat <enter>
      until all the entries are displayed.
      How many files are in the directory?
5     Examine the directory slot array stbl[] (displayed in the header).
      What file name is associated with the first slot entry?
6     Exit fsdb and mount the filesystem:
      # mount /mnt
7     Create the file /mnt/small/a:
      # touch /mnt/small/a
      Predict what the stbl[] table for directory small will look like now.
8     Un-mount the filesystem, run fsdb and check your prediction.
Larger directories
In this section you will examine the directory node structures for some
larger directories.

Step  Action
1     What is the inode for the directory called medium?
2     Display the inode and look at the root tree header. The flags
      should indicate that this is an internal header. One entry should
      be found for each leaf node. Display the entries with the <enter>
      key. How many leaf nodes are there?
3     Use the down sub-command to display the first leaf node header.
      How many entries is this header currently describing?
      What is the maximum number of entries (files) that can be
      described by a single leaf node?
4     Examine the big directory and answer the following questions:
      How many internal and leaf nodes are in big?
      How many files are in big?
Unit 13. Logical and Virtual File Systems
Objectives
After completing this unit, you should be able to:
• Identify the various components that make up the logical and virtual file systems
• Use the debugger (kdb/iadb) to display these components
References
General File System Interface
Introduction
This lesson covers the interface and services that AIX 5L provides to
physical file systems. The Logical File System (LFS), the Virtual File
System (VFS) and the interface between these components and the
physical file systems are discussed in this lesson.
Supported file systems
Using the structure of the logical file system and the virtual file system,
AIX 5L can support a number of different file system types transparently
to application programs. These file systems reside below the LFS/VFS and
operate relatively independently of each other. Currently AIX 5L supports
the following physical filesystem implementations:
• Enhanced Journaled File System (JFS2)
• Journaled File System (JFS)
• Network File System (NFS)
• A CD-ROM file system which supports the ISO-9660, High Sierra and
  Rock Ridge formats
Extensible
The LFS/VFS interface also provides a relatively easy means by which
third party filesystem types can be added without any changes to the LFS.
Hierarchy
Access to files and directories by a process is controlled by the various
layers in the AIX 5L kernel, as illustrated here:
• System calls
• Logical File System (LFS)
• Virtual File System (VFS)
• File System Implementation - support for the individual file system
  layout
• Fault Handler - device page fault handler support in the VMM
• Device Driver - the actual device driver code to interface with the
  device. It is invoked by the page fault handler when the file system
  implementation code maps the opened file to kernel memory and
  reads the mapped memory. LVM is the device driver for the J2 and
  Journaled filesystems.
• Device
Internal data structures
This illustration shows the major data structures that will be discussed in
this lesson. This illustration is repeated throughout the lesson
highlighting the areas being discussed.
[Figure: the major data structures. The user file descriptor table in the u-block points into the system file table; file table entries point to vnodes; each vnode links to a gnode (embedded in the file system's inode) and to its vfs; the vfs links to the gfs and vmount structures, and the gfs points to the vnodeops and vfsops operation tables. The descriptor and file tables belong to the Logical File System, the vnode/vfs structures to the Virtual File System (vnode-VFS interface), and the inode to the file system implementation.]
Logical File System
Overview
The Logical File System (LFS) provides a consistent programming
interface to applications via the system call interface, with calls such as
open(), close(), read() and write(). The LFS breaks down each
system call into requests for the underlying file system implementations.
LFS Data Structures
The data structures discussed in this section are the System Open File
Table and the User File Descriptor Table. The system open file table has
one entry for each open file on the system. The user file descriptor table
(one per process) contains entries for each of the process's open files.
[Figure: the LFS data structures highlighted - the user file descriptor table and the system file table.]

Operations
The LFS provides a standard set of operations to support the system call
interface; its routines manage the open file table entries and the
per-process file descriptors. It provides:
• the User File Descriptor Table
• the System File Table; an open file table entry records the authorization
  of a process's access to a file system object
The LFS abstraction specifies the set of file system operations that an
implementation must include in order to carry out logical file system
requests. Physical file systems can differ in how they implement these
predefined operations, but they must present a uniform interface to the
LFS. It supports UNIX-like file system access semantics, but other
non-UNIX file systems can also be supported.
User interface
A user can refer to an open file table entry through a file descriptor held in
the thread’s ublock, or by accessing the virtual memory to which the file
was mapped. The file descriptor table entry is created when the file is
initially opened, via the open() system call and will remain until either the
user closes the file via the close() system call, or the process terminates.
The LFS is the level of the file system at which users can request file
operations by using system calls, such as open(), close(), read(), write()
etc. For all these calls (except open()), the file descriptor number is passed
as an argument to the call. The system calls implement services that are
exported to users, and provide a consistent user mode programming
interface to the LFS that is independent of the underlying file system type.
System calls that carry out file system requests:
• Map the user’s parameters to a file system object. This requires that the
system call component use the vnode (virtual node) component to
follow the object’s path name. In addition, the system call must resolve
a file descriptor or establish implicit (mapped) references using the open
file component.
• Verify that a requested operation is applicable to the type of the
specified object.
• Dispatch a request to the file system implementation to perform
operations.
User File Descriptor Table
Description
The user file descriptor table is contained in the user area and is a
per-process resource. Each entry references an open file, device, or socket
from the process's perspective. The index into the table for a specific file
is the value returned by the open() system call when the file is opened -
the file descriptor.
Table Management
One or more slots of the file descriptor table are used for each open file.
The file descriptor table can extend beyond the first page of the ublock,
and is pageable. There is a fixed upper limit of 32768 open file descriptors
per process (defined as OPEN_MAX in /usr/include/sys/limits.h). This value
is fixed and may not be changed.
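A program can query this limit portably with sysconf() (a standard POSIX call; on AIX 5L it reports the OPEN_MAX value described above):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* maximum number of open file descriptors for this process */
    printf("OPEN_MAX = %ld\n", sysconf(_SC_OPEN_MAX));
    return 0;
}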
User File Descriptor Table structure
The user file descriptor table consists of an array of user file descriptor
structures, defined in /usr/include/sys/user.h as struct ufd:

struct ufd {
    struct file *  fp;
    unsigned short flags;
    unsigned short count;
#ifdef __64BIT_KERNEL
    unsigned int   reserved;
#endif /* __64BIT_KERNEL */
};
System File Table
Description
The system file table is a global resource, and is shared by all processes
on the system. One unique entry is allocated for each unique open of a file,
device, or socket in the system.
Table
Management
The table is a large array, and is partly initialized. It grows on demand, and
is never shrunk. Once entries are freed, they are added back onto the free
list (ffreelist). The table can contain a maximum of 1,000,000 entries, and
is not configurable.
Table entries
The file table array consists of struct file data elements. Several of the
key members of this data structure are described in this table:

Member    Description
f_count   A reference count detailing the current number of opens on
          the file. This value is incremented each time the file is
          opened, and decremented on each close(). Once the
          reference count is zero, the slot is considered free and may
          be re-used.
f_flag    Various flags, described in fcntl.h.
f_type    A type field describing the type of file:
          /* f_type values */
          #define DTYPE_VNODE  1  /* file */
          #define DTYPE_SOCKET 2  /* communications endpoint */
          #define DTYPE_GNODE  3  /* device */
          #define DTYPE_OTHER -1  /* unknown */
f_offset  A read/write pointer.
f_data    Defined as f_up.f_uvnode, it is a pointer to another data
          structure representing the object, typically the vnode
          structure.
f_ops     A structure containing pointers to functions for the following
          file operations: rw (read/write), ioctl, select, close, fstat.
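The division of labor between the per-process descriptor table and the global file table is visible from user space: dup() copies a descriptor but both descriptors share one file table entry (and therefore one f_offset), while a second open() of the same file allocates an independent entry. A small demonstration using only standard calls:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd1 = open("/etc/hosts", O_RDONLY);
    int fd2 = dup(fd1);                     /* same system file table entry */
    int fd3 = open("/etc/hosts", O_RDONLY); /* a second, independent entry */
    char buf[16];

    (void)read(fd1, buf, sizeof buf);       /* advances the shared f_offset */
    printf("fd2 offset: %ld (shared with fd1)\n", (long)lseek(fd2, 0, SEEK_CUR));
    printf("fd3 offset: %ld (its own entry)\n", (long)lseek(fd3, 0, SEEK_CUR));
    close(fd1); close(fd2); close(fd3);
    return 0;
}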
file structure
The file table structure is described in /usr/include/sys/file.h:

struct file {
    long    f_flag;      /* see fcntl.h */
    int     f_count;     /* reference count */
    short   f_options;   /* file flags not passed through vnode layer */
    short   f_type;      /* descriptor type */
    union {
        struct vnode *f_uvnode;  /* pointer to vnode structure */
        struct file  *f_unext;   /* next entry in freelist */
    } f_up;
    offset_t f_offset;   /* read/write character pointer */
    off_t    f_dir_off;  /* BSD style directory offsets */
    union {
        struct ucred *f_cpcred;    /* process credentials at open() */
        struct file  *f_cpqmnext;  /* next quick move chunk on free list */
    } f_cp;
    Simple_lock f_lock;        /* file structure fields lock */
    Simple_lock f_offset_lock; /* file structure offset field lock */
    caddr_t f_vinfo;           /* any info vfs needs */
    struct fileops {
        int (*fo_rw)();
        int (*fo_ioctl)();
        int (*fo_select)();
        int (*fo_close)();
        int (*fo_fstat)();
    } *f_ops;
};
Virtual File System
Overview
The Virtual File System (VFS) defines a standard set of operations on an
entire file system. Operations performed by a process on a file or file
system are mapped through the VFS to the file system below. In this way,
the process need not know the specifics of different file systems (such as
JFS, J2, NFS or CDROM).
Data Structures
The data structures within a virtual file system are:
• vnode - one per file
• gfs - one per filesystem type kernel extension
• vnodeops - one per filesystem type kernel extension
• vfsops - one per filesystem type kernel extension
• vfs - one per mounted filesystem
• vmount - one per mounted filesystem
[Figure: the VFS data structures highlighted - vnode, gnode, vnodeops, vfs, gfs, vmount and vfsops.]

Functional sections
For the purposes of this lesson the VFS will be broken into three sections
and described separately. These sections are:
• Vnode-VFS interface
• File and File System Operations
• The gnode
Vnode/vfs interface
Overview
The interface between the logical file system and the underlying file
system implementations is referred to as the vnode/vfs interface. This
interface provides a logical boundary between generic objects understood
at the LFS layer and the file system specific objects that the underlying file
system implementation must manage such as inodes and super blocks.
The LFS is relatively unaware of the underlying file system data structures
since they can be radically different for the various file system types.
Data Structures
Vnode and vfs structures are the primary data structures used to
communicate through the interface (with help from the vmount):
• vnode - represents a file
• vfs - represents a mounted file system
• vmount - contains the specifics of the mount request
[Figure: the vnode/vfs interface structures highlighted - the vnode and vfs structures, with the vmount record attached to the vfs.]

History
The vnode and vfs structures of the LFS were created by Sun
Microsystems and have evolved into a de-facto industry standard, thanks
in part to NFS.
Vnodes
Overview
The vnode provides a standard set of operations within the file system,
and provides system calls with a mechanism for local name resolution.
This allows the logical file system to access multiple file system
implementations through a uniform name space.
Detail
Vnodes are the primary handles by which the operating system references
files, and represent access to an object within a virtual file system. Each
time an object (file) within a file system is located (even if it is not opened),
a vnode for that object is located (if already in existence), or created, as
are the vnodes for any directory that has to be searched to resolve the
path to the object.
As a file is created, a vnode is also created, and will be re-used for every
subsequent reference made to the file by a path name. Every path name
known to the logical file system can be associated with, at most, one file
system object, and each file system object can have several names
because it can be mounted in different locations. Symbolic links and hard
links to an object always get the same vnode if accessed through the same
mount point.
vnode Management
Vnodes are created by the vfs-specific code when needed, using the
vn_get kernel service, and are deleted with the vn_free kernel service.
Vnodes are created as the result of a path resolution.
structure
The vnode structure is defined in /usr/include/sys/vnode.h:
struct vnode {
    ushort       v_flag;
    ulong32int64 v_count;     /* the use count of this vnode */
    int          v_vfsgen;    /* generation number for the vfs */
    Simple_lock  v_lock;      /* lock on the structure */
    struct vfs   *v_vfsp;     /* pointer to the vfs of this vnode */
    struct vfs   *v_mvfsp;    /* pointer to vfs which was mounted over
                               * this vnode; NULL if not mounted */
    struct gnode *v_gnode;    /* ptr to implementation gnode */
    struct vnode *v_next;     /* ptr to other vnodes that share same gnode */
    struct vnode *v_vfsnext;  /* ptr to next vnode on list off of vfs */
    struct vnode *v_vfsprev;  /* ptr to prev vnode on list off of vfs */
    union v_data {
        void *       _v_socket;    /* vnode associated data */
        struct vnode *_v_pfsvnode; /* vnode in pfs for spec */
    } _v_data;
    char *v_audit;            /* ptr to audit object */
};
vfs and vmount
Description
When a new file system is mounted, vfs and vmount structures are
created. The vmount structure contains the specifics of the mount
request, such as the object being mounted and the stub over which it is
being mounted. The vfs structure is the connecting structure which links
the vnodes (representing files) with the vmount information, and with the
gfs structure that helps define the operations that can be performed on
the filesystem and its files.
vfs
The vfs structure links the vnodes (representing files) with the vmount
information, and the gfs structure which provides a path to the operations
that can be performed on the filesystem and its files.
Element       Description
*vfs_next     vfs structures form a linked list, with the first vfs entry
              addressed by the rootvfs variable, which is private to the
              kernel.
*vfs_gfs      Path back to the gfs structure and its file system specific
              subroutines through the vfs_gfs pointer.
vfs_mntd      The vfs_mntd pointer points to the vnode within the file
              system which generally represents the root directory of
              the file system.
vfs_mntdover  The vfs_mntdover pointer points to a vnode within another
              file system, also usually representing a directory, which
              indicates where the file system is mounted. In this sense,
              the vfs_mntd pointer corresponds to the object within the
              vmount structure referenced by the vfs_mdata pointer, and
              the vfs_mntdover pointer corresponds to the stub within
              the vmount structure referenced by the vfs_mdata pointer.
vfs_nodes     Pointer to all vnodes for this file system.
vfs_mdata     Pointer to the vmount providing mount information for
              this filesystem.
vfs structure
The vfs structure is defined in /usr/include/sys/vfs.h:

struct vfs {
    struct vfs     *vfs_next;     /* vfs's are a linked list */
    struct gfs     *vfs_gfs;      /* ptr to gfs of vfs */
    struct vnode   *vfs_mntd;     /* pointer to mounted vnode */
    struct vnode   *vfs_mntdover; /* pointer to mounted-over vnode */
    struct vnode   *vfs_vnodes;   /* all vnodes in this vfs */
    int            vfs_count;     /* number of users of this vfs */
    caddr_t        vfs_data;      /* private data area pointer */
    unsigned int   vfs_number;    /* serial number to help distinguish between
                                   * different mounts of the same object */
    int            vfs_bsize;     /* native block size */
    short          vfs_rsvd1;     /* Reserved */
    unsigned short vfs_rsvd2;     /* Reserved */
    struct vmount  *vfs_mdata;    /* record of mount arguments */
    Simple_lock    vfs_lock;      /* lock to serialize vnode list */
};
vfs Management
The mount helper creates the vmount structure, and calls the vmount
subroutine. The vmount subroutine then creates the vfs structure, partially
populates it, and invokes the file system dependent vfs_mount subroutine
which completes the vfs structure, and performs any operations required
internally by the particular file system implementation.
There is one vfs structure for each file system currently mounted. New vfs
structures are created with the vmount subroutine. This subroutine calls
the vfs_mount subroutine found within the vfsops structure for the
particular virtual file system type. The vfs entries are removed with the
uvmount subroutine. This subroutine calls the vfs_umount subroutine from
the vfsops structure for the virtual file system type.
vmount
The vmount structure contains specifics of the mount request. The vmount
structure is defined in /usr/include/sys/vmount.h
struct vmount {
    uint    vmt_revision;  /* I: revision level, currently 1 */
    uint    vmt_length;    /* I: total length of structure & data */
    fsid_t  vmt_fsid;      /* O: id of file system */
    int     vmt_vfsnumber; /* O: unique mount id of file system */
    uint    vmt_time;      /* O: time of mount */
    uint    vmt_timepad;   /* O: (in future, time is 2 longs) */
    int     vmt_flags;     /* I: general mount flags */
                           /* O: MNT_REMOTE is output only */
    int     vmt_gfstype;   /* I: type of gfs, see MNT_XXX above */
    struct vmt_data {
        short vmt_off;     /* I: offset of data, word aligned */
        short vmt_size;    /* I: actual size of data in bytes */
    } vmt_data[VMT_LASTINDEX + 1];
};
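User programs can retrieve these vmount records with the mntctl() system call (a standard AIX interface declared in sys/mntctl.h). The sketch below lists the vfs number and gfs type of each mounted file system; error handling is omitted for brevity:

#include <stdio.h>
#include <stdlib.h>
#include <sys/mntctl.h>
#include <sys/vmount.h>

int main(void)
{
    int size, n;
    char *buf, *p;

    /* first call with a too-small buffer fills in the required size */
    mntctl(MCTL_QUERY, sizeof size, (char *)&size);
    buf = malloc(size);
    n = mntctl(MCTL_QUERY, size, buf);   /* n = number of mounted fs's */

    p = buf;
    for (int i = 0; i < n; i++) {
        struct vmount *vmt = (struct vmount *)p;
        printf("vfs number %d, gfs type %d\n",
               vmt->vmt_vfsnumber, vmt->vmt_gfstype);
        p += vmt->vmt_length;            /* entries vary in size */
    }
    free(buf);
    return 0;
}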
File and Filesystem Operations
Overview
Each file system type extension provides functions to perform operations
on the filesystem and its files. Pointers to these functions are stored in the
vfsops (filesystem operations) and vnodeops (file operations) structures.
Data Structures
The data structures discussed in this section are:
• gfs - holds pointers to the vnodeops and vfsops structures
• vnodeops - contains pointers to filesystem dependent operations on
  files (open, close, read, write...)
• vfsops - contains pointers to filesystem dependent operations on the
  filesystem (mount, umount...)
[Figure: the operation tables highlighted - the gfs structure pointing to the vnodeops and vfsops structures.]
gfs
Description
There is one gfs structure for each type of virtual file system currently
installed on the machine. For each gfs entry, there may be any number of
vfs entries.
Purpose
The operating system uses the gfs entries as an access point to the virtual
file system functions on a type-by-type basis. There is no direct link from a
gfs entry to all of the vfs entries of a particular gfs type. The file system
code generally uses the gfs structure as a pointer to the vnodeops
structure and the vfsops structure for a particular type of file system.
gfs management
The gfs structures are stored within a global array accessible only by the
kernel. The gfs entries are inserted with the gfsadd() kernel service, and
only one gfs entry of a given gfs_type can be inserted into the array.
Generally, gfs entries are added by the CFG_INIT section of the
configuration code of the file system kernel extension. The gfs entries are
removed with the gfsdel() kernel service; this is usually done within the
CFG_TERM section of the configuration code of the file system kernel
extension.
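Putting the pieces together, a file system kernel extension's configuration code might register its gfs roughly as follows. This is an illustrative sketch only: the gfs type number, names and entry points are invented, error handling is omitted, and a real extension has considerably more to do. The gfsadd()/gfsdel() services are the ones described above:

#include <sys/types.h>
#include <sys/gfs.h>

#define MYFS_GFS_TYPE 16            /* hypothetical gfs type number */

extern struct vfsops   myfs_vfsops; /* mount, unmount, root, ... */
extern struct vnodeops myfs_vnops;  /* open, close, lookup, ... */

static struct gfs myfs_gfs = {
    &myfs_vfsops,                   /* gfs_ops */
    &myfs_vnops,                    /* gn_ops */
    MYFS_GFS_TYPE,                  /* gfs_type */
    "myfs",                         /* gfs_name */
    /* remaining fields left zeroed */
};

int myfs_cfg_init(void)             /* CFG_INIT: register the file system */
{
    return gfsadd(MYFS_GFS_TYPE, &myfs_gfs);
}

int myfs_cfg_term(void)             /* CFG_TERM: deregister it */
{
    return gfsdel(MYFS_GFS_TYPE);
}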
gfs structure
The gfs structure is defined in /usr/include/sys/gfs.h:

struct gfs {
    struct vfsops   *gfs_ops;
    struct vnodeops *gn_ops;
    int     gfs_type;       /* type of gfs (from vmount.h) */
    char    gfs_name[16];   /* name of vfs (eg. "jfs","nfs") */
    int     (*gfs_init)();  /* ( gfsp ) - if ! NULL, called once to init gfs */
    int     gfs_flags;      /* flags for gfs capabilities */
    caddr_t gfs_data;       /* gfs private config data */
    int     (*gfs_rinit)();
    int     gfs_hold;       /* count of mounts */
};
vnodeops
Description
The vnodeops structure contains pointers to the filesystem dependent
operations that can be performed on the vnode, such as link, mkdir,
mknod, open, close and remove.
vnodeops management
There is one vnodeops structure per filesystem kernel extension loaded
(i.e. one per unique filesystem type); it is initialized when the extension
is loaded.
vnodeops structure
This structure is defined in /usr/include/sys/vnode.h. Due to the size of this
structure, only a few lines are detailed below:
struct vnodeops {
    /* creation/naming/deletion */
    int (*vn_link)(struct vnode *, struct vnode *, char *, struct ucred *);
    int (*vn_mkdir)(struct vnode *, char *, int32long64_t, struct ucred *);
    int (*vn_mknod)(struct vnode *, caddr_t, int32long64_t, dev_t,
                    struct ucred *);
    int (*vn_remove)(struct vnode *, struct vnode *, char *, struct ucred *);
    int (*vn_rename)(struct vnode *, struct vnode *, caddr_t, struct vnode *,
                     struct vnode *, caddr_t, struct ucred *);
    int (*vn_rmdir)(struct vnode *, struct vnode *, char *, struct ucred *);

    /* lookup, file handle stuff */
    int (*vn_lookup)(struct vnode *, struct vnode **, char *, int32long64_t,
                     struct vattr *, struct ucred *);
    int (*vn_fid)(struct vnode *, struct fileid *, struct ucred *);

    /* access to files */
    int (*vn_open)(struct vnode *, int32long64_t, ext_t, caddr_t *,
                   struct ucred *);
    int (*vn_create)(struct vnode *, struct vnode **, int32long64_t, caddr_t,
                     int32long64_t, caddr_t *, struct ucred *);
    /* ... remaining operations omitted ... */
};
vfsops
Description
The vfsops structure contains pointers to the filesystem dependent
operations that can be performed on a vfs, such as mount, unmount or
sync.
vfsops management
There is one vfsops structure per filesystem kernel extension loaded (i.e.
one per unique filesystem type), and it is initialized when the extension is
loaded.
vfsops structure
This structure is defined in /usr/include/sys/vfs.h:
struct vfsops {
/* mount a file system */
int (*vfs_mount)(struct vfs *, struct ucred *);
/* unmount a file system */
int (*vfs_unmount)(struct vfs *, int, struct ucred *);
/* get the root vnode of a file system */
int (*vfs_root)(struct vfs *, struct vnode **,
struct ucred *);
/* get file system information */
int (*vfs_statfs)(struct vfs *, struct statfs *,
struct ucred *);
/* sync all file systems of this type */
int (*vfs_sync)();
/* get a vnode matching a file id */
int (*vfs_vget)(struct vfs *, struct vnode **, struct fileid *,
struct ucred *);
/* do specified command to file system */
int (*vfs_cntl)(struct vfs *, int, caddr_t, size_t,
struct ucred *);
/* manage file system quotas */
int (*vfs_quotactl)(struct vfs *, int, uid_t, caddr_t,
struct ucred *);
};
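The logical file system dispatches vfs level requests in the same way,
through the gfs attached to the vfs. A simplified sketch (the helper name is
hypothetical; vfs_gfs is the member of struct vfs that links a vfs entry to
its gfs):

    /* Sketch: dispatching a mount request through the per-type vfsops. */
    static int
    dispatch_mount(struct vfs *vfsp, struct ucred *crp)
    {
            struct gfs *gfsp = vfsp->vfs_gfs;      /* gfs for this fs type */

            return ((*gfsp->gfs_ops->vfs_mount)(vfsp, crp));
    }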
The Gnode
Introduction
A gnode represents an object in a file system implementation and serves as
the interface between the logical file system and the file system
implementation. There is a one-to-one correspondence between a gnode
and an object in a file system implementation.
Overview
Each filesystem implementation is responsible for allocating and
destroying gnodes. Calls to the file system implementation serve as
requests to perform an operation on a specific gnode. A gnode is needed,
in addition to the file system inode, because some file system
implementations may not include the concept of an inode. Thus the gnode
structure substitutes for whatever structure the file system implementation
may have used to uniquely identify a file system object. The logical file
system relies on the file system implementation to provide valid data for
the following fields in the gnode:
• gn_type Identifies the type of object represented by the gnode.
• gn_ops Identifies the set of operations that can be performed on the
object.
Creation
A gnode refers directly to a file (regular, directory, special, and so on) and
is usually embedded within a file system implementation specific structure
(such as an inode). Gnodes are created as needed by file system specific
code, at the same time as the implementation specific structures. This is
normally immediately followed by a call to the vn_get kernel service to
create a matching vnode. The gnode structure is usually deleted either
when the file it refers to is deleted, or when the implementation specific
structure is being reused for another file.
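As an illustration, file system specific code might pair a freshly initialized
gnode with its vnode roughly as follows. The myfs names are hypothetical;
vn_get and the gnode fields are the ones described in this unit, and the
exact vn_get() calling convention should be checked against <sys/vnode.h>.

    /* Sketch: a gnode embedded in an in-core inode, paired with a vnode. */
    struct myfs_inode {
            struct gnode mi_gnode;                 /* embedded gnode       */
            /* ... implementation specific fields ... */
    };

    extern struct vnodeops myfs_vnodeops;

    int
    myfs_make_vnode(struct vfs *vfsp, struct myfs_inode *ip,
                    struct vnode **vpp)
    {
            struct gnode *gnp = &ip->mi_gnode;

            gnp->gn_type = VREG;                   /* a regular file       */
            gnp->gn_ops  = &myfs_vnodeops;         /* ops for this fs type */
            gnp->gn_data = (caddr_t)ip;            /* back to the inode    */

            return (vn_get(vfsp, gnp, vpp));       /* create the vnode     */
    }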
gnode and inode
The gnode is typically embedded in an in-core inode. The member
gnode->gn_data points to the start of the inode.
[Diagram: the gnode embedded within an in-core inode, with
gnode->gn_data pointing back to the start of the inode.]
Structure
The gnode structure is defined in /usr/include/sys/vnode.h:
struct gnode {
        enum vtype      gn_type;        /* type of object: VDIR,VREG etc */
        short           gn_flags;       /* attributes of object          */
        ulong           gn_seg;         /* segment into which file is mapped */
        long32int64     gn_mwrcnt;      /* count of map for write        */
        long32int64     gn_mrdcnt;      /* count of map for read         */
        long32int64     gn_rdcnt;       /* total opens for read          */
        long32int64     gn_wrcnt;       /* total opens for write         */
        long32int64     gn_excnt;       /* total opens for exec          */
        long32int64     gn_rshcnt;      /* total opens for read share    */
        struct vnodeops *gn_ops;
        struct vnode    *gn_vnode;      /* ptr to list of vnodes per this gnode */
        dev_t           gn_rdev;        /* for devices, their "dev_t"    */
        chan_t          gn_chan;        /* for devices, their "chan", minor's minor */
        Simple_lock     gn_reclk_lock;  /* lock for filocks list         */
        int             gn_reclk_event; /* event list for file locking   */
        struct filock   *gn_filocks;    /* locked region list            */
        caddr_t         gn_data;        /* ptr to private data (usually contiguous) */
};
Exercise 1
Overview
This exercise will test your knowledge of the data structures of the LFS and
VFS and the relationships between them.
Lab
Use the following list of terms to best complete the statements below:
File, vfs, File system, vnodeops, System File Table, vmount
1. A vnode represents a ______________.
2. A vfs represents a _____________.
3. The gfs contains pointers to the vfsops and the _____________.
4. The ___________ structure contains specifics about a mount request.
5. The ____________ has one entry for each open file on the system.
Answer the following two questions by completing this diagram as
directed.
[Diagram: the u-block with its User File Descriptor Table, the System File
Table, and blocks for the vfs, vnodeops, vfsops, inode and gnode
structures, arranged in three layers: Logical File System, Virtual File
System (Vnode-VFS Interface), and File System.]
6. Label the blocks representing the vnode, vmount and gfs structures
7. Draw a line representing the file pointer in the ufd to an entry in the
system file table.
Lab Exercise 1
Overview
In the following exercise you will run a small C program that opens a file,
initializes it by writing a few bytes to it, then pauses. The pause allows us
to investigate the various LFS structures that are created by opening the
file, using the appropriate system debugger.
The program
The C code for the example is:
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
        int fd;

        /* create the file and commit some data to it */
        fd = open("foo", O_RDWR | O_CREAT, 0644);
        write(fd, "abcd", 4);
        close(fd);

        /* reopen it read-only and hold it open */
        fd = open("foo", O_RDONLY);
        printf("fd = %d\n", fd);
        pause();
        return (0);
}
The close() then open() is required to ensure that the write is committed to
disk, and hence that the inode is updated.
Save this code to a file called t.c, and compile it using "make t".
Lab
Follow the steps in the table below.
Stage 1  Enter the C program from above, save it to a file called t.c,
         and compile with the command:
         $ make t
Stage 2  Execute the program created in the last step. It prints the
         file descriptor number of the file it creates, then pauses.
         $ ./t
         fd = 3
Stage 3  From another shell on the same system, enter the system
         debugger (kdb or iadb).
Stage 4  Initially, we need to find the address of the file structure for
         the open file. We know that the file descriptor for our
         program is number 3, so we have to find the mapping
         between the file descriptor number and the file structure.
         This mapping is done from the file descriptor table in the
         uarea structure for the process. To find the uarea, find the
         slot number in the thread table that our "t" process occupies;
         the uarea slot number will be the same.
         For kdb, use the "th *" command to display all the threads,
         and page down through the entries until you find the correct
         entry:
         (0)> th *
                          SLOT NAME     STATE  TID    PRI RQ CPUID CL WCHAN
         pvthread+000000     0 swapper  SLEEP  000003 010 1        0
         ...
         pvthread+001D00    55 t        SLEEP  003A39 03C 1        0
         ...
Stage 5  Now use the command "uarea" on this thread slot number,
         to view the user area (which contains the file descriptor
         table), and page down through the output until you find the
         "File descriptor table":
         (0)> u 55
         File descriptor table at..F00000002FF3CEC0:
         fd  0 fp..F100009600007430 count..00000000 flags. ALLOCATED
         fd  1 fp..F100009600007430 count..00000000 flags. ALLOCATED
         fd  2 fp..F100009600007430 count..00000000 flags. ALLOCATED
         fd  3 fp..F100009600007700 count..00000000 flags. ALLOCATED
         Rest of File Descriptor Table empty or paged out.
         ...
Stage 6  The file structure for file descriptor 3 is at address
         F100009600007700. Use the "file" command along with this
         address to display the contents of the structure:
         (0)> file F100009600007700
         ADDR              COUNT  OFFSET  DATA              TYPE   FLAGS
         F100009600007700  ..     ..      F10000971528A380  VNODE  READ
           node slot ......
           f_flag......... ..       f_count........ ..
           f_options...... ..       f_type......... ..
           f_data......... ..       f_offset....... ..
           f_dir_off...... ..       f_cred......... ..
           f_lock@........ ..       f_lock......... ..
           f_offset_lock@. ..       f_offset_lock.. ..
           f_vinfo........ ..       f_ops.......... vnodefops
         VNODE             F10000971528A380
           v_flag......... ..       v_count........ ..
           v_vfsgen....... ..       v_vfsp......... ..
           v_lock@........ ..       v_lock......... ..
           v_mvfsp........ ..       v_gnode........ F10000971528A3F8
           v_next......... ..       v_vfsnext...... ..
           v_vfsprev...... ..       v_pfsvnode..... ..
           v_audit........ ..
Stage 7  Note that half way down the output, the address of the
         vnode structure that corresponds to this file is printed,
         followed by the contents of this vnode structure.
         (We could also display the vnode structure separately by
         running the kdb command "vnode" with the address
         F10000971528A380.)
Stage 8  There are two items that we are interested in from the vnode
         structure displayed in the last step: the v_vfsp address,
         which points to the filesystem that contains the vnode, and
         the v_gnode address, which points to the gnode structure
         for the file. From the gnode we can display the inode
         structure for the file.
         Initially, display the gnode, using the kdb command
         "gnode" with the address F10000971528A3F8.
(0)> gnode F10000971528A3F8
GNODE............ F10000971528A3F8 KERN_heap+528A3F8
gn_type....... 00000001 gn_flags...... 00000000
gn_seg........ 00000000000078AD
gn_mwrcnt..... 00000000 gn_mrdcnt..... 00000000 gn_rdcnt...... 00000001
gn_wrcnt...... 00000000 gn_excnt...... 00000000 gn_rshcnt..... 00000000
gn_ops........ 00000000003D7DC8 jfs_vops
gn_vnode...... F10000971528A380 gn_rdev....... 8000000A00000008
gn_chan....... 00000000 gn_reclk_event 00000000FFFFFFFF
gn_reclk_lock@ F10000971528A440 gn_reclk_lock. 0000000000000000
gn_filocks.... 0000000000000000 gn_data....... F10000971528A3D8
gn_type....... REG
Stage 9  The inode address is contained in the gn_data field, in this
         case F10000971528A3D8. Use the kdb command "inode"
         to display this structure:
         (0)> inode F10000971528A3D8
         DEV            NUMBER  CNT  TYPE  FLAGS
         KERN_heap+..   ..      ..   REG   ..
           forw.......... ..       back.......... ..
           next.......... ..       prev.......... ..
           gnode@........ F10000971528A3F8   number........ ..
           dev........... ..       ipmnt......... ..
           flag.. ..  locks.. ..  bigexp.. ..  compress.. ..
           cflag. ..  count.. ..  syncsn.. ..  id........ ..
           moved. ..  frag... ..  openevent..... FFFFFFFFFFFFFFFF
           hip........... ..       nodelock...... ..
           nodelock@..... ..       dquot[USR].... ..
           dquot[GRP].... ..       dinode@....... ..
           cluster. ..  rcluster. ..  diocnt. ..  nondio. ..
           size.......... ..       gets.......... ..
         GNODE............ F10000971528A3F8
           gn_type....... 00000001 gn_flags...... 00000000
           gn_seg........ 00000000000078AD
           gn_mwrcnt..... 00000000 gn_mrdcnt..... 00000000 gn_rdcnt...... 00000001
           gn_wrcnt...... 00000000 gn_excnt...... 00000000 gn_rshcnt..... 00000000
           gn_ops........ 00000000003D7DC8 jfs_vops
           gn_vnode...... F10000971528A380 gn_rdev....... 8000000A00000008
           gn_chan....... 00000000 gn_reclk_event 00000000FFFFFFFF
           gn_reclk_lock@ F10000971528A440 gn_reclk_lock. 0000000000000000
           gn_filocks.... 0000000000000000 gn_data....... F10000971528A3D8
           gn_type....... REG
           di_gen... ..  di_mode..... ..  di_nlink.. ..
           di_acct.. ..  di_uid...... ..  di_gid.... ..
           di_nblocks.... ..       di_acl........ ..
           di_mtime. ..  di_atime.... ..  di_ctime.. ..
           di_size_hi ..  di_size_lo. ..  di_sec.... ..
           di_rdaddr..... ..
           di_vindirect.. ..       di_rindirect.. ..
           di_privoffset. ..  di_privflags.. ..  di_priv... ..
         VNODE............ F10000971528A380
           v_flag........ ..       v_count....... ..
           v_vfsgen...... ..       v_vfsp........ ..
           v_lock@....... ..       v_lock........ ..
           v_mvfsp....... ..       v_gnode....... F10000971528A3F8
           v_next........ ..       v_vfsnext..... ..
           v_vfsprev..... ..       v_pfsvnode.... ..
           v_audit....... ..
Stage 10 The inode command displays the inode, gnode and vnode
         structures.
         The member "number" in the inode structure should contain
         the inode number, in hex, of the file foo. Verify that this inode
         number matches the inode number displayed by the command:
         $ ls -lia foo
         Don't forget to convert the inode number from hex to decimal.
Stage 11 The dev field displays the major and minor number of the
         logical volume for the filesystem. For example:
         64 bit systems: 8000000A00000007 -> major=10 minor=7
         32 bit systems: 000A0007 -> major=10 minor=7
         Verify this number with the command:
         $ ls -lia /dev/<logical volume>
Lab Exercise 2
Overview
The instructor will run a simple shell script that prints its process id, then
pauses.
Both the "ps" command and the process and thread table entries for this
script will simply list the name of the shell that is executing it (e.g. "ksh")
as the program name.
Objective
To determine the name of the script that the instructor is running.
Tips
• Remember that the shell will have to open() the script prior to executing
  it.
• The command find . -inum xxx can be used to find the name of a
  file, given the filesystem name and an inode number. A possible
  sequence is sketched below.
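One possible line of attack, combining the kdb commands from Lab
Exercise 1 (the slot and address values below are placeholders):

    (0)> th *                     # find the slot of the "ksh" thread
    (0)> u <slot>                 # user area: look at the fd table
    (0)> file <fp address>        # file entry for the open script
    (0)> inode <gn_data address>  # note the "number" field (hex)
    $ find <mount point> -inum <inode number converted to decimal>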
Unit 14. AIX 5L boot
Objectives
After completing this unit, you should be able to:
• List and locate boot components and their usage
• Understand the 3 phases of rc.boot
• Understand the contents and usage of a RAMFS
• Understand the ODM structure and the usage of ODM classes
• Create new boot images
• Debug boot problems
What is boot
Definition
It is the process that begins when the computer is powered up and
continues until the entries in the init table have been processed.
ROS process
System ROS (Read Only Storage) contains firmware, independent of the
operating system, that initializes the hardware and loads AIX.
All platforms except RS6K use an intermediate boot process:
• Softros (/usr/lib/boot/aixmon_chrp) for CHRP systems
• Softros (/usr/lib/boot/aixmon_rspc) for RSPC systems
• Boot loader (/usr/lib/boot/boot_elf) for IA-64 systems
AIX process
AIX begins execution after system ROS firmware or the intermediate boot
process finishes its execution:
• firmware information is set up
• the kernel initializes itself
• configuration runs from the RAM filesystem
• control is passed to files based in the permanent filesystem (this may be
  a disk or network filesystem)
• /etc/inittab entries are processed; this usually includes enabling the
  user login process
Various Types of boot
Devices
AIX can boot from the following types of devices:
• hard disk
• CD-ROM
• tape (not supported on the IA-64 platform)
• network
Configuration
The boot process can use one of the following boot configurations:
• standalone
• diskless/dataless (not supported on the IA-64 platform)
• operating system installation/software maintenance
• diagnostics
Hard disk boot
The hard disk boot has the following characteristics:
• the boot image resides on the hard disk
• the RAM filesystem contains the files necessary for configuring the hard
  disk(s), and then accessing the filesystems that reside in the root
  volume group (rootvg)
• this is the most common system configuration
• these types of systems are also known as "standalone" systems
• these types of systems may also be booted into the diagnostics
  functions
CDROM boot
The CDROM boot may be used in the following situations:
• operating system installation
• diagnostics
• hard disk boot failure recovery/maintenance
Tape boot
The tape boot device can be used for:
• operating system installation
• hard disk boot failure recovery/maintenance
The tape boot device is usually used for creating bootable system backups.
It is not supported on the IA-64 platform.
Network boot
The network boot can be used for the following purposes:
• boot and install the operating system: the operating system is installed
  on a hard disk with NIM, and subsequent boots are from the hard disk
• supported diskless/dataless configurations
• diagnostics
• hard disk boot failure recovery/maintenance
Centralized boot/filesystem servers offer convenient administration.
System types and Kernel images
System Types
There are four basic hardware architecture types:
• RS6K - the "classic" IBM workstation
• RSPC - the PowerPC Reference Platform workstation
• CHRP - Common Hardware Reference Platform
• IA-64 - Intel IA-64 Platform
Boot image types
There are three corresponding types of boot images:
• RS6K uses hardware ROS to build the IPL Control Block
• RSPC and CHRP use a SOFTROS to build the IPL Control Block
• IA-64 uses an EFI boot loader to build the IPL Control Block
Kernel types
There are four types of kernels loaded:
• 32-bit Power UP (/unix -> /usr/lib/boot/unix_up)
• 32-bit Power MP (/unix -> /usr/lib/boot/unix_mp)
• 64-bit Power (/unix -> /usr/lib/boot/unix_64)
• 64-bit IA-64 (/unix -> /usr/lib/boot/unix_ia64)
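On a running system, the /unix symbolic link shows which of these kernels
was booted (output abbreviated and illustrative):

    $ ls -l /unix
    /unix -> /usr/lib/boot/unix_mp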
RAMFS and prototype files
Introduction
In order to successfully boot a system, the AIX kernel needs basic
commands, configuration files, kernel extensions and device drivers to be
able to configure a minimum environment.
All the files needed are included in the RAMFS, which is built using the
following command:
mkfs -V jfs -p <proto> <temp_filesystem_file>
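For example, a hypothetical invocation (the proto file name and the output
file are illustrative only):

    # mkfs -V jfs -p /usr/lib/boot/chrp.disk.proto /tmp/ramfs.img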
Prototype file description
A prototype file is a list of files and file descriptions that are needed to
create a RAMFS.
A prototype file entry has the following format:
<dest_file_name> <type> <mode> 0 0 <full_path_name>
Where:
• <dest_file_name> is the name of the file, directory, link or device as it
  will be written to the RAMFS
• <type> defines the type of the entry and can be:
  • d--- : a directory entry (this will change the relative path of the
    following entries)
  • l--- : a link (the target will be listed in the <full_path_name>
    parameter)
  • b--- : a block device (the <full_path_name> parameter will represent
    the major and minor numbers)
  • c--- : a character device (the <full_path_name> parameter will
    represent the major and minor numbers)
  • ---- : a file
• <mode> represents the file permissions in octal format
• <full_path_name> depends on the <type> as described above; a
  hypothetical example fragment follows this list
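A hypothetical fragment in this format (names, modes and device numbers
are illustrative only):

    dev       d--- 755 0 0
    console   c--- 622 0 0 4,0
    hd4       b--- 660 0 0 10,4
    etc       d--- 755 0 0
    sh        l--- 777 0 0 /usr/bin/ksh
    mount     ---- 544 0 0 /usr/sbin/mount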
Prototype file types
Prototype files are divided into several groups according to their specific use:
• Prototype files located in /usr/lib/boot are the base prototypes used for
  a platform according to the boot device type, and come with the
  platform base system device fileset
• Prototype files located in /usr/lib/boot/network are specific to any
  general kind of network boot device, and come with the platform base
  system device fileset
• Prototype files located in /usr/lib/boot/protoext are used for any specific
  type of boot device, and come with the device specific fileset
Boot Image Creation
Introduction
In order to successfully boot from a device, the administrator needs to run
commands that create the boot structure.
bosboot command
The bosboot command is the most commonly used on AIX, because it
manages all verification tasks and environment setup for the administrator.
The administrator can also use the mkboot command, but must then take
care of all these preliminary checks himself.
The bosboot command is also used by other commands, like mksysb, or
by the installp post installation process when installing packages that need
to build a new boot image.
bosboot process overview
The bosboot command does the following:
• sets up the execution environment
• parses command line arguments
• verifies syntax and arguments
• points to platform specific files (like mkboot_chrp or aixmon_rspc)
• checks for space needed in /tmp and in the destination filesystem if
  needed
• creates a RAMFS if requested, using mkfs and proto files
• creates a boot image and a boot record if requested, using the
  appropriate mkboot command
• copies the boot image and savebase to the boot device if requested
• cleans up the execution environment
bosboot parameters
The most commonly used bosboot command is:
# bosboot -a -d /dev/hdisk0
For example, if you need to load and invoke the kernel debugger you can use:
# bosboot -a -I -d /dev/hdisk0
The following table lists the bosboot parameters that can be used:

argument            description
-a                  Create complete boot image and device.
-w file             Copy given boot image file to device.
-r file             Create ROS Emulation boot image.
-d device           Device for which to create the boot image.
-U                  Create uncompressed boot image.
-p proto            Use given proto file for RAM disk file system.
-k kernel           Use given kernel file for boot image.
-l lvdev            Target boot logical volume for boot image.
-b file             Use given file name for boot image name.
-D                  Load Low Level Debugger.
-I                  Load and Invoke Low Level Debugger.
-L                  Enable MP locks instrumentation (MP kernels).
-M norm|serv|both   Boot mode - normal or service.
-O offset           Boot image offset for CDROM file system.
-q                  Query size required for boot image.
AIX 5L Distributions
Introduction
AIX 5L will be delivered in two separate distributions:
• One for Power systems
• One for Intel IA-64 systems
Power CDROM Distributions
The distribution CDROM that IBM provides to customers has three boot
images: one for RS6K computers, a second for RSPC computers, and a
third for CHRP (/ppc/chrp/bootfile.exe). RS6K, RSPC, and CHRP UP
computers can all use the MP kernel, which is the method implemented for
distribution media that goes to customers. In other words, when a customer
receives boot/install media from IBM, there is no need to determine
whether the system is UP or MP; this boot image is created using the MP
kernel. The UP kernel is more efficient for uniprocessor systems, but the
strategy of a single boot image for both hardware platform types lowers
distribution cost, and is more convenient for customers.
IA-64 CDROM Distributions
Checkpoint
Introduction
Take a few minutes to answer the following questions. We will review the
questions as a group when everyone has finished.
Quiz
1. What is the name of the file used as a SOFTROS on CHRP systems?
2. Does IA-64 support a 32-bit kernel?
3. What are the common functions of the ROS, SOFTROS and EFI boot
   loader?
4. List the 4 platforms supported by AIX 5L.
5. What is the purpose of the RAMFS?
6. How do you create a RAMFS?
Instructor Notes
Purpose
Notes on the Quiz and transition to the next section.
Quiz responses
The responses for the Quiz are:
1. What is the name of the file used as a SOFTROS on CHRP systems?
• /usr/lib/boot/aixmon_chrp
2. Does IA-64 support a 32-bit kernel?
• No
3. What are the common functions of the ROS, SOFTROS and EFI boot
   loader?
• create the IPLCB
• load the kernel
4. List the 4 platforms supported by AIX 5L.
• RS6K
• RSPC
• CHRP
• IA-64
5. What is the purpose of a RAMFS?
• To provide basic commands, configuration files, kernel extensions and
  device drivers in order to bring up a minimum environment.
6. How do you create a RAMFS?
• Using mkfs and prototype files.
Transition Statement
Now we will describe:
• the Power specific boot process, if this is a Power course
• the IA-64 specific boot process, if this is an IA-64 course
The Power Boot Mechanism
Introduction
This section explains the boot mechanism used by Power family systems.
Boot overview
When the system is powered on, the ROS or the firmware looks for the
boot record on the device pointed to by the bootlist, to find the boot entry
point.
On RSPC and CHRP, the Softros executes and, if needed, uncompresses
the boot image using the bootexpand program. It then loads the kernel,
which initializes itself.
The kernel then calls init (in fact /usr/lib/boot/ssh at this stage).
The ssh then calls rc.boot for PHASE I and PHASE II, which are specific
to each boot device type.
Then init executes rc.boot PHASE III and the remaining common code in
rc.boot for disk and network boot devices.
Boot diagram
The following diagram represents the high level boot process overview:
1. execution of the system ROS or firmware
2. boot record read from the boot device
3. if this is an rspc or chrp boot: execution of the softros
4. if the boot image is compressed: execution of bootexpand
5. kernel initialization
6. the kernel calls init (/usr/lib/boot/ssh)
7. init (ssh) calls rc.boot PHASE I & II
8. init exits to newroot
9. init calls rc.boot PHASE III from inittab and processes the rest of the
   inittab entries
Power boot disk layout
Boot image overview
The following chart describes a Power boot disk:
[Diagram: the disk starts with the bootrecord, followed by the VGDA, the
boot logical volume hd5, the base customized data, and the rest of the boot
disk. hd5 contains, in order: the softros (chrp and rspc), bootexpand, the
compressed kernel, and the compressed RAM filesystem.]
bootrecord
512 byte block containing size and location of the boot image. The boot
record is the first block on a disk or cdrom and is therefore separated from
the boot image. The boot image on a disk is placed in the boot logical
volume which is a reserved contiguous area.
softros
The RSPC and CHRP platforms use a SOFTROS program (/usr/lib/boot/
aixmon_rspc or /usr/lib/boot/aixmon_chrp) that performs system
initialization for AIX that the hardware firmware in ROS does not provide,
such as appending device information to the IPL control block.
bootexpand
A program to expand the compressed boot image, executed before
control is passed to the kernel. Compression of a boot image is optional,
but it is the default, since the compressed image is less than half the size
of an uncompressed image and requires less time to load from the media.
kernel
The AIX 32-bit UP, 32-bit MP or 64-bit MP kernel, to which control passes
after expansion by bootexpand. The kernel initializes itself and then
passes control to the simple shell init (ssh) in the RAM filesystem.
RAM filesystem
The filesystem used during the boot process; it contains programs and
data for initializing devices and subsystems in order to install AIX, execute
diagnostics, or access and bring up the rest of AIX.
base customized data
Area of the hard disk boot logical volume containing the user configured
ODM device configuration information that is used by the system
configuration process.
AIX 5L Power boot record
Introduction
On Power systems, the boot record is located at the beginning of the boot
device and contains the following information:
• The IPL record
• The boot partition table, used by chrp and rspc systems
IPL record description
The following table describes the content of the boot record.
size  offset  name              description
4     0       IPL_record_id     This physical volume contains a valid IPL
                                record if and only if this field contains
                                IPLRECID in EBCDIC ('IBMA')
20    4       reserved1
4     24      formatted_cap     Formatted capacity: the number of sectors
                                available after formatting
1     28      last_head         DISKETTE INFORMATION: the number of heads
                                minus 1
1     29      last_sector       DISKETTE INFORMATION: the number of
                                sectors per track
6     30      reserved2
4     36      boot_code_length  Boot code length in sectors; a 0 value
                                implies no boot code present
4     40      boot_code_offset  Boot code offset; must be 0 if no boot
                                code present, else contains the byte
                                offset from the start of boot code to the
                                first instruction
4     44      boot_lv_start     Contains the PSN of the start of the BLV
4     48      boot_prg_start    Boot code start; must be 0 if no boot
                                code present, else contains the PSN of
                                the start of boot code
4     52      boot_lv_length    BLV length in sectors
4     56      boot_load_add     512 byte boundary load address for boot
                                code
1     60      boot_frag         0x1 => fragmentation allowed
1     61      boot_emulation    0x1 => ROS network emulation code
2     62      reserved3
2     64      basecn_length     Number of sectors for base customization,
                                normal mode
2     66      basecs_length     Number of sectors for base customization,
                                service mode
4     68      basecn_start      Starting PSN value for base customization,
                                normal mode
4     72      basecs_start      Starting PSN value for base customization,
                                service mode
24    76      reserved4
4     100     ser_code_length   Service code length in sectors; a 0 value
                                implies no service code present
4     104     ser_code_offset   Service code offset; 0 if no service code
                                is present, else contains the byte offset
                                from the start of service code to the
                                first instruction
4     108     ser_lv_start      Contains the PSN of the start of the SLV
4     112     ser_prg_start     Service code start; must be 0 if service
                                code is not present, else contains the
                                PSN of the start of service code
4     116     ser_lv_length     SLV length in sectors
4     120     ser_load_add      512 byte boundary load address for
                                service code
1     124     ser_frag          Service code fragmentation flag; must be
                                0 if no fragmentation allowed, else must
                                be 0x01
1     125     ser_emulation     ROS network emulation flag
2     126     reserved5
8     128     pv_id             The unique identifier for this PV
376   136     dummy             Includes the partition table
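For reference, the record can be sketched as a C structure. The field
names and sizes follow the two tables above; the struct name and the use
of plain C types (assuming 32-bit int) are illustrative, not the system's
own definition:

    struct ipl_rec_sketch {
            unsigned int   ipl_record_id;     /* 'IBMA' in EBCDIC (0xc9c2d4c1) */
            char           reserved1[20];
            unsigned int   formatted_cap;
            unsigned char  last_head;
            unsigned char  last_sector;
            char           reserved2[6];
            unsigned int   boot_code_length;
            unsigned int   boot_code_offset;
            unsigned int   boot_lv_start;
            unsigned int   boot_prg_start;
            unsigned int   boot_lv_length;
            unsigned int   boot_load_add;
            unsigned char  boot_frag;
            unsigned char  boot_emulation;
            char           reserved3[2];
            unsigned short basecn_length;
            unsigned short basecs_length;
            unsigned int   basecn_start;
            unsigned int   basecs_start;
            char           reserved4[24];
            unsigned int   ser_code_length;
            unsigned int   ser_code_offset;
            unsigned int   ser_lv_start;
            unsigned int   ser_prg_start;
            unsigned int   ser_lv_length;
            unsigned int   ser_load_add;
            unsigned char  ser_frag;
            unsigned char  ser_emulation;
            char           reserved5[2];
            char           pv_id[8];
            char           dummy[376];        /* includes the partition table */
    };                                        /* 512 bytes total */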
boot partition table
The boot record contains 4 partition table entries starting at offset 0x1be.
Each entry contains the following information:

size in bytes  name      description
1              boot_ind  Boot indicator
1              begin_h   Begin head
1              begin_s   Begin sector
1              begin_c   Begin cylinder
1              syst_ind  System indicator
1              end_h     End head
1              end_s     End sector
1              end_c     End cylinder
4              RBA       Relative block address in little endian format
4              sectors   Number of sectors in little endian format
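A single entry can likewise be sketched in C (the struct name and plain
C types, assuming 32-bit int, are illustrative; the four entries live at
offsets 0x1be, 0x1ce, 0x1de and 0x1ee within the boot record):

    struct boot_part_entry_sketch {
            unsigned char boot_ind;   /* boot indicator   */
            unsigned char begin_h;    /* begin head       */
            unsigned char begin_s;    /* begin sector     */
            unsigned char begin_c;    /* begin cylinder   */
            unsigned char syst_ind;   /* system indicator */
            unsigned char end_h;      /* end head         */
            unsigned char end_s;      /* end sector       */
            unsigned char end_c;      /* end cylinder     */
            unsigned int  rba;        /* relative block address (little endian) */
            unsigned int  sectors;    /* number of sectors (little endian)      */
    };                                /* 16 bytes per entry */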
boot partition table entries
The RS6K platform doesn't use a boot partition table. The four boot
partition table entries are used for:
• CHRP boot images
• CHRP and first RSPC boot image
• CHRP and second RSPC boot image
• CHRP and third RSPC boot image
Example
The following chart represents an AIX 5L boot record from a chrp system.
It was obtained using:
od -Ax -x /dev/hdisk0 | pg
The annotated fields are: the IPL_record_id 'IBMA' (c9c2 d4c1 in EBCDIC)
at offset 0; boot_code_length at 0x24 and boot_lv_start at 0x2c;
base_cn_length and base_cs_length at 0x40; base_cn_start and
base_cs_start at 0x44 and 0x48; serv_code_length at 0x64 and
ser_lv_start at 0x6c; the PVID at 0x80; and the boot partition table at
0x1be, whose entries carry the RBA and sectors fields, terminated by the
55aa BOOT_SIGNATURE.

0000000  c9c2 d4c1 0000 0000 0000 0000 0000 0000
0000010  0000 0000 0000 0000 0000 0000 0000 0000
0000020  0000 0000 0000 2cc1 0000 0000 0000 1100
0000030  0000 0000 0000 0000 0000 0000 0000 0000
0000040  0100 0100 0000 3cdc 0000 3cdc 0000 0000
0000050  0000 0000 0000 0000 0000 0000 0000 0000
0000060  0000 0000 0000 2cc1 0000 0000 0000 1100
0000070  0000 0000 0000 0000 0000 0000 0000 0000
0000080  0007 1483 229d 0662 0000 0000 0000 0000
0000090  0000 0000 0000 0000 0000 0000 0000 0000
...
00001b0  0000 0000 0000 0000 0000 0000 0000 0000
00001c0  0000 0000 0000 0000 0000 0000 0000 80ff
00001d0  ffff 41ff ffff 1b11 0000 c12c 0000 00ff
00001e0  ffff 41ff ffff 0211 0000 1900 0000 80ff
00001f0  ffff 41ff ffff 1b11 0000 c12c 0000 55aa
0000200  4182 000c 3880 0000 4800 000c 7c83 2378
0000210  7ca4 2b78 83c3 0098 7fde 1814 83de 0034
0000220  57de 063e 2c1e 0057 4182 0024 2c1e 0058
0000230  4182 001c 2c1e 0059 4182 0014 2c1e 0072
0000240  4182 000c 2c1e 0082 4082 0030 83c3 0288
0000250  7fde 1814 83de 006c 2c1e 0000 4182 001c
0000260  3fc0 8000 7fcf 01a4 3fc0 f000 83fe 10c0
0000270  67ff 0080 93fe 10c0 31ad ffd8 30c3 0080
Instructor Notes
Purpose
Notes on the Power boot record.
Little endian format
The RBA and sectors fields in the boot partition table are in little endian
format. To obtain the actual address, swap the bytes as displayed by the
od command: for example, an RBA displayed as "1b11 0000" in the dump
above is the little endian encoding of 0x0000111b.
Power boot images structures
Introduction
Depending on the architecture, the boot image does not always contain the
same elements, due to the requirements of the ROS and firmware
specifications.
RS6K boot image
The rs6k platform doesn't need a softros emulation, so the boot image
starts with the bootexpand program. bootexpand is loaded first, to
uncompress the kernel and the RAMFS.
RSPC boot image
On rspc, the aixmon_rspc softros is located at the beginning of the boot
image, but the XCOFF format is replaced by a hints structure, as defined in
/usr/include/sys/boot.h. An RSPC boot image therefore contains the
following sections:
• The hints structure
• The aixmon_rspc file stripped of its XCOFF header, in fact starting at
  its entry point
• The bootexpand program
• The compressed kernel
• The compressed RAMFS
• The saved base customization
CHRP boot image
On chrp, the aixmon_chrp softros is located at the beginning of the boot
image, but the XCOFF format is replaced by an ELF format. A CHRP boot
image therefore contains:
• The ELF structure
• The aixmon_chrp file stripped of its XCOFF header, in fact starting at
  its entry point
• The bootexpand program
• The compressed kernel
• The compressed RAMFS
• The saved base customization
RSPC boot image hints header
Introduction
On rspc systems, the aixmon XCOFF header is replaced by a hints
structure. The aixmon_rspc file is copied to the boot image after the hints
structure, starting at its entry point.
hints boot structure description
The following table represents the hints structure:
size  name                description
4     signature           Signature for boot program: 0x4149584d
4     resid_data_address  Address of residual data as determined by
                          firmware
4     bss_offset          Address of bss section
4     bss_length          Length of bss section
4     jump_offset         Jump offset in boot image
4     load_exec_address   Address of boot loader as determined by
                          firmware
4     header_size         Size of header
4     header_block_size   Offset to AIX boot image
4     image_length        Size of boot program
4     spare
4     res_mem_size        Reserved memory size
4     mode_control        Boot mode control: 0xDEAD0000 | mode_control
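Sketched as a C structure (the struct name and plain C types, assuming
32-bit int, are illustrative; the real definition is the hints structure in
/usr/include/sys/boot.h):

    struct boot_hints_sketch {
            unsigned int signature;           /* 0x4149584d ("AIXM")           */
            unsigned int resid_data_address;  /* residual data, from firmware  */
            unsigned int bss_offset;          /* address of bss section        */
            unsigned int bss_length;          /* length of bss section         */
            unsigned int jump_offset;         /* jump offset in boot image     */
            unsigned int load_exec_address;   /* boot loader addr, firmware    */
            unsigned int header_size;         /* size of this header           */
            unsigned int header_block_size;   /* offset to the AIX boot image  */
            unsigned int image_length;        /* size of the boot program      */
            unsigned int spare;
            unsigned int res_mem_size;        /* reserved memory size          */
            unsigned int mode_control;        /* 0xDEAD0000 | mode_control     */
    };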
RSPC boot image example
The following output represents the hints header, obtained with the
following command:
# dd if=<boot_disk> bs=512 skip=<RBA> count=1 | od -Ax -x
0000000  0000 0000 0000 0000 0000 0000 0000 0000
*
0000200  3004 0000 00fe 3200 0002 4149 5820 2034
0000210  2033 2030 3130 3130 3035 3437 3000 0000
0000220  0000 0000 0000 0000 0000 0000 0000 0000
*
0000400  4149 584d 0000 0000 0000 ffd4 0000 022c
0000410  0000 038c 0000 0000 0000 0400 0000 0097
0000420  0001 2810 0000 0000 0000 0000 dead 00c0
0000430  4800 0005 7e80 00a6 7e94 a278 3a94 1000
The hints signature 0x4149584d ("AIXM") is at offset 0x400 of the dump;
the aixmon entry point follows at offset 0x430.
CHRP Boot image ELF structure
Introduction
On chrp systems, the aixmon XCOFF header is replaced by an ELF
header. The aixmon_chrp file is copied to the boot image after the ELF
header, starting at its entry point.
ELF boot header description
The ELF boot header is made of:
• the ELF header structure
• the Note section description
• the loader section 1 description
• the loader section 2 description
• the Note data description
• the boot loader parameters data
ELF header structure description
The following table describes the ELF header structure:
size  name         description
16    e_ident      ELF identification
2     e_type       object file type
2     e_machine    architecture
4     e_version    object file version
4     e_entry      entry point
4     e_phoff      prog hdr byte offset
4     e_shoff      section hdr byte offset
4     e_flags      processor specific flags
2     e_ehsize     ELF header size
2     e_phentsize  prog hdr table entry size
2     e_phnum      prog hdr table entry count
2     e_shentsize  section header size
2     e_shnum      section header entry count
2     e_shstrndx   sect name string tbl idx
Note, load 1 and load 2 segment descriptions
The following table describes the structure used to format the note,
loader 1 and loader 2 segments:

size  name      description
4     p_type    segment type
4     p_offset  offset to this segment
4     p_vaddr   virt addr of seg in memory
4     p_paddr   phy addr of seg in memory
4     p_filesz  file image segment size
4     p_memsz   mem image segment size
4     p_flags   segment flags
4     p_align   segment alignment

Note data description
The following table represents the note data description structure:
size  name       description
4     namesz     size of name
4     descsz     size of descriptor
4     type       descriptor interpretation
8     name       the owner of this entry
4     real_mode  ISA env variable
4     real_base  ISA env variable
4     real_size  ISA env variable
4     virt_base  ISA env variable
4     virt_size  ISA env variable
4     load_base  ISA env variable
Boot loader parameters description
The following table describes the boot loader structure:
size  name              description
4     timestamp         date when the boot image was created
4     bootimage_size    equivalent to the number of sectors for the blv
                        found in the bootrecord
4     boot_loader_size  size of the aixmon in bytes
4     inst_offset       jump offset in boot image
4     rmalloc_size      percent of memory for kernel heap
4     reserved1
4     reserved2
4     reserved3
example
Use the following command to display the ELF structure:
# dd if=<boot_disk> bs=512 skip=<RBA> count=1 | od -Ax -x
[Annotated dump showing, in order: the elf_hdr, the note_phdr, load_phdr1,
load_phdr2, the note_data and the BL_parms_data sections, followed by
the aixmon entry point.]
Exercise
Introduction
This exercise shows how to locate the different parts of the boot image
using the boot record.
Procedure
Follow the procedure below to locate the main parts of the boot image.
Step 1   Locate the boot disk using:
         # bootinfo -b
Step 2   Determine the architecture of your system using:
         # bootinfo -p
Step 3   Find the boot record located at the beginning of the disk
         found in step 1, using:
         # dd if=<boot_disk> bs=512 count=1 | od -Ax -x
Step 4   • On RSPC or CHRP, locate the RBA and sectors in the
           boot partition table from the output of step 3.
         • On RS6K, locate the boot_prg_start and
           boot_code_length in the record.
Step 5   Create a file using the offset and sectors length found in
         step 4, using:
         # dd if=<boot_disk> bs=512 skip=<offset>
           count=<sectors> of=/tmp/myfile
Step 6   Using the what command, try to find what is included in
         this file.
Step 7   What is missing from the what output? Why?
Step 8   Create a file using the offset and sectors length found in
         step 4, plus the size of the boot_loader:
         # dd if=<boot_disk> bs=512
           skip=<(offset*512)+boot_loader_size)>
           count=512 of=/tmp/myfile2
         What is myfile2?
Step 9   Using the results from step 3, locate the base customization
         sector start and length; use these values to create a new
         file:
         # dd if=<boot_disk> bs=512 skip=<base_cn_start>
           count=<base_cn_length> of=/tmp/myfile3
Step 10  Create a directory <dir1> and copy /etc/objrepos/* to dir1.
         # /usr/lib/boot/restbase -o myfile3 -d dir1 -v
Instructor Notes
Purpose
Notes on the boot record and image exercise.
Details
Step 6 should output something like:
07 1.3 src/rspc/usr/lib/boot/aixmon_chrp/cl_in_services.c, chrp_softros, rspc500, 0025A_500 10/22/98 14:25:39
04 1.32 src/rspc/usr/lib/boot/aixmon_chrp/aixmon_chrp.c, chrp_softros, rspc500, 0026A_500 6/16/00 12:43:25
09 1.2 src/rspc/usr/lib/boot/aixmon_chrp/printf.c, chrp_softros, rspc500, 0025A_500 1/13/99 10:38:02
08 1.40 src/rspc/usr/lib/boot/aixmon_chrp/iplcb_init.c, chrp_softros, rspc500, 0029A_500 7/17/00 14:07:11
39 1.5 src/rspc/usr/lib/boot/aixmon_chrp/numa_topo.c, chrp_softros, rspc500, 0028A_500 6/7/00 08:11:21
48 1.1 src/rspc/usr/lib/boot/aixmon_chrp/rtas_func.c, chrp_softros, rspc500, 0026A_500 6/16/00 13:04:32
65 1.21 src/bos/usr/sbin/bootexpand/expndkro.c, bosboot, bos500, 0025A_500 4/14/00 14:26:38
This reflects the presence of the softros (aixmon_chrp) and the
bootexpand code. The kernel and the ramfs are missing, because they are
stripped and therefore unreadable by the what command.
Step 8 should output something like:
# what /tmp/myfile2
/tmp/myfile2:
65 1.21 src/bos/usr/sbin/bootexpand/expndkro.c, bosboot, bos500, 0
After completing step 10, students should observe that the following
files were updated by the restbase command. This confirms that myfile3 is
actually the base customization area.
-rw-r--r--  1 root  system  32768 Aug 23 15:51 CuDvDr
-rw-r--r--  1 root  system   4096 Aug 23 15:51 CuPath
-rw-r--r--  1 root  system   4096 Aug 23 15:51 CuPath.vc
-rw-r--r--  1 root  system   4096 Aug 23 15:51 CuPathAt
-rw-r--r--  1 root  system   4096 Aug 23 15:51 CuPathAt.vc
-rw-r--r--  1 root  system  16384 Aug 23 15:51 CuAt
-rw-r--r--  1 root  system   8192 Aug 23 15:51 CuAt.vc
-rw-r--r--  1 root  system   4096 Aug 23 15:51 CuDep
-rw-r--r--  1 root  system  16384 Aug 23 15:51 CuVPD
-rw-r--r--  1 root  system  12288 Aug 23 15:51 CuDv
Power ROS and Softros
ROS
On RS6K platforms, the hardware ROS performs some basic hardware
configuration and tests, and creates the IPL Control Block before
transferring control to the kernel's entry point.
Softros
The RSPC and CHRP families of computers require a boot image with
special software known as SOFTROS, which provides function that AIX
requires and that the hardware firmware does not supply. The SOFTROS
performs some basic hardware configuration and tests, and also sets up
some data structures to provide an environment for AIX that more closely
resembles the environment provided by RS6K system ROS. On CHRP
systems, the firmware device tree is also appended to the IPL Control
Block. The Softros then transfers control to the kernel's entry point.
IPLCB on Power
Definition
The IPLCB (Initial Program Load Control Block) defines the RAM resident
interface between the IPL boot process and the operating system.
The ROS or Softros initializes the IPLCB structure using interfaces to the
firmware or ROS (on the RS6K platform). The kernel, when loaded, uses
the IPLCB structure to initialize its runtime structures.
IPLCB Description
The IPLCB contains the following structures (described in
/usr/include/sys/iplcb.h):
• IPLCB Directory: contains the IPLCB ID and pointers (offset and size)
  to the IPLCB data
• IPLCB data, such as:
  • processor information ('ipl -proc [cpu]')
  • memory region ('ipl -mem')
  • system information ('ipl -sys')
  • user information ('ipl -user')
  • NUMA information ('ipl -numa')
IPLCB directory example on a CHRP system
The following screen output shows the IPLCB directory on a CHRP
system, captured using the kdb iplcb -dir subcommand:
IPL directory [10000080]
ipl_control_block_id.........ROSIPL
ipl_cb_and_bit_map_offset...00000000
bit_map_offset..............000087A8
ipl_info_offset.............000002E8
iocc_post_results_offset....00000000
nio_dskt_post_results_offset00000000
sjl_disk_post_results_offset00000000
scsi_post_results_offset....00000000
eth_post_results_offset.....00000000
tok_post_results_offset.....00000000
ser_post_results_offset.....00000000
par_post_results_offset.....00000000
rsc_post_results_offset.....00000000
lega_post_results_offset....00000000
keybd_post_results_offset...00000000
ram_post_results_offset.....00000000
sga_post_results_offset.....00000000
fm2_post_results_offset.....00000000
net_boot_results_offset.....00000000
csc_results_offset..........00000000
menu_results_offset.........00000000
console_results_offset......00000000
diag_results_offset.........00000000
rom_scan_offset.............00000000
sky_post_results_offset.....00000000
global_offset...............00000000
mouse_offset................00000000
vrs_offset..................00000000
taur_post_results_offset....00000000
ent_post_results_offset.....00000000
vrs40_offset................00000000
gpr_save_area1............@ 10000178
system_info_offset..........00000880
buc_info_offset.............0000091C
processor_info_offset.......00000A6C
fm2_io_info_offset..........00000000
processor_post_results_off..00000000
system_vpd_offset...........00000000
mem_data_offset.............00000000
l2_data_offset..............00000D7C
fddi_post_results_offset....00000000
golden_vpd_offset...........00000000
nvram_cache_offset..........00000000
user_struct_offset..........00000000
residual_offset.............00000E3C
numatopo_offset.............00000E3C
ipl_cb_and_bit_map_size....00008898
bit_map_size...............00000007
ipl_info_size..............00000598
iocc_post_results_size.....00000000
nio_dskt_post_results_size.00000000
sjl_disk_post_results_size.00000000
scsi_post_results_size.....00000000
eth_post_results_size......00000000
tok_post_results_size......00000000
ser_post_results_size......00000000
par_post_results_size......00000000
rsc_post_results_size......00000000
lega_post_results_size.....00000000
keybd_post_results_size....00000000
ram_post_results_size......00000000
sga_post_results_size......00000000
fm2_post_results_size......00000000
net_boot_results_size......00000000
csc_results_size...........00000000
menu_results_size..........00000000
console_results_size.......00000000
diag_results_size..........00000000
rom_scan_size..............00000000
sky_post_results_size......00000000
global_size................00000000
mouse_size.................00000000
vrs_size...................00000000
taur_post_results_size.....00000000
ent_post_results_size......00000000
vrs40_size.................00000000
system_info_size...........0000009C
buc_info_size..............00000150
processor_info_size........00000310
fm2_io_info_size...........00000000
processor_post_results_size00000000
system_vpd_size............00000000
mem_data_size..............00000000
l2_data_size...............000000C0
fddi_post_results_size.....00000000
golden_vpd_size............00000000
nvram_cache_size...........00000000
user_struct_size...........00000000
residual_size..............0000776C
numatopo_size..............00000000
Checkpoint
Introduction
Take a few minutes to answer the following questions. We will review the
questions as a group when everyone has finished.
Quiz
1. Where is the softros located?
2. What are the four common parts of the boot image across Power
   platforms?
3. What is the difference between the RSPC and the CHRP at the very
   beginning of the boot image?
4. In which logical volume is the boot record located?
5. Who builds the IPLCB on the 3 Power platforms?
6. What is the difference between the RS6K and the other Power
   architectures in the boot record?
Instructor Notes
Purpose
Notes on the Quiz and transition to the next section.
Quiz responses
The responses for the Quiz are:
1. Where is the softros located?
• After the header in the boot logical volume
2. What are the four common parts of the boot image across Power
   platforms?
• bootexpand
• kernel
• ramfs
• saved base
3. What is the difference between the RSPC and the CHRP at the very
   beginning of the boot image?
• RSPC uses a hints structure
• CHRP uses an ELF header
4. In which logical volume is the boot record located?
• None; the boot record is located at the very beginning of the disk
5. Who builds the IPLCB?
• ROS on RS6K
• Softros on CHRP and RSPC
6. What is the difference between RS6K and the other Power platforms
   in the boot record?
• The RS6K doesn't use the boot partition table
Transition Statement
Now we will describe:
• the IA-64 specific boot process, if this is not a Power-only course
The IA-64 Boot Mechanism
Introduction
This section explains the boot mechanism used by the IA-64 platform.
Definitions
EFI stands for Extensible Firmware Interface. EFI provides a standard
interface between the hardware and the operating system on IA-64
platforms.
Boot overview
When the system is powered on, the EFI loads first. EFI loads a BIOS for
the devices that need one. EFI then prompts to enter the setup for a
timeout period, and afterwards prompts with the EFI boot menu for another
timeout period, after which it scans the bootlist in order to find a boot
device.
The EFI boot loader prompts for the boot loader menu and, after the
timeout or exit from the menu, initializes the IPL Control Block. Then it
locates and loads the kernel, which initializes itself.
The kernel then calls init (in fact /usr/lib/boot/ssh at this stage).
The ssh then calls rc.boot for Phase I and Phase II, which are specific to
each boot device type.
Then init executes rc.boot Phase III and the remaining common code in
rc.boot for disk and network boot devices.
If no boot device is found, EFI starts the EFI Shell on IA-64 platforms that
support the EFI shell.
Boot diagram
The following diagram represents the high level boot process overview:
1. execution of the EFI firmware; needed BIOSes are loaded, and setup is
   offered for a timeout period
2. if setup is requested before the timeout or an OS boot request: the
   setup menu
3. the EFI boot manager menu; if the boot maintenance manager is
   requested: the boot maintenance manager menu
4. otherwise the boot list is scanned; if no valid boot device is found:
   the EFI Shell
5. the AIX boot loader; if a key is entered during the timeout: the AIX
   boot loader menu
6. kernel initialization
7. the kernel calls init (/usr/lib/boot/ssh)
8. init (ssh) calls rc.boot PHASE I & II
9. init exits to newroot
10. init calls rc.boot PHASE III from inittab and processes the rest of
    the inittab entries
IA-64 boot disk layout
Boot image overview
The following represents an overview of an AIX 5L on IA-64 boot image :
hdisk0_all (the entire disk) :
• PMBR, EFI Partition Header and entries
• hdisk0 (the Physical Volume partition), containing the VGDA and hd5
(the boot logical volume, holding the kernel, the RAM Filesystem and
the base customized data)
• hdisk0_s0 (the IA-64 System partition), containing the EFI boot loader
• the rest of the hdisk0
On the IA-64 platform, AIX 5L must be aware of EFI disk partitioning.
During installation, two partitions will be created on the target disk
(hdisk0_all) :
• A Physical Volume partition (hdisk0 in the AIX environment) known as a
block device in the EFI environment (blkXX).
• An IA-64 System partition (hdisk0_s0 in the AIX environment) known as
an IA-64 System partition in the EFI environment (fsXX).
kernel
On the IA-64 platform, the 64-bit kernel (unix_ia64) can be used as the kernel
for either UP or MP systems. The kernel initializes itself and then passes
control to the simple shell init (ssh) in the RAM filesystem.
RAM filesystem
The filesystem used during the boot process. It contains programs and data
for initializing devices and subsystems in order to install AIX, execute
diagnostics, or access and bring up the rest of AIX.
base customized data
The area of the hard disk boot logical volume containing user-configured
ODM device configuration information that is used by the system
configuration process.
EFI boot loader
The EFI boot loader resides in an IA-64 System Partition physically
located after the Physical Volume Partition by the installation process.
EFI boot manager and boot maintenance manager overview
Introduction
At boot time, EFI will prompt for the EFI boot manager menu to be entered
for a timeout period.
The timeout period is customizable via the boot maintenance menu.
boot manager
At boot time, the boot manager will display the bootlist and prompt for a
timeout period.
If the timeout is reached, the boot manager will scan the bootlist in the boot
order to find a valid boot device.
If a key is entered before the timeout period, the user will be able to :
• select a boot device from the list to boot from for this session
• start the EFI Shell, on platforms that support it
• enter the boot maintenance manager
boot maintenance manager menu
The boot maintenance manager menu will allow the administrator to :
• boot from a file
• add/delete boot options
• change boot order
• manage boot next setting
• set autoboot timeout
• select active console devices (output, input and error)
• do a cold reset.
EFI Shell Overview
Introduction
The EFI Shell allows you to configure the boot process used by the IA-64
platform. The main functions are to :
• Locate and identify different boot devices
• Set environment variables
• Use debugging sub commands
• Boot from the selected boot device
EFI Shell startup example
The EFI Shell startup will display information about the current EFI level
and device mapping, as follows :
EFI version x.xx [xx.xx] Build flags : EIF64 Running on Merced EFI_DEBUG
EFI IA-64 SDV/FDK (BIOS CallBacks) [Fri Mar 31 13:21:32 2000] - INTEL
Cache Enabled. This image Main entry is at address 000000003F2BA000
Stack     = 000000003F2B6FF0
BSP       = 000000003F293000
INT Stack = 000000003F292FF0
INT BSP   = 000000003F26F000
EFI Shell version x.xx [xx.xx]
Device mapping table
fs0  : VenHw(Unknown Device:80)/HD(Part1,Sig0CBCBA54)
blk0 : VenHw(Unknown Device:01)/HD
blk1 : VenHw(Unknown Device:80)/HD
blk2 : VenHw(Unknown Device:81)/HD
blk3 : VenHw(Unknown Device:ff)/HD
blk4 : VenHw(Unknown Device:80)/HD(Part1,Sig0CBCBA54)
blk5 : VenHw(Unknown Device:80)/HD(Part2,Sig0CBCBA54)
EFI Shell sub commands
In the EFI Shell you will be able to use the following sub commands :

sub command                      Description
help [internal command]          Display this help
guid [sname]                     Dump known guid ids
set [-d] [sname] [value]         Set/get environment variable
alias [-d] [sname] [value]       Set/get alias settings
dh [-p prot_id] | [handle]       Dump handle info
map [-dvr] [sname[:]] [handle]   Map shortname to device path
mount BlkDevice [sname[:]]       Mount a filesystem on a block device
cd [path]                        Updates the current directory
echo [[-on | -off] \ [text]      Echo text to stdout or toggle script echo
endfor                           Script-only: Delimiter for loop construct
pause                            Script-only: Prompt to quit or continue
ls [dir] [dir]...                Obtain directory listing
mkdir [dir][dir]....             Make directory
if [not] condition then         Script-only: IF THEN construct
endif                            Script-only: Delimiter for IF THEN construct
goto label                       Script-only: Jump to label location in script
for var in <set>                 Script-only: Loop construct
mode [row col]                   Set/get current text mode
cp file [file] ... dest          Copy files/directories
comp file1 file2                 Compare two files
rm file/dir [file/dir]           Remove file/directories
memmap                           Dumps memory map
type [-a] [-u] file              Type file
dmpstore                         Dumps variable store
load driver_name                 Loads a driver
ver                              Displays version info
err [level]                      Set or display error level
time [hh:mm:ss]                  Set or display time
date [mm/dd/yyyy]                Set or display date
stall microseconds               Delay for x microseconds
reset [/warm] [reset string]     Cold or Warm reset
vol fs [Volume Label]            Set or display volume label
attrib [+/- rhs] [filename]      View/sets file attributes
cls [background color]           Clear screen
dnlk device [Lba] [Blocks]       Hex dump of BlkIo Devices
pci [bus dev] [func]             Display pci device(s) info
mm Address [Width] [;Type]       Memory modify: Mem, MMIO, IO, PCI
mem [Address] [size] [;MMIO]     Dump Memory or Memory Mapped IO
bcfg -?                          Configures boot driver & load options
edit [file name]
Edd30 [On|Off]                   Enable or Disable EDD 3.0 Device paths
unload [-nv]
EddDebug [blockdevicename]       Debug of EDD info from adapter card
EFI Shell examples
The following is an example of EFI Shell use :
Shell> map <== show the current device mapping
fs0  : VenHw(Unknown Device:80)/HD(Part1,Sig0CBCBA54)
blk0 : VenHw(Unknown Device:01)/HD
blk1 : VenHw(Unknown Device:80)/HD
blk2 : VenHw(Unknown Device:81)/HD
blk3 : VenHw(Unknown Device:ff)/HD
blk4 : VenHw(Unknown Device:80)/HD(Part1,Sig0CBCBA54)
blk5 : VenHw(Unknown Device:80)/HD(Part2,Sig0CBCBA54)
Shell> pci <== list the pci devices
Bus  Dev  Func  Description
 0    0    0    ==> Generic System Peripheral - Interrupt Controller
                Vendor 0x8086 Device 0x123D Program Interface 20
 0    2    0    ==> Mass Storage Controller - SCSI Bus
                Vendor 0x1077 Device 0x1280 Program Interface 0
 0    3    0    ==> PCI Bridge Device - ISA
                Vendor 0x8086 Device 0x7600 Program Interface 0
 0    3    1    ==> Mass Storage Controller - IDE
                Vendor 0x8086 Device 0x7601 Program Interface 80
 0    3    2    ==> Serial Bus Controller - USB
                Vendor 0x8086 Device 0x7602 Program Interface 0
 0    3    3    ==> Serial BUS Controller - SMBUS
                Vendor 0x8086 Device 0x7603 Program Interface 0
.
.
.
Shell> fs0: <== change to fs0
fs0:>dir <== list the content of fs0
XX/XX/XX  01:05p  <DIR>        512  aix
XX/XX/XX  01:10p           279,792  a.out
XX/XX/XX  01:11p            23,636  boot.efi
fs0:>boot <== boot from fs0
IA-64 Boot Loader
Introduction
The AIX 5L EFI boot loader provides the interface between EFI and the
kernel.
On disk drives, the AIX boot loader is located in the system partition.
Before loading the kernel, the boot loader will prompt the user to enter the
boot loader menu.
Then the boot loader will make use of the EFI interface to initialize the IPL
Control Block.
The boot loader will then locate the kernel, which resides in hd5, itself
contained in the AIX PV partition.
Finally the boot loader will pass control to the kernel entry point.
boot loader and EFI interactions
The boot loader will make use of all the EFI boot services to load file
images such as the kernel, the RAM filesystem file and the base customized
data, and to locate various system tables such as the System Abstraction
Layer (SAL) System Table (SST) and the Advanced Configuration and Power
Interface (ACPI) Specification Tables. The boot loader will then create the
Initial Program Load Control Block (IPLCB) and set up the Translation
Registers (TR) before transferring control to the kernel's entry point.
EFI boot loader menu
The boot loader menu can be used to set parameters that may affect the
kernel loading and operating environment, such as :
• enable the kernel debugger
• invoke the kernel debugger
• override RMALLOC memory reservation
• set the boot loader debug flag
• set the service/diagnostics flag
• select the amount of memory to enable
• select the number of CPUs to use
• toggle single/multi dispersal mode
IA-64 Initial Program Load Control Block
Introduction
The IPLCB (Initial Program Load Control Block) defines the RAM-resident
interface between the IPL boot process and the operating system. The
boot loader will initialize the IPLCB structure using interfaces to EFI.
The kernel, when loaded, will use the IPLCB structure to initialize its
runtime structures.
IPLCB Description
The IPLCB contains the following structures (described in
/usr/include/sys/iplcb.h) :
• IPLCB Directory : contains the IPLCB ID and pointers (offsets and
sizes of the IPLCB data)
• IPLCB data, such as :
• IPLCB Hand off information
• IPLCB IPL information
• IPLCB system information
• IPLCB processor information
• I/O XAPIC Information
• Memory Information and Memory regions.
IPLCB directory example on an IA-64 system
The following screen shows the IPLCB Directory on an IA-64 system,
captured using the IADB iplcb -dir sub command :
> iplcb -dir
Directory Information
ipl_control_block_id......................= IA64_IPL
ipl_cb_and_bit_map_offset.................= 0x0
ipl_cb_and_bit_map_size...................= 0x7F0
bit_map_offset............................= 0x448
bit_map_size..............................= 0x27
ipl_info_offset...........................= 0xD8
ipl_info_size.............................= 0x7C
system_info_offset........................= 0x3D8
system_info_size..........................= 0x50
processor_info_offset.....................= 0x250
processor_info_size.......................= 0x188
io_xapic_info_offset......................= 0x428
io_xapic_info_size........................= 0x18
handoff_info_offset.......................= 0x158
handoff_info_size.........................= 0xF0
platform_int_info_offset..................= 0x440
platform_int_size.........................= 0x8
residual_offset...........................= 0x0
residual_size.............................= 0x0
Checkpoint
Introduction
Take a few minutes to answer the following questions. We will review the
questions as a group when everyone has finished.
Quiz
7. In which partition is the AIX boot loader located ?
8. What is the equivalent of the fs0 partition in the AIX environment ?
9. In which partition is the IA-64 boot record located ?
10. In which partition is the IA-64 boot image located ?
11. Where is the bootexpand located on IA-64 ?
Instructor Notes
Purpose
Checkpoint results for the IA-64 boot section
Answers
7. In which partition is the AIX boot loader located ?
• The boot loader is located in fs0:
8. What is the equivalent of the fs0 partition in the AIX environment ?
• the equivalent is hdiskxx_s0
9. In which partition is the IA-64 boot record located ?
• None, there is no boot record on IA-64
10. In which partition is the IA-64 boot image located ?
• the boot image is located in hd5, which in fact resides in the rootvg PV
partition of the disk (blk5 in our example)
11. Where is the bootexpand located on IA-64 ?
• Nowhere, there is no bootexpand on IA-64
Hard Disk Boot process (rc.boot Phase I)
Introduction
The main goal here is to get the devices configured and the ODM initialized.
Hard disk Phase I diagram
The following chart represents the hard disk boot phase I process :
restore base configuration from boot disk (restbase)
• if the restbase return code <> 0 : led 548
• if 0 : led 510
configuration manager Phase I
run bootinfo -b to get the boot device
link the boot device to /dev/ipldevice
led 511
exit 0
Hard Disk Boot process (rc.boot Phase II)
Introduction
The main objective in hard disk boot phase II is to vary on rootvg and
mount the standard filesystems.
Hard disk Phase II diagram
The following chart represents the hard disk boot phase II process :
led 511
ipl_varyon -v
• if the ipl_varyon return code <> 0 : led 552, 554 or 556
• if 0 : led 517
fsck and mount AIX filesystems on /mnt
check for a dump in hd6
swapon hd6 if no dump is present
run the savebase recovery procedure
• if the key is in service position or a dump is in hd6 : execute the service procedure
copy /etc/vg and objrepos to disk
merge devices
unmount filesystems
remount filesystems
led 553
exit 0
Hard Disk Boot process (rc.boot Phase III)
Introduction
The main objective in hard disk boot phase III is to mount the runtime /tmp,
sync rootvg and then fall through to the phase III common process.
Hard disk Phase III diagram
The following chart represents the hard disk boot phase III process :
fsck and mount /tmp
syncvg rootvg
continue phase III
common code
CDROM Boot process (rc.boot Phases I, II and III)
Introduction
The main objective of the CDROM boot process is to configure devices
needed for installation and maintenance procedures and start the bi_main
process.
CDROM boot phases I, II and III diagram
The following chart shows the CDROM boot phases I, II and III :
The flow branches on the phase number :
• Phase 1 : configuration manager Phase I, led 517, mount the CDROM SPOT,
led 512, recreate the ramfs from the SPOT, led 510, configure the remaining
devices needed for install, led 511, exit 0
• Phase 2 : exec bi_main
• Phase 3 : exit 0
Tape Boot process (rc.boot Phases I, II and III)
Introduction
The main objective of the Tape boot process is to configure devices
needed for installation and maintenance procedures and start the
bi_main process.
Tape boot phases I, II and III diagram
The following chart shows the Tape boot phases I, II and III :
The flow branches on the phase number :
• Phase 1 : led 510, configuration manager Phase I, led 512, change all
tape devices' block_sizes to 512, exit 0
• Phase 2 : configuration manager Phase II, exec bi_main
• Phase 3 : cleanup links, cleanup the ODM and rebuild, exit 0
Network Boot process (rc.boot Phases I, II and III)
Introduction
The main objective of the Network boot process is to configure devices,
configure additional network options (network address, mask and default
route) and run the $RC_CONFIG script.
Network boot phases I, II and III diagram
The following chart shows the Network boot phases I, II and III :
The flow branches on the phase number :
• Phase 1 : led 600; set nim debug if needed; restbase; if booting from
atm0, save the ATM data; clear the ODM; configuration manager phase I;
if booting from atm0, configure ATM (pvc, svc and muxatmd), otherwise
configure the native network boot device (ifconfig); if the configuration
fails (rc <> 0), led 607; tftp the miniroot; set the NIM environment;
create /etc/hosts and routes; nfs mount the SPOT; run $RC_CONFIG from
the SPOT; exit 0
• Phase 2 : set nim debug if needed; set the NIM environment; run
$RC_CONFIG; exit 0
• Phase 3 : continue phase III common code
Common Boot process (rc.boot Phase III)
Introduction
The common Phase III boot code is run for disk and network boot only.
Common boot Phase III diagram
The following chart shows the common boot phase III process :
ensure 1024K free space in /tmp
load streams modules
fix the secondary dump device
swapon hd6 if no dump is present
run the savebase recovery procedure
• if the key is in service position : config manager phase III, disable controlling tty
• otherwise : clean the ODM for alt disk install, config manager phase II
setup System Hang Detection
run graphical boot if needed
run savebase
clean unavailable ttys from inittab
sync the files to hard disk
run /etc/rc.B1 if it exists
start the syncd daemon
start the errdaemon daemon
clean /etc/locks and /etc/nologin
start the mirrord daemon
start the cfgchk daemon
run diagsrv if supported by the platform
System initialization completed
exit 0
Network boot $RC_CONFIG files
Introduction
As seen in the Network Boot Process (Phases I, II and III), these scripts
are run by rc.boot when booting from a network device in phases I and II.
These scripts are located in the /usr/lib/boot/network directory.
They are loaded from the SPOT on the NIM server during the network boot
process.
rc.config types
There are 3 types of rc.config files :
• rc.bos_inst : Used to configure a system for AIX installation
• rc.dd_boot : Used for network boot of diskless or dataless systems
• rc.diag : Used for booting to diagnostics
rc.bos_inst
This script will :
• Phase I :
• Mount resources listed in niminfo as ${NIM_MOUNTS}
• Enable NIM debug if needed
• link necessary methods from the SPOT
• run configuration manager
• Phase II :
• Set some tcpip parameters
• enable diagnostics for pre-install diagnostics on disks
• execute bi_main
rc.dd_boot
This script will :
• Phase I :
• remove link from /lib to /usr/lib and populate /lib with hard links to /usr
to ensure the use of RAM libraries
• Mount the root directory
• get niminfo file
• unconfigure network services (ifconfig and routes)
• run configuration manager phase I
• reconfigure the network using the NIM information
• mount /usr
• activate the local or remote paging spaces
• issue mergedev
• unmount all remote filesystems
• Phase II :
• mount the dd_boot type filesystems
• clean up unused shared libraries
• set the hostname
rc.diag
This script will :
• Phase I:
• Mount resources listed in niminfo as ${NIM_MOUNTS}
• Enable NIM debug if needed
• link necessary methods from the SPOT
• run configuration manager
• Phase II :
• configure the console
• if graphic console configure gxme0 and rcm0
• For RSPC and CHRP : start the errdaemon, sleep 2 seconds and stop it,
to collect errors since the last boot
• Execute diag pretest before running diag
The init process
Introduction
The init process initializes and controls AIX processes.
The boot process, when running from the RAM filesystem (Phases I and
II), doesn't use the real init command but /usr/lib/boot/ssh.
This strategy allows for more efficient use of the system resources during
boot.
The real init is found in /usr/sbin/init. The real init begins during the kernel
newroot, which occurs at the end of Phase II of rc.boot.
The real init will use the /etc/inittab file to start AIX processes and run
system environment initialization scripts.
/etc/inittab
Here is an example of the inittab file :
init:2:initdefault:
brc::sysinit:/sbin/rc.boot 3 0</dev/console >/dev/console 2>&1
powerfail::powerfail:/etc/rc.powerfail 0</dev/console >/dev/console 2>&1 # Power Failure Detection
rc:2:wait:/etc/rc 0</dev/console >/dev/console 2>&1
fbcheck:2:wait:/usr/sbin/fbcheck 0</dev/console >/dev/console 2>&1 # Run /etc/firstboot
srcmstr:2:respawn:/usr/sbin/srcmstr # System Resource Controller
rctcpip:2:wait:/etc/rc.tcpip > /dev/console 2>&1 # Start TCP/IP daemons
rcnfs:2:wait:/etc/rc.nfs > /dev/console 2>&1 # Start NFS Daemons
cron:2:respawn:/usr/sbin/cron
cons:0123456789:respawn:/usr/sbin/getty /dev/console
writesrv:2:wait:/usr/bin/startsrc -swritesrv
uprintfd:2:respawn:/usr/sbin/uprintfd
shdaemon:2:off:/usr/sbin/shdaemon >/dev/console 2>&1 # High availability daemon
logsymp:2:once:/usr/lib/ras/logsymptom # for system dumps
lft:2:respawn:/usr/sbin/getty /dev/lft0
ODM Structure and usage
Introduction
The Object Data Manager is widely used in AIX to store and retrieve
various system information.
For this purpose, AIX defines a number of standard ODM classes.
Any application can create and use its own ODM classes to manage its
own information.
AIX information managed by ODM
AIX System data managed by ODM includes:
• Device configuration information
• Display information for SMIT (menus, selectors, and dialogs)
• Vital product data for installation and update procedures
• Diagnostics information
• System resource information
• RAS information
Devices ODM Classes
The Devices classes are used by the configuration manager, device
drivers and AIX device-related commands (lsdev, lsattr, lspv, lsvg, ...).
The following table lists the Devices ODM classes and their definitions :

Class          Definition
PdDv           Predefined Devices
PdCn           Predefined Connection
PdAt           Predefined Attribute
PdAtXtd        Extended Predefined Attribute
Config_Rules   Configuration Rules
CuDv           Customized Devices
CuDep          Customized Dependency
CuAt           Customized Attribute
CuDvDr         Customized Device Driver
CuVPD          Customized Vital Product Data
CuPart         EFI partitions
CuPath
CuPathAt
SWVPD ODM Classes
The SWVPD classes are used by fileset-related commands like installp,
instfix, lslpp and oslevel.
SWVPD is divided into 3 parts :
• root : classes are in /etc/objrepos
• usr : classes are in /usr/lib/objrepos
• share : classes are located in /usr/share/lib/objrepos
The following table lists the Software Vital Product Data ODM classes and
their definitions :

Class       Definition
lpp         The lpp object class contains information about the installed
            software products, including the current software product state.
inventory   The inventory object class contains information about the files
            associated with a software product.
history     The history object class contains historical information about
            the installation and updates of software products.
product     The product object class contains product information about the
            installation and updates of software products and their
            prerequisites.

SRC ODM Classes
The SRC classes are used by the srcmstr and related commands : lssrc,
startsrc, stopsrc and chssys.
The following table lists the System Resource Controller ODM classes and
their definitions :

Class        Definition
SRCsubsys    The subsystem object class contains the descriptors for all
             SRC subsystems. A subsystem must be configured in this class
             before it can be recognized by the SRC.
SRCsubsvr    An object must be configured in this class if a subsystem has
             subservers and the subsystem expects to receive
             subserver-related commands from the srcmstr daemon.
SRCnotify    This class provides a mechanism for the srcmstr daemon to
             invoke subsystem-provided routines when the failure of a
             subsystem is detected.
SRCextmeth
SMIT ODM Classes
The SMIT ODM classes are used by the smit and smitty commands.
The following table lists the SMIT ODM classes and their definitions :

Use             Class         Definition
smit menu       sm_menu_opt   1 for title of screen
                              1 for first item
                              1 for second item
                              ...
                              1 for last item
smit selector   sm_name_hdr   1 for title of screen and other attributes
smit selector   sm_cmd_opt    1 for entry field or pop-up list
smit dialog     sm_cmd_hdr    1 for title of screen and command string
smit dialog     sm_cmd_opt    1 for first entry field
                              1 for second entry field
                              ...
                              1 for last entry field
RAS ODM Classes
The RAS classes are used by the errdaemon, shdaemon, shconf and alog
commands.
The following table lists the RAS ODM classes and their definitions :

Class       Definition
errnotify   Used by the errlog notification process
SWservAt    Used by the error log, system dumps, System Hang Detection
            and alog
Diagnostics ODM Classes
The diagnostics classes are used by the diag command.
The following table lists the Diagnostics ODM classes and their definitions :

Class        Definition
PDiagRes     Predefined Diagnostic Resource Object Class
PDiagAtt     Predefined Diagnostic Attribute Device Object Class
PDiagTask    Predefined Diagnostic Task Object Class
CDiagAtt     Customized Diagnostic Attribute Object Class
TMInput      Test Mode Input Object Class
MenuGoal     Menu Goal Object Class
FRUB         Fru Bucket Object Class
FRUs         Fru Reporting Object Class
DAVars       Diagnostic Application Variables Object Class
PDiagDev     Predefined Diagnostic Devices Object Class
DSMOptions   Diagnostic Supervisor Menu Options Object Class

ODM commands
The following table lists the ODM commands and their usage :

Command     Definition
odmadd      Adds objects to an object class. The odmadd command takes an
            ASCII stanza file as input and populates object classes with
            objects found in the stanza file.
odmchange   Changes specific objects in a specified object class.
odmcreate   Creates empty object classes. The odmcreate command takes an
            ASCII file describing object classes as input and produces C
            language .h and .c files to be used by the application
            accessing objects in those object classes.
odmdelete   Removes objects from an object class.
odmdrop     Removes an entire object class.
odmget      Retrieves objects from object classes and puts the object
            information into odmadd command format.
odmshow     Displays the description of an object class. The odmshow
            command takes an object class name as input and puts the
            object class information into odmcreate command format.
ODM subroutines
The following table lists the ODM subroutines and their use :

subroutine         definition
odm_add_obj        Adds a new object to the object class.
odm_change_obj     Changes the contents of an object.
odm_close_class    Closes an object class.
odm_create_class   Creates an empty object class.
odm_err_msg        Retrieves a message string.
odm_free_list      Frees memory allocated for the odm_get_list subroutine.
odm_get_by_id      Retrieves an object by specifying its ID.
odm_get_first      Retrieves the first object that matches the specified
                   criteria in an object class.
odm_get_list       Retrieves a list of objects that match the specified
                   criteria in an object class.
odm_get_next       Retrieves the next object that matches the specified
                   criteria in an object class.
odm_get_obj        Retrieves an object that matches the specified criteria
                   from an object class.
odm_initialize     Initializes an ODM session.
odm_lock           Locks an object class or group of classes.
odm_mount_class    Retrieves the class symbol structure for the specified
                   object class.
odm_open_class     Opens an object class.
odm_rm_by_id       Removes an object by specifying its ID.
odm_rm_obj         Removes all objects that match the specified criteria
                   from the object class.
odm_run_method     Invokes a method for the specified object.
odm_rm_class       Removes an object class.
odm_set_path       Sets the default path for locating object classes.
odm_unlock         Unlocks an object class or group of classes.
odm_terminate      Ends an ODM session.
ODM paths
As the ODM classes can be found in 3 paths (root, usr and share), the user
must decide which path to use before running ODM commands or ODM
subroutines.
For ODM commands, the user can set the path using :
# export ODMDIR=/usr/share/lib/objrepos
In a C program, the user should use :
odm_set_path("/usr/lib/objrepos");
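As an illustration, here is a minimal C sketch that retrieves one customized
device object. It is only a sketch under stated assumptions: the headers
<odmi.h> and <sys/cfgodm.h> are assumed to declare the ODM subroutines,
CuDv_CLASS and struct CuDv, and the class, criteria string and field names
shown are illustrative choices, not the only possible ones :

#include <stdio.h>
#include <odmi.h>          /* assumed to declare the ODM subroutines */
#include <sys/cfgodm.h>    /* assumed to declare CuDv_CLASS and struct CuDv */

int main(void)
{
    struct CuDv cudv;
    void *rc;

    if (odm_initialize() == -1)        /* start the ODM session */
        return 1;
    odm_set_path("/etc/objrepos");     /* use the root part of the ODM */

    /* Retrieve the first CuDv object matching the criteria */
    rc = odm_get_obj(CuDv_CLASS, "name=hdisk0", &cudv, ODM_FIRST);
    if (rc != NULL && rc != (void *)-1)
        printf("%s status=%d\n", cudv.name, cudv.status);

    odm_terminate();                   /* end the ODM session */
    return 0;
}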
Boot and installation logging facilities
Introduction
It can be useful to quickly retrieve the log files used for boot or
installation to help solve problems.
The alog command can be used to recover these system logs.
log types
The alog command is used by the installation and boot processes to log
information or errors for the following topics :
• boot : log for the boot process
• bosinst : log used for the AIX installation process
• console : log used to store console messages
• nim : log used to store NIM messages
• dumpsymp : used to store dump symptom messages
alog command usage
The following alog commands may be used :
• alog -L : will list the alog log types defined in the ODM
• alog -t <log_type> -o : will display the log file related to the log_type
• echo “Message xxx” | alog -t <log_type> : will log the message to the
log file
• alog -L -t <log_type> : will display detailed information related to the
log_type definition (log file path, size and verbosity)
• alog -Cw <new_verbosity> -t <log_type> : will change the verbosity (0-9)
for the log_type
• alog -C -t <log_type> -s <new_size> -f <new_file> : will change the file
and file size for the log_type
• alog -V -t <log_type> : will display the current verbosity
example
The following example will output the last 15 lines of the boot log :
# alog -t boot -o|tail -15
Saving Base Customize Data to boot disk
Starting the sync daemon
Starting the error daemon
A device that was previously detected could not be found.
Run "diag -a" to update the system configuration.
System initialization completed.
Starting Multi-user Initialization
Performing auto-varyon of Volume Groups
Activating all paging spaces
0517-075 swapon: Paging device /dev/hd6 is already active.
/dev/rhd1 (/home): ** Unmounted cleanly - Check suppressed
Performing all automatic mounts
Replaying log for /dev/lv01.
Multi-user initialization completed
Debugging boot problems using KDB
Introduction
For debugging boot problems, it can be useful to get a detailed output of
the boot process, including the rc.boot output.
entering boot debug
To enter boot debugging, the administrator should first make sure the KDB
kernel debugger will be loaded and invoked at boot time, using :
# bosboot -I -ad /dev/ipldevice
The next reboot will launch KDB on the native serial connection.
At the KDB prompt you will need to toggle the rc.boot debug flag and,
optionally, the exec debug flag in order to get the rc.boot output on the
native serial connection.
Note that the exec tracing will continue after the end of rc.boot.
example
The following is an example of a boot debug session :
.......... kdb_tty_init done
.......... kdb_init_flihs done
region address     region length      nodeid att label
0000000000000000 0000000000FF1000 0000 01 01
0000000000FF1000 000000000000F000 0000 01 03
0000000001000000 0000000006FCC000 0000 01 01
0000000007FCC000 0000000000029000 0000 00 05
0000000007FF5000 000000000000B000 0000 01 02
0000000008000000 0000000018000000 0000 01 01
0000000020000000 FFFFFFFFE0000000 0000 00 07
Real memory size = 512 M Bytes
Model = 0800004C
Data cache size =  64 K Bytes
Inst cache size =  32 K Bytes
.......... kdb_mem_size done
.......... kdb_code_init done
Preserving 911247 bytes of symbol table
First symbol __mulh
START              END                <name>
0000000000003500 0000000000DB55A8 _system_configuration+000020
F00000002FF3B400 F00000002FFC0818 __ublock+000000
000000002FF22FF4 000000002FF22FF8 environ+000000
000000002FF22FF8 000000002FF22FFC _errno+000000
F100008080000000 F10000808A000000 pvproc+000000
F100008090000000 F100008094000000 pvthread+000000
F100000040000000 F100000040266C80 vmmdseg+000000
F1000013B0000000 F1000073B4800000 vmmswpft+000000
F100000BB0000000 F1000013B0000000 vmmswhat+000000
F100000050000000 F100000060000000 ptaseg+000000
F100000070000000 F1000000B0000000 ameseg+000000
F100009710000000 F100009720000000 KERN_heap+000000
F100009500000000 F100009510000000 lkwseg+000000
************* Welcome to KDB *************
Call gimmeabreak...
Static breakpoint:
.gimmeabreak+000000   tweq  r8,r8   r8=00000000F80003F8
.gimmeabreak+000004   blr           <.kdb_init+00021C> r3=0
KDB(0)> dbgopt <== Enter debug options
Debug options:
--------------
1. Toggle rc.boot tracing - currently DISABLED
2. Toggle tracing of exec calls - currently DISABLED
q. Exit
Enter option: 1 <== Enable rc.boot tracing
Debug options:
--------------
1. Toggle rc.boot tracing - currently ENABLED
2. Toggle tracing of exec calls - currently DISABLED
q. Exit
Enter option: 2 <== Enable exec calls tracing
Debug options:
--------------
1. Toggle rc.boot tracing - currently ENABLED
2. Toggle tracing of exec calls - currently ENABLED
q. Exit
Enter option: q <== here, we quit the debug option menu
KDB(0)> q <== here, we quit KDB so the boot process can continue
PFT:
id....................0007
raddr.....0000000001000000 eaddr.....0000000000000000
size..............00800000 align.............00800000
valid..1 ros....0 holes..0 io.....0 seg....1 wimg...2
PVT:
id....................0008
raddr.....0000000000692000 eaddr.....0000000000000000
size..............00100000 align.............00001000
valid..1 ros....0 holes..0 io.....0 seg....1 wimg...2
Exiting vmsi()
LED{814}
AIX Version 5.0
Starting NODE#000 physical CPU#002 as logical CPU#001... done.
exec(/etc/init)
exec(/usr/bin/sh,-c,/sbin/rc.boot 1)
exec(/sbin/rc.boot,/sbin/rc.boot,1)
+ [ 1 -ne 1 ]
+ PHASE=1
+ + bootinfo -p
exec(/usr/sbin/bootinfo,-p)
PLATFORM=chrp
Debugging boot problems using IADB
Introduction
For debugging boot problems, it can be useful to get a detailed output of
the boot process, including the rc.boot output.
Prerequisites
In order to get boot debug output you will need to have a device (TTY,
Thinkpad or another system's serial port) connected to the native serial
port and configured at 115200-8-N-1.
Process
The following process will be used to debug boot problems :

Step   Action
1      If you want the IADB to be invoked at boot time, use :
       # bosboot -I -ad /dev/ipldevice
       You can also choose not to do this and set the debugger flags
       manually in the boot loader menu.
2      Boot or reboot the system.
3      If you are using another system as the TTY, you may want to set
       some tracing/capture options to capture the debugging output.
4      If the autoboot flag is not set in EFI, set the file system and
       boot using :
       Shell> fs0:
       fs0> boot
5      The boot loader menu should come up with the debugger flags set
       "ON" if you ran bosboot in step 1. Otherwise, hit a key to enter
       the boot loader menu and set the debugger flags. Then exit the
       boot loader menu.
6      The boot loader will load the IADB, which will prompt on the
       native serial port. At the IADB prompt type :
       CPU0> set dbgmsg=on
       CPU0> set exectrace=on
       CPU0> go
boot debugging output example
The following example shows the beginning of what you can see on the
native serial port when debugging the boot process :
MEDIEVAL DEBUGGER ENTERED interrupt.
IP->E00000000001D2F2 brkpoint()+2: { .mfi
    0: nop.m 0x100001
    ;; }
>CPU0> set dbgmsg=on
>CPU0> set exectrace=on <== here we ask for debugging
>CPU0> go <== here we go
See Ya!
Performing Hostile Takeover of the System Console...
AIX Version 5.0
Starting CPU#001... done.
+ ODMSTRNG=attribute=keylock and value=service
+ HOME=/
+ LIBPATH=/usr/lib:/lib:/usr/sbin:/etc:/usr/bin
+ SHOWLED=showled
+ SYSCFG_PHASE=BOOT
+ export HOME LIBPATH ODMDIR PATH SHOWLED SYSCFG_PHASE
+ umask 077
+ set -x
+ [ 1 -ne 1 ]
+ PHASE=1
+ + bootinfo -p
PLATFORM=ia64
+ [ ! -x /usr/lib/boot/bin/bootinfo_ia64 ]
+ [ 1 -eq 1 ]
+ 1> /)
+ + bootinfo -t
BOOTYPE=3
+ [ 0 -ne 0 ]
+ [ -z 3 ]
+ unset pdev_to_ldev undolt
Packaging Changes
Introduction
The lpp packaging has been reviewed to reflect the need for
platform-dependent packages.
Package names
Package names have the following structure :
<pkg_name>.V.R.M.F.<platform_type>.<install_type>.bff where :
• <pkg_name> is the name of the package to be installed
• V.R.M.F are the Version, Release, Modification and Fix levels of the
package
• <platform_type> is the platform type for which the package was
designed. The platform type can be one of :
• I : For Intel IA-64 platforms
• N : For Neutral packages that can be installed on all platforms
• Nothing : For Power-specific packages
Packaging commands
The installp, bffcreate, inutoc and instfix commands have been updated to
reflect these changes.
By default, the packaging commands will process only packages related to
the platform where the command is run.
A “-M” flag has been added to these commands; it accepts the following
sub-options :
• I : To process Intel-related packages
• R : To process Power-related packages
• N : To process Neutral packages
• A : To process all kinds of packages
installp options
The installp command will only accept the -M flag with the -l or -L options.
bffcreate options
The bffcreate command will accept all -M sub-options, to allow transit of
packages regardless of the current platform. This is needed for NIM
operations.
instfix options
The instfix command, like the installp command, will only accept the -M
flag when used in conjunction with the -T (list) flag.
The installp -L output will include platform information.
inutoc command
The inutoc command will accept the -M flag.
Checkpoint
Introduction
Take a few minutes to answer the following questions. We will review the
questions as a group when everyone has finished.
Quiz
1. Who calls rc.boot ?
2. What is common to phase II of the tape, cdrom and network boots ?
3. What is specific to rc.boot phase III ?
4. What will you need to do if you want to modify something in rc.boot
phase I or II ?
5. Which phase and/or device in rc.boot is not supported on IA-64 ?
6. What is the usage of the ODM ?
7. What is init in the first two phases of the boot ?
Instructor Notes
Purpose
<which objective does this map address >
Answers
1. Who calls rc.boot ?
• init
2. What is common to phase II of the tape, cdrom and network boots ?
• They exec bi_main (rc.bos_inst for network) to run installation tasks
3. What is specific to rc.boot phase III ?
• rc.boot phase III is called by the actual init process after newroot.
4. What will you need to do if you want to modify something in rc.boot
phase I or II ?
• You will need to run bosboot in order to copy your changed rc.boot to
the RAMFS
5. Which phase and/or device in rc.boot is not supported on IA-64 ?
• The tape boot device (this was said in the map on the various types of boot)
6. What is the usage of the ODM ?
• Store and retrieve system information
7. What is init in the first two phases of the boot ?
• ssh
Unit 15. /proc Filesystem Support
This unit describes the /proc filesystem in the AIX 5L kernel.
What You Should Be Able to Do
After completing this unit, you should be able to
• List the directories and files that are found in the /proc
filesystem
• Describe the basic functionality of each file in the sub-directory
tree for a specific process
• Create a simple C program to access the files belonging to
another process
/proc Filesystem Support
Introduction
/proc is a file system that provides access to the state of each active
process and Light Weight Process (LWP) in the system.
Platform
This lesson is platform independent.
/proc filesystem
The contents of the /proc filesystem have the same appearance as any
other files and directories in a Unix filesystem. Each top-level entry in
the /proc directory is a sub-directory named by the decimal number
corresponding to a process ID, and the owner of each is determined by the
user ID of the process.
Access to process state is provided by additional files contained within
each sub-directory; this hierarchy is described more completely below.
Except where otherwise specified, ‘‘/proc file’’ is meant to refer to a
non-directory file within the hierarchy rooted at /proc.
Filesystem hierarchy
The directory structure for the /proc directory is described below. The pid
represents the process ID number and the lwp# represents the light-weight
process number.
File/Directory Name            Description
/proc                          directory - list of processes
/proc/pid                      directory for process pid
/proc/pid/status               status of process pid
/proc/pid/ctl                  control file for process pid
/proc/pid/psinfo               ps info for process pid
/proc/pid/as                   address space of process pid
/proc/pid/map                  as map info for process pid
/proc/pid/object               directory for objects for process pid
/proc/pid/sigact               signal actions for process pid
/proc/pid/lwp/lwp#             directory for LWP lwp#
/proc/pid/lwp/lwp#/lwpstatus   status of LWP lwp#
/proc/pid/lwp/lwp#/lwpctl      control file for LWP lwp#
/proc/pid/lwp/lwp#/lwpsinfo    ps info for LWP lwp#
Accessing /proc files
Standard system call interfaces are used to access /proc files: open(2),
close(2), read(2), and write(2). Most files describe process state and can
only be opened for reading. An open for writing allows process control; a
read-only open allows inspection, but not control.
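As a minimal C sketch of these access rules (the process ID 1234 is
hypothetical) :

#include <fcntl.h>
#include <stdio.h>

int main(void)
{
    /* Read-only open: allows inspection of process state */
    int statusfd = open("/proc/1234/status", O_RDONLY);

    /* Write-only open: allows control of the process via its ctl file */
    int ctlfd = open("/proc/1234/ctl", O_WRONLY);

    if (statusfd == -1 || ctlfd == -1)
        perror("open");
    return 0;
}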
Types of Files
Introduction
Listed below are descriptions of the files that are contained in the /proc
filesystem hierarchy. These files are described in more detail on the
following pages.

Filename         Mode    Function
as               rd/wr   Contains the address-space image of the process
ctl              wr      Allows change to the process state or behaviour
status           rd      Contains state information about the process
psinfo           rd      Information about the process needed by the ps(1)
                         command
map              rd      Information about the virtual address map of the
                         process
cred             rd      Describes the credentials associated with the
                         process
sigact           rd      Describes the disposition of all signals associated
                         with the process
object           N/A     A directory containing read-only files with names as
                         they appear in the map file
lwp              N/A     A directory for LWPs
lwp#/lwpstatus   rd      State information for LWP lwp#
lwp#/lwpctl      wr      Allows change to the process state or behaviour of
                         LWP lwp#
lwp#/lwpsinfo    ??      Process info for LWP lwp#
The as File
Introduction
The as file contains the address-space image of the process and can be
opened for both reading and writing.
Accessing the file
lseek is used to position the file at the virtual address of interest and then
the address space can be examined or changed through a read or write.
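For example, a minimal C sketch (the process ID and the virtual address are
hypothetical) that reads one long from another process's address space :

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long word;
    int asfd = open("/proc/1234/as", O_RDONLY);   /* address-space image */

    if (asfd == -1) {
        perror("open");
        return 1;
    }

    /* Position at the virtual address of interest, then read it */
    lseek(asfd, 0x20000000L, SEEK_SET);
    if (read(asfd, &word, sizeof word) == (ssize_t)sizeof word)
        printf("word at 0x20000000 = 0x%lx\n", word);

    close(asfd);
    return 0;
}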
The ctl File
Introduction
The ctl file is a write-only file to which structured messages are written
directing the system to change some aspect of the process’s state or
control its behavior in some way. The seek offset is not relevant when
writing to this file.
Control messages
Individual LWPs also have associated lwpctl files. Process state changes
are effected through control messages written either to the ctl file of the
process or to a specific lwpctl file. All control messages consist of an int
naming the specific operation followed by additional data containing
operands (if any). The effect of a control message is immediately reflected
in the state of the process visible through appropriate status and
information files.
Multiple control messages can be combined in a single write(2) to a control
file, but no partial writes are permitted; that is, each control message
(operation code plus operands) must be presented in its entirety to the
write and not in pieces over several system calls.
Descriptions of control messages
Descriptions of allowable control messages are included on page 20.
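As a sketch of the message format (the process ID is hypothetical, PCSTOP
stands for one of the operation codes described there, and the codes are
assumed to come from <sys/procfs.h>) :

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/procfs.h>   /* assumed to define the operation codes */

int main(void)
{
    int msg = PCSTOP;     /* operation code; this one takes no operands */
    int ctlfd = open("/proc/1234/ctl", O_WRONLY);

    if (ctlfd == -1) {
        perror("open");
        return 1;
    }

    /* The whole message (code plus any operands) must go in one write */
    if (write(ctlfd, &msg, sizeof msg) == -1)
        perror("write");

    close(ctlfd);
    return 0;
}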
The status File
Introduction
The status file contains state information about the process and one of its
LWPs (chosen according to the rules described below).
File format
The file is formatted as a struct pstatus containing the following members:
long        pr_flags;      /* Flags */
ushort_t    pr_nlwp;       /* Total number of lwps in the process */
sigset_t    pr_sigpend;    /* Set of process pending signals */
vaddr_t     pr_brkbase;    /* Address of the process heap */
ulong_t     pr_brksize;    /* Size of the process heap, in bytes */
vaddr_t     pr_stkbase;    /* Address of the process stack */
ulong_t     pr_stksize;    /* Size of the process stack, in bytes */
pid_t       pr_pid;        /* Process id */
pid_t       pr_ppid;       /* Parent process id */
pid_t       pr_pgid;       /* Process group id */
pid_t       pr_sid;        /* Session id */
timestruc_t pr_utime;      /* Process user cpu time */
timestruc_t pr_stime;      /* Process system cpu time */
timestruc_t pr_cutime;     /* Sum of children's user times */
timestruc_t pr_cstime;     /* Sum of children's system times */
sigset_t    pr_sigtrace;   /* Mask of traced signals */
fltset_t    pr_flttrace;   /* Mask of traced faults */
sysset_t    pr_sysentry;   /* Mask of system calls traced on entry */
sysset_t    pr_sysexit;    /* Mask of system calls traced on exit */
lwpstatus_t pr_lwp;        /* "representative" LWP */
Member description
Here is a description of the members of the status file:

Member        Description
pr_flags      A bit mask holding flags (flags are described below)
pr_nlwp       Total number of LWPs in the process
pr_brkbase    Virtual address of the process heap
pr_brksize    Size of the process heap in bytes. The address formed by the
              sum of these values is the process break (see brk(2)).
pr_stkbase    Virtual address of the process stack
pr_stksize    Size of the process stack in bytes. Each LWP runs on a
              separate stack; the process stack is distinguished in that the
              operating system will grow it as necessary.
pr_pid        Process ID
pr_ppid       Parent process ID
pr_pgid       Process group ID
pr_sid        Session ID of the process
pr_utime      User CPU time consumed by the process in seconds and
              nanoseconds
pr_stime      System CPU time consumed by the process in seconds and
              nanoseconds
pr_cutime     Cumulative user CPU time consumed by the process's children
              in seconds and nanoseconds
pr_cstime     Cumulative system CPU time consumed by the process's children
              in seconds and nanoseconds
pr_sigtrace   Set of signals that are being traced (see PCSTRACE)
pr_flttrace   Set of hardware faults that are being traced (see PCSFAULT)
pr_sysentry   Set of system calls being traced on entry (see PCSENTRY)
pr_sysexit    Set of system calls being traced on exit (see PCSEXIT)
pr_lwp        If the process is not a zombie, pr_lwp contains an lwpstatus_t
              structure describing a representative LWP. The contents of
              this structure have the same meaning as if it were read from
              an lwpstatus file.
pr_flags
pr_flags is a bit-mask holding these flags:

Flag       Description
PR_ISSYS   System process (see PCSTOP)
PR_FORK    Has its inherit-on-fork flag set (see PCSET)
PR_RLC     Has its run-on-last-close flag set (see PCSET)
PR_KLC     Has its kill-on-last-close flag set (see PCSET)
PR_ASYNC   Has its asynchronous-stop flag set (see PCSET)

Multi-threaded applications
When the process has more than one LWP, its representative LWP is
chosen by the /proc implementation. The chosen LWP is a stopped LWP
only if all the process’s LWPs are stopped, is stopped on an event of
interest only if all the LWPs are so stopped, or is in a PR_REQUESTED
stop only if there are no other events of interest to be found. The chosen
LWP remains fixed as long as all the LWPs are stopped on events of
interest and PCRUN is not applied to any of them.
When applied to the process control file, every /proc control operation that
must act on an LWP uses the same algorithm to choose which LWP to act
on. Together with synchronous stopping (see PCSET), this enables an
application to control a multiple-LWP process using only the process-level
status and control files if it so chooses. More fine-grained control can be
achieved using the LWP-specific files.
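A minimal C sketch of reading the process-level status file (the process ID
is hypothetical; struct pstatus is assumed to come from <sys/procfs.h>) :

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/procfs.h>   /* assumed to define struct pstatus */

int main(void)
{
    struct pstatus st;
    int fd = open("/proc/1234/status", O_RDONLY);

    if (fd == -1) {
        perror("open");
        return 1;
    }

    /* One read returns the whole fixed-size structure */
    if (read(fd, &st, sizeof st) == (ssize_t)sizeof st)
        printf("pid %ld: %u LWPs, flags 0x%lx\n",
               (long)st.pr_pid, (unsigned)st.pr_nlwp,
               (unsigned long)st.pr_flags);

    close(fd);
    return 0;
}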
The psinfo file
Introduction
The psinfo file contains information about the process needed by the ps(1)
command. If the process contains more than one LWP, a representative
LWP (chosen according to the rules described for the status file) is used to
derive the status information.
File format
The file is formatted as a struct psinfo containing the following members:

ulong_t         pr_flag;              /* process flags */
ulong_t         pr_nlwp;              /* number of LWPs in process */
uid_t           pr_uid;               /* real user id */
gid_t           pr_gid;               /* real group id */
pid_t           pr_pid;               /* unique process id */
pid_t           pr_ppid;              /* process id of parent */
pid_t           pr_pgid;              /* pid of process group leader */
pid_t           pr_sid;               /* session id */
caddr_t         pr_addr;              /* internal address of process */
long            pr_size;              /* size of process image in pages */
long            pr_rssize;            /* resident set size in pages */
timestruc_t     pr_start;             /* process start time, time since epoch */
timestruc_t     pr_time;              /* usr+sys cpu time for this process */
dev_t           pr_ttydev;            /* controlling tty device (or PRNODEV) */
char            pr_fname[PRFNSZ];     /* last component of exec()ed pathname */
char            pr_psargs[PRARGSZ];   /* initial characters of arg list */
struct lwpsinfo pr_lwp;               /* "representative" LWP */
Platform
specific data
Some of the entries in psinfo, such as pr_flag and pr_addr, refer to internal
kernel data structures and should not be expected to retain their meanings
across different versions of the operating system. They have no meaning
to a program and are only useful for manual interpretation by a user aware
of the implementation details.
Zombies
psinfo is still accessible even after a process becomes a zombie.
Representative
LWP
pr_lwp describes the representative LWP chosen as described under the
pstatus file above. If the process is a zombie, pr_nlwp and pr_lwp.pr_lwpid
are zero and the other fields of pr_lwp are undefined.
The map File
Introduction
The map file contains information about the virtual address map of the
process. The file contains an array of prmap structures, each of which
describes a contiguous virtual address region in the address space of the
traced process.
File format
The prmap structure contains the following members:

caddr_t  pr_vaddr;        /* Virtual address */
ulong_t  pr_size;         /* Size of mapping in bytes */
char     pr_mapname[32];  /* Name in /proc/pid/object */
off_t    pr_off;          /* Offset into mapped object, if any */
long     pr_mflags;       /* Protection and attribute flags */
long     pr_filler[9];    /* For future use */

Member description

Members of the map file are described below:

Member        Description
pr_vaddr      Virtual address of the mapping within the traced process
pr_size       Size of the mapping in bytes
pr_mapname    If not the empty string, contains the name of a file in
              the object directory that can be opened for reading to
              yield a file descriptor for the object to which the
              virtual address is mapped.
pr_off        Offset within the mapped object (if any) to which the
              virtual address is mapped
pr_mflags     Protection and attribute flags (see below)
pr_filler     For future use
pr_mflags is a bit-mask of protection and attribute flags:

Flag        Description
MA_READ     Mapping is readable by the traced process
MA_WRITE    Mapping is writable by the traced process
MA_EXEC     Mapping is executable by the traced process
MA_SHARED   Mapping changes are shared by the mapped object
Contiguous address space
A contiguous area of the address space having the same underlying
mapped object may appear as multiple mappings because of varying read,
write, execute, and shared attributes. The underlying mapped object does
not change over the range of a single mapping. An I/O operation to a
mapping marked MA_SHARED fails if applied at a virtual address not
corresponding to a valid page in the underlying mapped object. Reads and
writes to private mappings always succeed. Reads and writes to
unmapped addresses always fail.
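As a sketch (not a definitive implementation), a controlling process can
walk the map file by reading prmap structures until the data is
exhausted; the pid below is hypothetical.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/procfs.h>

int main(void)
{
    struct prmap m;
    int fd = open("/proc/1234/map", O_RDONLY);

    if (fd < 0)
        return 1;
    /* the file is an array of prmap structures; read until EOF
       (an implementation may also end the array with a zero-filled
       entry, so stop on an empty mapping as well) */
    while (read(fd, &m, sizeof(m)) == sizeof(m) && m.pr_size != 0)
        printf("%p %lu %s%s%s %s\n", (void *)m.pr_vaddr,
               (unsigned long)m.pr_size,
               (m.pr_mflags & MA_READ)  ? "r" : "-",
               (m.pr_mflags & MA_WRITE) ? "w" : "-",
               (m.pr_mflags & MA_EXEC)  ? "x" : "-",
               m.pr_mapname);
    close(fd);
    return 0;
}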
The cred File
Introduction
The cred file contains a description of the credentials associated with the
process.
File format
The file is formatted as a struct prcred containing the following members:

uid_t   pr_euid;       /* Effective user id */
uid_t   pr_ruid;       /* Real user id */
uid_t   pr_suid;       /* Saved user id (from exec) */
gid_t   pr_egid;       /* Effective group id */
gid_t   pr_rgid;       /* Real group id */
gid_t   pr_sgid;       /* Saved group id (from exec) */
uint_t  pr_ngroups;    /* Number of supplementary groups */
gid_t   pr_groups[1];  /* Array of supplementary groups */
The list of associated supplementary groups in pr_groups is of variable
length; pr_ngroups specifies the number of groups.
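Because the record is variable length, a reader should size its buffer
for the largest possible group list. A minimal sketch, assuming
NGROUPS_MAX bounds the supplementary group count and using a
hypothetical pid:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <limits.h>
#include <sys/procfs.h>

int main(void)
{
    /* room for the base record plus NGROUPS_MAX groups */
    char buf[sizeof(struct prcred) + NGROUPS_MAX * sizeof(gid_t)];
    struct prcred *cr = (struct prcred *)buf;
    int fd = open("/proc/1234/cred", O_RDONLY);
    unsigned int i;

    if (fd < 0 || read(fd, buf, sizeof(buf)) < (ssize_t)sizeof(*cr))
        return 1;
    printf("euid=%d ruid=%d suid=%d groups:", (int)cr->pr_euid,
           (int)cr->pr_ruid, (int)cr->pr_suid);
    for (i = 0; i < cr->pr_ngroups; i++)
        printf(" %d", (int)cr->pr_groups[i]);
    printf("\n");
    close(fd);
    return 0;
}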
The sigact File
Introduction
The sigact file contains an array of sigaction structures describing the
current dispositions of all signals associated with the traced process.
Signal numbers are displaced by 1 from array indexes, so that the action
for signal number n appears in position n-1 of the array.
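The off-by-one indexing matters when looking up a particular signal. A
sketch (hypothetical pid; NSIG is assumed to bound the array):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <signal.h>

int main(void)
{
    struct sigaction act[NSIG];
    ssize_t n;
    int fd = open("/proc/1234/sigact", O_RDONLY);

    if (fd < 0)
        return 1;
    n = read(fd, act, sizeof(act));
    /* the action for signal number n lives at index n - 1 */
    if (n >= (ssize_t)(SIGINT * sizeof(struct sigaction)))
        printf("SIGINT handler: %p\n",
               (void *)act[SIGINT - 1].sa_handler);
    close(fd);
    return 0;
}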
The lwp/lwpctl File
Introduction
The lwpctl file is a write-only control file. Messages written to this
file affect only the associated LWP rather than the process as a whole
(where the distinction applies).
The lwp/lwpstatus File
Introduction
The lwp/lwpstatus file contains LWP-specific state information. This
information is also present in the status file of the process for its
representative LWP.
File format
The file is formatted as a struct lwpstatus containing the following
members:

long       pr_flags;               /* Flags */
short      pr_why;                 /* Reason for stop (if stopped) */
short      pr_what;                /* More detailed reason */
lwpid_t    pr_lwpid;               /* Specific LWP identifier */
short      pr_cursig;              /* Current signal */
siginfo_t  pr_info;                /* Info associated with signal or fault */
struct sigaction pr_action;        /* Signal action for current signal */
sigset_t   pr_lwppend;             /* Set of LWP pending signals */
stack_t    pr_altstack;            /* Alternate signal stack info */
short      pr_syscall;             /* System call number (if in syscall) */
short      pr_nsysarg;             /* Number of arguments to this syscall */
long       pr_sysarg[PRSYSARGS];   /* Arguments to this syscall */
char       pr_clname[PRCLSZ];      /* Scheduling class name */
ucontext_t pr_context;             /* LWP context */
pfamily_t  pr_family;              /* Processor family-specific information */
Member description

Here is a description of the members of the lwpstatus file:

Member        Description
pr_flags      A bit mask holding flags (described below)
pr_why        Reason for the LWP stop (if stopped). Possible values are
              listed below.
pr_what       More detailed reason for the LWP stop. pr_why and pr_what
              together describe the reason for a stopped LWP.
pr_lwpid      Specific LWP identifier.
pr_cursig     Names the current signal; that is, the next signal to be
              delivered to the LWP.
pr_info       When the LWP is in a PR_SIGNALLED or PR_FAULTED stop,
              pr_info contains additional information pertinent to the
              particular signal or fault (see sys/siginfo.h).
pr_action     Contains signal action information for the current signal
              (see sigaction(2)). It is undefined if pr_cursig is zero.
pr_lwppend    Identifies any synchronously generated or LWP-directed
              signals pending for the LWP. Does not include signals
              pending at the process level.
pr_altstack   Contains the alternate signal stack information for the
              LWP (see sigaltstack(2)).
pr_syscall    Number of the system call, if any, being executed by the
              LWP. It is nonzero if and only if the LWP is stopped on
              PR_SYSENTRY or PR_SYSEXIT or is asleep within a system
              call (PR_ASLEEP is set).
pr_nsysarg    If pr_syscall is nonzero, pr_nsysarg is the number of
              arguments to the system call.
pr_sysarg     Array of arguments to the system call.
pr_clname     Contains the name of the scheduling class of the LWP.
pr_context    Contains the user context of the LWP, as if it had called
              getcontext(2). If the LWP is not stopped, all context
              values are undefined.
pr_family     Contains CPU family-specific information about the LWP.
              Use of this field is not portable across different
              architectures.
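For example, a controller might check why a particular LWP stopped. This
sketch assumes the lwp directory entries are named by LWP id (the pid
and LWP id are hypothetical):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/procfs.h>

int main(void)
{
    struct lwpstatus ls;
    int fd = open("/proc/1234/lwp/1/lwpstatus", O_RDONLY);

    if (fd < 0 || read(fd, &ls, sizeof(ls)) != sizeof(ls))
        return 1;
    /* pr_why and pr_what together explain a stopped LWP */
    if (ls.pr_flags & PR_STOPPED)
        printf("lwp %d stopped: why=%d what=%d\n",
               (int)ls.pr_lwpid, ls.pr_why, ls.pr_what);
    close(fd);
    return 0;
}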
pr_flags
pr_flags is a bit-mask holding these flags:

Flag         Description
PR_STOPPED   LWP is stopped
PR_ISTOP     LWP is stopped on an event of interest (see PCSTOP)
PR_DSTOP     LWP has a stop directive in effect (see PCSTOP)
PR_STEP      LWP has a single-step directive in effect
PR_ASLEEP    LWP is in an interruptible sleep within a system call
PR_PCINVAL   LWP program counter register does not point to a valid
             address

pr_why
Possible values of pr_why are:

Value            Description
PR_REQUESTED     Shows that the stop occurred in response to a stop
                 directive, normally because PCSTOP was applied or
                 because another LWP stopped on an event of interest
                 and the asynchronous-stop flag (see PCSET) was not set
                 for the process. pr_what is unused in this case.
PR_SIGNALLED     Shows that the LWP stopped on receipt of a signal (see
                 PCSTRACE); pr_what holds the signal number that caused
                 the stop (for a newly stopped LWP, the same value is
                 in pr_cursig).
PR_FAULTED       Shows that the LWP stopped on incurring a hardware
                 fault (see PCSFAULT); pr_what holds the fault number
                 that caused the stop.
PR_SYSENTRY      Shows a stop on entry to or exit from a system call
PR_SYSEXIT       (see PCSENTRY and PCSEXIT); pr_what holds the system
                 call number.
PR_JOBCONTROL    Shows that the LWP stopped because of the default
                 action of a job control stop signal (see
                 sigaction(2)); pr_what holds the stopping signal
                 number.
The lwp/lwpsinfo File
Introduction
The lwp/lwpsinfo file contains information about the LWP needed by ps(1).
This information also is present in the psinfo file of the process for its
representative LWP if it has one.
File format
The file is formatted as a struct lwpsinfo containing the following
members:

ulong_t       pr_flag;           /* LWP flags */
lwpid_t       pr_lwpid;          /* LWP id */
caddr_t       pr_addr;           /* internal address of LWP */
caddr_t       pr_wchan;          /* wait addr for sleeping LWP */
uchar_t       pr_stype;          /* synchronization event type */
uchar_t       pr_state;          /* numeric scheduling state */
char          pr_sname;          /* printable character representing pr_state */
uchar_t       pr_nice;           /* nice for cpu usage */
int           pr_pri;            /* priority, high value = high priority */
timestruc_t   pr_time;           /* usr+sys cpu time for this LWP */
char          pr_clname[8];      /* scheduling class name */
char          pr_name[PRFNSZ];   /* name of system LWP */
processorid_t pr_onpro;          /* processor on which LWP is running */
processorid_t pr_bindpro;        /* processor to which LWP is bound */
processorid_t pr_exbindpro;      /* processor to which LWP is exbound */
Platform-specific data
Some of the entries in lwpsinfo, such as pr_flag, pr_addr, pr_state,
pr_stype, pr_wchan, and pr_name, refer to internal kernel data structures
and should not be expected to retain their meanings across different
versions of the operating system. They have no meaning to a program and
are only useful for manual interpretation by a user aware of the
implementation details.
Control Messages
Introduction
Process state changes are effected through messages written to the ctl
file of the process or to the lwpctl file of an individual LWP.
Sending
control
messages
All control messages consist of an int naming the specific operation
followed by additional data containing operands (if any). Multiple control
messages can be combined in a single write(2) to a control file, but no
partial writes are permitted; that is, each control message (operation code
plus operands) must be presented in its entirety to the write and not in
pieces over several system calls.
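For example, a helper can marshal an operation code and an int operand
into one buffer so the pair reaches write(2) as a single, complete
message. This is an illustrative sketch; memcpy is used to avoid
assumptions about structure padding.

#include <string.h>
#include <unistd.h>
#include <sys/procfs.h>

/* send one control message: an int code followed by an int operand;
   fd is an open ctl (or lwpctl) file descriptor */
int send_ctl_int(int fd, int code, int operand)
{
    char msg[2 * sizeof(int)];

    memcpy(msg, &code, sizeof(code));
    memcpy(msg + sizeof(code), &operand, sizeof(operand));
    return write(fd, msg, sizeof(msg)) == sizeof(msg) ? 0 : -1;
}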
ENOENT
Note that writing a message to a control file for a process or LWP that has
exited elicits the error ENOENT.
List of messages

Here is a list of the allowable control messages:

Control Message    Description
PCSTOP             Directs one or all LWPs to stop and waits for them
                   to stop
PCDSTOP            Directs one or all LWPs to stop without waiting for
                   them to stop
PCWSTOP            Waits for one or all LWPs to stop
PCRUN              Makes an LWP runnable again after a stop
PCSTRACE           Defines a set of signals to be traced in the process
PCSSIG             Sets the current signal and its associated signal
                   information for the specific or chosen LWP
PCKILL             Sends a signal to the process or LWP
PCUNKILL           Deletes a pending signal from the process or LWP
PCSHOLD            Sets the held signals for the specific or chosen LWP
                   according to the operand sigset_t structure
PCSFAULT           Defines a set of hardware faults to be traced in the
                   process
PCSTOP, PCDSTOP, and PCWSTOP
Introduction
Three control messages stop LWPs, each in a different way:
• PCSTOP
• PCDSTOP
• PCWSTOP
PCSTOP
When applied to the process control file, directs all LWPs to stop and waits
for them to stop. Completes when every LWP has stopped on an event of
interest.
When applied to an LWP control file, directs the specific LWP to stop and
waits until it has stopped. Completes when the LWP stops on an event of
interest, immediately if already so stopped.
PCDSTOP
When applied to the process control file, directs all LWPs to stop without
waiting for them to stop.
When applied to an LWP control file, directs the specific LWP to stop
without waiting for it to stop.
PCWSTOP
When applied to the process control file, simply waits for all LWPs to stop.
Completes when every LWP has stopped on an event of interest.
When applied to an LWP control file, simply waits for the LWP to stop.
Completes when the LWP stops on an event of interest, immediately if
already so stopped.
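Since PCSTOP carries no operand, its message is just the operation code.
A sketch stopping all LWPs of a hypothetical process:

#include <fcntl.h>
#include <unistd.h>
#include <sys/procfs.h>

int main(void)
{
    int fd = open("/proc/1234/ctl", O_WRONLY);
    int cmd = PCSTOP;

    /* direct every LWP to stop and wait for them to do so */
    if (fd < 0 || write(fd, &cmd, sizeof(cmd)) != sizeof(cmd))
        return 1;
    close(fd);
    return 0;
}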
Event of interest
An event of interest is either a PR_REQUESTED stop or a stop that has
been specified in the process’s tracing flags (set by PCSTRACE,
PCSFAULT, PCSENTRY, and PCSEXIT). A PR_JOBCONTROL stop is
specifically not an event of interest. (An LWP may stop twice because of a
stop signal; first showing PR_SIGNALLED if the signal is traced and again
showing PR_JOBCONTROL if the LWP is set running without clearing the
signal.) If PCSTOP or PCDSTOP is applied to an LWP that is stopped, but
not on an event of interest, the stop directive takes effect when the LWP is
restarted by the competing mechanism; at that time the LWP enters a
PR_REQUESTED stop before executing any user-level code.
Blocked control messages
A write of a control message that blocks is interruptible by a signal so that,
for example, an alarm(2) can be set to avoid waiting forever for a process
or LWP that may never stop on an event of interest. If PCSTOP is
interrupted, the LWP stop directives remain in effect even though the write
returns an error.
System process
A system process (indicated by the PR_ISSYS flag) never executes at
user level, has no user-level address space visible through /proc, and
cannot be stopped. Applying PCSTOP, PCDSTOP, or PCWSTOP to a
system process or any of its LWPs elicits the error EBUSY.
PCRUN
Introduction
The control message PCRUN makes an LWP runnable again after a stop.
The operand is a set of flags, contained in a ulong_t, describing optional
additional actions.
Flag descriptions
Here is a description of the flags contained in the operand of PCRUN:
Flag         Description
PRCSIG       Clears the current signal, if any (see PCSSIG)
PRCFAULT     Clears the current fault, if any (see PCCFAULT)
PRSTEP       Directs the LWP to execute a single machine instruction.
             On completion of the instruction, a trace trap occurs. If
             FLTTRACE is being traced, the LWP stops; otherwise it is
             sent SIGTRAP. If SIGTRAP is being traced and not held, the
             LWP stops. When the LWP stops on an event of interest, the
             single-step directive is cancelled, even if the stop
             occurs before the instruction is executed. This operation
             requires hardware and operating system support and may
             not be implemented on all processors.
PRSABORT     Is significant only if the LWP is in a PR_SYSENTRY stop
             or is marked PR_ASLEEP; it instructs the LWP to abort
             execution of the system call (see PCSENTRY, PCSEXIT).
PRSTOP       Directs the LWP to stop again as soon as possible after
             resuming execution (see PCSTOP). In particular, if the
             LWP is stopped on PR_SIGNALLED or PR_FAULTED, the next
             stop will show PR_REQUESTED, no other stop will have
             intervened, and the LWP will not have executed any
             user-level code.

Using PCRUN on an LWP

When applied to an LWP control file, PCRUN makes the specific LWP
runnable. The operation fails (EBUSY) if the specific LWP is not stopped
on an event of interest.
Using PCRUN on a process
When applied to the process control file an LWP is chosen for the
operation as described for /proc/pid/status. The operation fails (EBUSY) if
the chosen LWP is not stopped on an event of interest. If PRSTEP or
PRSTOP were requested, the chosen LWP is made runnable; otherwise,
the chosen LWP is marked PR_REQUESTED. If as a result all LWPs are
in the PR_REQUESTED stop state, they are all made runnable.
Once an LWP has been made runnable by PCRUN, it is no longer stopped
on an event of interest even if, because of a competing mechanism, it
remains stopped.
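A sketch issuing PCRUN with the PRSTEP flag to an LWP control file (the
pid and LWP id are hypothetical; the message is the code followed by
the ulong_t flag word):

#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <sys/types.h>
#include <sys/procfs.h>

int main(void)
{
    char msg[sizeof(int) + sizeof(ulong_t)];
    int cmd = PCRUN;
    ulong_t flags = PRSTEP;   /* resume and execute one instruction */
    int fd = open("/proc/1234/lwp/1/lwpctl", O_WRONLY);

    memcpy(msg, &cmd, sizeof(cmd));
    memcpy(msg + sizeof(cmd), &flags, sizeof(flags));
    if (fd < 0 || write(fd, msg, sizeof(msg)) != sizeof(msg))
        return 1;   /* EBUSY if the LWP is not stopped on an event */
    close(fd);
    return 0;
}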
PCSTRACE
Introduction
PCSTRACE defines a set of signals to be traced in the process: the
receipt of one of these signals by an LWP causes the LWP to stop. The
set of signals is defined using an operand sigset_t contained in the
control message.
SIGKILL
Receipt of SIGKILL cannot be traced; if specified, it is silently ignored.
Held signals
If a signal that is included in a held signal set of an LWP is sent to the LWP,
the signal is not received and does not cause a stop until it is removed
from the held signal set, either by the LWP itself or by setting the held
signal set with PCSHOLD or the PRSHOLD option of PCRUN.
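A sketch tracing two signals; it assumes the operand sigset_t can be
built with the POSIX sigemptyset/sigaddset calls and that its in-memory
layout matches what the ctl file expects (pid hypothetical):

#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <signal.h>
#include <sys/procfs.h>

int main(void)
{
    char msg[sizeof(int) + sizeof(sigset_t)];
    int cmd = PCSTRACE;
    sigset_t set;
    int fd = open("/proc/1234/ctl", O_WRONLY);

    /* stop any LWP that receives SIGINT or SIGSEGV */
    sigemptyset(&set);
    sigaddset(&set, SIGINT);
    sigaddset(&set, SIGSEGV);
    memcpy(msg, &cmd, sizeof(cmd));
    memcpy(msg + sizeof(cmd), &set, sizeof(set));
    if (fd < 0 || write(fd, msg, sizeof(msg)) != sizeof(msg))
        return 1;
    close(fd);
    return 0;
}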
PCSSIG
Introduction
The current signal and its associated signal information for the
specific or chosen LWP are set according to the contents of the operand
siginfo structure (see sys/siginfo.h). If the specified signal number is
zero, the current signal is
cleared. An error (EBUSY) is returned if the LWP is not stopped on an
event of interest. The semantics of this operation are different from those
of kill(2), _lwp_kill(2), or PCKILL in that the signal is delivered to the LWP
immediately after execution is resumed (even if the signal is being held)
and an additional PR_SIGNALLED stop does not intervene even if the
signal is being traced. Setting the current signal to SIGKILL ends the
process immediately.
PCKILL, PCUNKILL
PCKILL
If applied to the process control file, a signal is sent to the process with
semantics identical to those of kill(2). If applied to an LWP control file, a
signal is sent to the LWP with semantics identical to those of _lwp_kill(2).
The signal is named in an operand int contained in the message. Sending
SIGKILL ends the process or LWP immediately.
PCUNKILL
A signal is deleted, that is, it is removed from the set of pending signals. If
applied to the process control file, the signal is deleted from the process’s
pending signals. If applied to an LWP control file, the signal is deleted from
the LWP’s pending signals. The current signal (if any) is unaffected. The
signal is named in an operand int in the control message. It is an error
(EINVAL) to attempt to delete SIGKILL.
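Because both the code and the operand are ints, a PCKILL message can be
a two-element int array. Sketch, sending SIGTERM to a hypothetical
process:

#include <fcntl.h>
#include <unistd.h>
#include <signal.h>
#include <sys/procfs.h>

int main(void)
{
    /* operation code followed immediately by the signal number */
    int msg[2] = { PCKILL, SIGTERM };
    int fd = open("/proc/1234/ctl", O_WRONLY);

    if (fd < 0 || write(fd, msg, sizeof(msg)) != sizeof(msg))
        return 1;
    close(fd);
    return 0;
}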
PCSHOLD
Introduction
Sets the held signal set for the specific or chosen LWP (signals whose
delivery will be delayed if sent to the LWP) according to the operand
sigset_t structure. SIGKILL and SIGSTOP cannot be held; if specified,
they are silently ignored.
PCSFAULT
Introduction
PCSFAULT defines a set of hardware faults to be traced in the process: on
incurring one of these faults an LWP stops. The set is defined via the
operand fltset_t structure.
Fault names
Some fault names may not occur on all processors; there may be
processor-specific faults in addition to these. Fault names include the
following:
Fault Name    Description
FLTILL        Illegal instruction
FLTPRIV       Privileged instruction
FLTBPT        Breakpoint trap
FLTTRACE      Trace trap
FLTACCESS     Memory access fault (bus error)
FLTBOUNDS     Memory bounds violation
FLTIOVF       Integer overflow
FLTIZDIV      Integer zero divide
FLTFPE        Floating-point exception
FLTSTACK      Unrecoverable stack fault
FLTPAGE       Recoverable page fault
When not traced, a fault normally results in the posting of a signal to the
LWP that incurred the fault. If an LWP stops on a fault, the signal is posted
to the LWP when execution is resumed unless the fault is cleared by
PCCFAULT or by the PRCFAULT option of PCRUN. FLTPAGE is an
exception; no signal is posted. There may be additional processor-specific
faults like this.
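A sketch arming a breakpoint-fault trace; premptyset() and praddset()
are assumed names for the fltset_t manipulation macros and should be
checked against <sys/procfs.h> (pid hypothetical):

#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <sys/procfs.h>

int main(void)
{
    char msg[sizeof(int) + sizeof(fltset_t)];
    int cmd = PCSFAULT;
    fltset_t faults;
    int fd = open("/proc/1234/ctl", O_WRONLY);

    premptyset(&faults);        /* assumed macro: clear the fault set */
    praddset(&faults, FLTBPT);  /* assumed macro: trace breakpoint traps */
    memcpy(msg, &cmd, sizeof(cmd));
    memcpy(msg + sizeof(cmd), &faults, sizeof(faults));
    if (fd < 0 || write(fd, msg, sizeof(msg)) != sizeof(msg))
        return 1;
    close(fd);
    return 0;
}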
pr_info field
The pr_info field in /proc/pid/status or in /proc/pid/lwp/lw#/lwpstatus
identifies the signal to be sent and contains machine-specific
information about the fault.

The remaining control messages are summarized below and described in the
sections that follow:

Control Message      Description
PCCFAULT             The current fault (if any) is cleared; the
                     associated signal is not sent to the specific or
                     chosen LWP.
PCSENTRY, PCSEXIT    These control operations instruct the process's
                     LWPs to stop on entry to or exit from specified
                     system calls.
PCSET                Sets one or more modes of operation for the traced
                     process.
PCRESET              Resets these modes. The modes to be set or reset
                     are specified by flags in an operand long in the
                     control message.
PCSREG               Sets the general registers for the specific or
                     chosen LWP according to the operand gregset_t
                     structure.
PCSFPREG             Sets the floating-point registers for the specific
                     or chosen LWP according to the operand fpregset_t
                     structure.
PCNICE               Sets the LWP's nice(2) priority.

PCCFAULT

The current fault (if any) is cleared; the associated signal is not sent
to the specific or chosen LWP.
PCSENTRY, PCSEXIT
These control operations instruct the process’s LWPs to stop on entry to or
exit from specified system calls. The set of system calls to be traced is
defined via an operand sysset_t structure.
When entry to a system call is being traced, an LWP stops after having
begun the call to the system but before the system call arguments have
been fetched from the LWP. When exit from a system call is being traced,
an LWP stops on completion of the system call just before checking for
signals and returning to user level. At this point all return values have been
stored into the LWP’s registers.
If an LWP is stopped on entry to a system call (PR_SYSENTRY) or when
sleeping in an interruptible system call (PR_ASLEEP is set), it may be
instructed to go directly to system call exit by specifying the PRSABORT
flag in a PCRUN control message. Unless exit from the system call is
being traced, the LWP returns to user level showing the error EINTR.
PCSET
PCSET sets one or more modes of operation for the traced process.
PCRESET
PCRESET resets these modes. The modes to be set or reset are
specified by flags in an operand long in the control message. The flags
are described below:
Flag          Description

PR_FORK (inherit-on-fork)
              When set, the tracing flags of the process are inherited
              by the child of a fork(2) or vfork(2). When reset, child
              processes start with all tracing flags cleared.

PR_RLC (run-on-last-close)
              When set and the last writable /proc file descriptor
              referring to the traced process or any of its LWPs is
              closed, all the tracing flags of the process are cleared,
              any outstanding stop directives are canceled, and if any
              LWPs are stopped on events of interest, they are set
              running as though PCRUN had been applied to them. When
              reset, the process's tracing flags are retained and LWPs
              are not set running on last close.

PR_KLC (kill-on-last-close)
              When set and the last writable /proc file descriptor
              referring to the traced process or any of its LWPs is
              closed, the process is exited with SIGKILL.

PR_ASYNC (asynchronous-stop)
              When set, a stop on an event of interest by one LWP does
              not directly affect any other LWP in the process. When
              reset and an LWP stops on an event of interest other than
              PR_REQUESTED, all other LWPs in the process are directed
              to stop.
EINVAL
It is an error (EINVAL) to specify flags other than those described above or
to apply these operations to a system process. The current modes are
reported in the pr_flags field of /proc/pid/status.
PCSREG
PCSREG sets the general registers for the specific or chosen LWP
according to the operand gregset_t structure. There may be
machine-specific restrictions on the allowable set of changes. PCSREG
fails (EBUSY) if the LWP is not stopped on an event of interest.
PCSFPREG
PCSFPREG sets the floating-point registers for the specific or chosen
LWP according to the operand fpregset_t structure. An error (EINVAL) is
returned if the system does not support floating-point operations (no
floating-point hardware and the system does not emulate floating-point
machine instructions). PCSFPREG fails (EBUSY) if the LWP is not
stopped on an event of interest.
PCNICE
The traced (or chosen) LWP's nice(2) priority is incremented by the
amount contained in the operand int. Only the superuser may improve an
LWP's priority in this way, but any user may make the priority worse.
This operation is significant only when applied to an LWP in the
time-sharing scheduling class.
Directories
Object directory
The object directory contains read-only files with names as they appear in
the entries of the map file, corresponding to objects mapped into the
address space of the target process. Opening such a file yields a
descriptor for the mapped file associated with a particular address-space
region. The name a.out also appears in the directory as a synonym for the
executable file associated with the ‘‘text’’ of the running process.
The object directory makes it possible for a controlling process to get
access to the object file and any shared libraries (and consequently the
symbol tables)--in general, any mapped files--without having to know the
specific path names of those files.
lwp directory
The lwp directory contains entries, each of which names an LWP within
the containing process. These entries are directories containing
additional files, described in the lwp/lwpctl and lwp/lwpstatus sections
above.
Code Example
Introduction
The following code is a simple example of how one process can use the
/proc filesystem to access the address space of another. Provided with a
single argument (the id of a currently running process), it prints the
process name and initial arguments from the psinfo structure.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/procfs.h>

int main(int argc, char **argv)
{
    char fname[512];
    struct psinfo p;
    int fd;

    /* check for an argument */
    if (argc != 2)
        exit(1);
    sprintf(fname, "/proc/%s/psinfo", argv[1]);

    /* check that the process is still running */
    if (access(fname, F_OK) < 0)
        exit(1);

    if ((fd = open(fname, O_RDONLY)) < 0)
        exit(1);
    if (read(fd, &p, sizeof(struct psinfo)) != sizeof(struct psinfo))
        exit(1);
    printf("process pid %s: exec path/args: %s %s\n",
           argv[1], p.pr_fname, p.pr_psargs);
    close(fd);
    return 0;
}