Operational Considerations and Troubleshooting for Oracle Enterprise Manager 12.1.0.4
ORACLE WHITE PAPER | OCTOBER 2014
OPERATIONAL CONSIDERATIONS AND TROUBLESHOOTING FOR ORACLE ENTERPRISE MANAGER 12.1.0.4
Contents
Introduction
Infrastructure Components
Oracle Management Service (OMS)
Systems and Services
Oracle Management Agent
Oracle Management Repository
Oracle Management Plug-ins
Enterprise Manager Cloud Control Console
EM CLI
Diagnostic Tools
EMDIAG
Best Practices Configuration
Staffing Recommendations
Administrator Responsibilities
Maximum Availability
Oracle Management Service Backups
Management Repository Backups
Management Agent Backups
Increased High Availability and Disaster Recovery Options
Notifications
Out-of-Band Notifications
Patching
Agent Patching
Repository Patching
OMS Patching
Plug-ins
Audit Log Data
Maintaining Enterprise Manager
Availability
Oracle Management Service
Repository Database
Agent Availability
General Availability
EM Internal Subsystems
DBMS Scheduler
Database Advanced Queuing (AQ)
Notification Subsystem
Task Subsystem
EM Job System
Agent Health
Events and Incidents
Log & Trace Files
Incident Files
OMS Incident Files
Agent Incident Files
Troubleshooting
Conclusion
Introduction
There are many areas that need to be discussed when talking about managing Enterprise Manager in
a data center. Some of these are as follows:
» Recommendations for staffing roles and responsibilities for EM administration
» Understanding the components that make up an EM environment
» Backing up and monitoring EM itself
» Maintaining a healthy EM system
» Patching the EM components
» Troubleshooting and diagnosing guidelines
This whitepaper will help define administrator requirements and responsibilities, and guide you in
setting up the proper monitoring and maintenance activities to keep Oracle Enterprise Manager 12c
healthy.
Infrastructure Components
Oracle Management Service (OMS)
The Oracle Management Service performs several important tasks in an EM environment. It is the
web-based application that communicates with the Oracle Management Agents and Oracle
Management Plug-ins to discover, monitor and manage targets as well as store the information in the
Oracle Management Repository. It is also responsible for running the user interface for the Enterprise
Manager Cloud Control Console.
Systems and Services
In EM, an application can be modeled as a service that runs on a group of targets called a system. A
system is created to define the infrastructure required to host a specific application. Then, the
application can be defined as a service allowing monitoring and management of the application. Out
of the box, the EM components are combined into a system called “Management Services and
Repository”. Services have been created on this system for specific functions within EM itself as
described below.
EM Jobs Service
The EM Jobs Service is a service using the Management Services and Repository system and
consists of all components required for the EM jobs to function properly. The availability of the EM
Jobs System as a whole depends on the availability of each of the underlying components defined in
this service.
EM Console Service
The EM Console Service is a service using the Management Services and Repository system and
consists of all components required for the EM Console to function properly. The availability of the EM
Console System as a whole depends on the availability of each of the underlying components defined
in this service as well as a defined “EM Console Service Test” and the “EM Management Beacon”.
Oracle Management Agent
The Oracle Management Agent is deployed on each host to be managed by an EM environment. It is
responsible for managing and monitoring all of the targets on that host (including the host itself) and
communicating all information to the Oracle Management Service.
Oracle Management Repository
The Oracle Management Repository is used for storing all of the data received from the Oracle
Management Agents. It organizes the data so that the Oracle Management Service can retrieve it and
display it in the Enterprise Manager Cloud Control Console.
Oracle Management Plug-ins
The core Enterprise Manager Cloud Control features for managing and monitoring the different Oracle
components are now provided via separate components called plug-ins. This allows EM to be updated
to support the latest releases of one or more managed products without having to upgrade to a later
Cloud Control release, which gives EM a more “pluggable” framework.
Enterprise Manager Cloud Control Console
The Enterprise Manager Cloud Control Console is the user interface that provides one central location
for monitoring and administering an entire environment.
Below is a picture of a typical environment showing how the above components interact.
Figure 1: EM Components
EM CLI
EM CLI is the Enterprise Manager Command Line Interface. It can be executed interactively from an
operating system console, and it also allows administrators to run many EM commands from scripts,
so that customers can create workflows based on their business needs. Using this interface, you can
do many things, such as managing credentials, defining service targets and templates, and setting up
incidents. For more information about using EM CLI refer to the Oracle Enterprise Manager
Cloud Control Documentation.
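As an illustration of scripted use, the following is a minimal shell sketch of an EM CLI session that logs in, refreshes the client, lists targets and logs out. The verbs login, sync, get_targets and logout are standard EM CLI verbs; the variable names EMCLI, EM_USER and DRY_RUN, and the dry-run behaviour itself, are assumptions made for this example and are not part of EM.

```shell
#!/bin/sh
# Sketch only: a scripted EM CLI session.
# EMCLI, EM_USER and DRY_RUN are illustrative names, not EM settings.
EMCLI="${EMCLI:-emcli}"        # path to the emcli client
EM_USER="${EM_USER:-sysman}"   # EM administrator used for the session
DRY_RUN="${DRY_RUN:-1}"        # 1 = print the commands instead of running them

# Run one EM CLI verb, or print it when in dry-run mode.
em() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$EMCLI $*"
    else
        "$EMCLI" "$@"
    fi
}

list_targets() {
    em login -username="$EM_USER" || return 1   # prompts for the password
    em sync || return 1                         # synchronize the client with the OMS
    em get_targets                              # list targets and their status
    rc=$?
    em logout
    return $rc
}

list_targets
```

With the default DRY_RUN=1 the script only prints the four emcli commands, so it can be reviewed before it is run against a live OMS with DRY_RUN=0.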
Diagnostic Tools
EMDIAG
The EMDIAG Toolkit is a set of utilities that collect data from Cloud Control OMS, Repository and
Agents to assist in troubleshooting and maintenance. EMDIAG consists of REPVFY, OMSVFY and
AGTVFY Tools. Many of the recommendations in this whitepaper will utilize the EMDIAG tools. See
EMDIAG Troubleshooting Kits Master Index [421053.1] for more information.
REPVFY
The EMDIAG REPVFY 12c kit is designed to collect data from a Cloud Control Management
Repository 12c to assist in the diagnosis and correction of Cloud Control issues. For detailed
installation instructions see EMDIAG REPVFY Kit for Cloud Control 12c - Download, Install/De-Install
and Upgrade [ID 1426973.1]. For details on utilizing REPVFY see EMDIAG Repvfy 12c Kit - How to
Use the Repvfy 12c kit [ID 1427365.1].
OMSVFY
OMSVFY is installed on each OMS server and collects data on the OMS configuration and patches.
There are also several utilities available to help in searching log files, zipping the files for transfer to
support, and identifying trouble areas on the OMS. See note EMDIAG Omsvfy 12c Kit - Download and
Install [ID 1374450.1] for detailed installation instructions.
AGTVFY
AGTVFY is installed on each Agent server. This is a good component to become familiar with
and use when troubleshooting agent issues. For detailed install instructions see EMDIAG Agtvfy 12c
Kit - Download and Install [ID 1374441.1].
Best Practices Configuration
Enterprise Manager 12c Cloud Control is an enterprise application that manages and monitors the
infrastructure in your environment as well as the applications running on top of that infrastructure. The
system itself requires some care and feeding to ensure that it is performing properly and that the data
available is timely and accurate. One of the most common questions is who should manage EM and
how much effort will it require. This all depends on what functions you plan to leverage, how critical
the targets are, and the size of the environment.
Staffing Recommendations
Because EM is a very broad application in its own right, the recommendation is to have at least 2 people
trained and responsible for managing EM who know the system very well and maintain its health.
Depending on the size and scope of your environment, this may be 2-4 people who spend 25-50% of
their time on EM. This ensures backup coverage during vacation or extended illnesses. Someone
with knowledge of Oracle Database and WebLogic Server is extremely helpful as these are the main
backbones of EM; however they also need to understand your entire enterprise. Integration into
authentication and ticketing systems, placement in network/firewall rules, configuration of the Software
Load Balancer, segregation between support groups and organizations are all areas where the EM
Administrator will be required to interface during initial setup and continued operations. For further
details on EM best practices, refer to the note Oracle Enterprise Manager 12.1.0.4 Configuration Best
Practices [1929586.1].
Administrator Responsibilities
Implementing EM and managing an enterprise will require involvement from various teams.
Companies divide the roles and responsibilities differently based on the size of the implementations
and the different data center responsibilities. There needs to be a well defined, agreed upon list of
tasks that identifies the individual or team responsible for particular tasks. This is often referred to as
a RACI diagram (Responsible, Accountable, Consulted and Informed). The EM Administrator should
own architecture and installation, overall agent deployment procedures, agent patching procedures,
OMS patching and user administration. It is also important for the EM Administrator to know the
baseline functionality and performance of their EM environment to more easily identify existing or
pending problems. Knowing the baseline environment consists of two items. The first item is to
understand and document the architecture of the environment (i.e. topology, key components). This
will help in understanding the impact of any architecture change. The second item is for the EM
Administrator to understand the normal baseline operations of the environment. This consists of
understanding the environment and the expected load (i.e. how much data to expect in a day). Things
like deploying agents, discovering targets, solving agent issues and solving target availability can all be
delegated to target owners. The RACI diagram below is an example of defining this responsibility and
is a starting point for your organization to define the roles and responsibilities in your environment even
if multiple roles are performed by the same person.
TABLE 1: ENTERPRISE MANAGER 12C RACI
Task | Responsible | Accountable
Define Monitoring Requirements | Target Owners, Infrastructure Teams, EM Admin | EM Admin
Installation planning and architecture | EM Admin | EM Admin
Installation and Configuration of EM | EM Admin | EM Admin
Consulted
Target Owners, Infrastructure Teams
Informed
Defining Agent deployment and patching procedures and processes | EM Admin
Security and User Administration | EM Admin/Security Admin
Admin Group Creation | EM Admin
Agent Deployment (can be performed by target owners) | Target Owners
Agent Patching (can be performed by target owners) | Target Owners
Target Configuration and Availability | Target Owners
Agent Troubleshooting | Target Owners, EM Admin
Target Troubleshooting | Target Owners
Weekly/Monthly/Quarterly Maintenance | EM Admin
OMS Patching | EM Admin
EM Admin
Target Owners
EM Admin
EM Admin
Target Owners
Target Owners
EM Admin
Target Owners
EM Admin
Target Owners
EM Admin
Target Owners
EM Admin
EM Admin
Target Owners
EM Admin
Target Owners
Maximum Availability
Since EM plays an important role in managing and monitoring the enterprise environment, it is
important to ensure that the environment is configured for maximum availability. This includes regular
backups as well as architecting the environment for disaster recovery. The Oracle Enterprise Manager
Cloud Control Advanced Installation and Configuration Guide provides details on backing up the
Enterprise Manager environment. As part of an overall backup strategy, it is important to take regular
backups as well as backups before any patching or plugin update is applied for the following:
Oracle Management Service Backups
Backups for the OMS should consist of the following:
» Software Homes: filesystem level backup of the software homes and the Oracle inventory files whenever
patches or patchsets are applied
» Instance Homes/Administration Server/OMS Configuration: all of this information can be backed up by
issuing the emctl exportconfig oms command on each of the OMS servers.
Refer to the Oracle Enterprise Manager Cloud Control Advanced Installation and Configuration Guide
for further details on backing up the OMS server(s).
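The configuration export described above can be sketched as a small shell script to be run on each OMS host. The emctl exportconfig oms command and its -dir option are the documented interface; the variable names OMS_HOME, BACKUP_ROOT and DRY_RUN, the default paths, and the dated directory layout are assumptions chosen for this example.

```shell
#!/bin/sh
# Sketch only: export the OMS configuration to a dated backup directory.
# OMS_HOME, BACKUP_ROOT and DRY_RUN are illustrative names, not EM settings.
OMS_HOME="${OMS_HOME:-/u01/app/oracle/middleware/oms}"   # hypothetical OMS home
BACKUP_ROOT="${BACKUP_ROOT:-/backup/em/oms_config}"      # hypothetical backup location
DRY_RUN="${DRY_RUN:-1}"                                  # 1 = only print the command

export_oms_config() {
    # One directory per host and per day keeps exports from different OMSes apart.
    backup_dir="$BACKUP_ROOT/$(uname -n)/$(date +%Y%m%d)"
    if [ "$DRY_RUN" = "1" ]; then
        echo "$OMS_HOME/bin/emctl exportconfig oms -dir $backup_dir"
        return 0
    fi
    mkdir -p "$backup_dir" || return 1
    # emctl prompts for the SYSMAN password when it is not supplied.
    "$OMS_HOME/bin/emctl" exportconfig oms -dir "$backup_dir"
}

export_oms_config
```

Because the export prompts for the SYSMAN password, a scheduled version of this script needs a site-approved way of supplying it; that choice is left open here. The software homes and the Oracle inventory still need a separate filesystem-level backup.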
Management Repository Backups
The backup strategies for the repository are the same as for the Oracle Database. This includes
having the database in archivelog mode and performing regular hot backups with RMAN which
consists of a full backup and then incremental backups. EM provides a simple way to set up database
backups via the option for Oracle-suggested backups. This backup strategy creates a full database
backup on the first run, followed by an incremental backup on each subsequent run. The full backup
is then recovered (rolled forward) using these incremental backups, thus creating a new full backup
baseline. For further detail on the setup of Oracle-suggested backups, refer to the Oracle Database
2 Day DBA 11g Release 2 (11.2) document. The steps for configuring the backup in EM are documented
below.
1. Click on Targets / Databases. Select the EM Repository database.
Figure 1: Databases
2. From the database home page, click on Availability / Backup & Recovery / Schedule Backups…
3. On the Schedule Backup page, select the proper login credentials for the database owner under the
   Host Credentials section and then click on the push button Schedule Oracle-Suggested Backup.
Figure 2: Schedule Backup
4. Select the destination media for the backup and click Next.
Figure 3: Backup Location
5. Set the backup settings for the backup based on the destination chosen above (a disk backup was
   selected for this example). Click Next.
Figure 4: Oracle Suggested Backup
6. Select the day and time to start the backups. Click Next.
Figure 5: Backup Schedule
7. Review the backup details and if the information is correct, click Submit Job.
Figure 6: Backup Review
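For administrators who prefer to script the backup, the strategy described above (a full backup that is rolled forward with incremental backups) corresponds to RMAN's incrementally updated backup technique. The sketch below is not the job that EM generates; it is an assumed equivalent. The tag name em_repo_incr and the variables BACKUP_TAG and DRY_RUN are hypothetical, and the script assumes the environment (ORACLE_HOME, ORACLE_SID) already points at the repository database.

```shell
#!/bin/sh
# Sketch only: incrementally updated backup of the repository database.
# BACKUP_TAG and DRY_RUN are illustrative names; the tag value is hypothetical.
BACKUP_TAG="${BACKUP_TAG:-em_repo_incr}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the RMAN script instead of running it

# Emit the RMAN commands. On the first run no image copy exists yet, so
# RECOVER COPY has nothing to apply and BACKUP creates the full image copy.
# On later runs BACKUP takes a level 1 incremental, and the RECOVER COPY
# of the following run rolls it into the copy.
rman_script() {
    cat <<EOF
RUN {
  RECOVER COPY OF DATABASE WITH TAG '$BACKUP_TAG';
  BACKUP INCREMENTAL LEVEL 1 FOR RECOVER OF COPY WITH TAG '$BACKUP_TAG' DATABASE;
}
EOF
}

run_backup() {
    if [ "$DRY_RUN" = "1" ]; then
        rman_script
    else
        rman_script | rman target /
    fi
}

run_backup
```

The script writes to RMAN's configured default disk destination. Archived redo log backups and retention settings are not covered by this sketch and should follow the site's existing RMAN configuration.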
Management Agent Backups
For the management agent, a reference agent should be maintained and kept current with patches so
that if a management agent is lost, it can be reinstalled via cloning of this reference agent. Starting
with the EM 12cR3 release, there is a new option available which will allow for the creation of a custom
‘package’ for an Agent-side plugin that includes any required patches and updates. With this feature,
each deployment of that plugin to an Agent will deploy the updated version of that plugin. To create
an updated/revised Agent-side plugin, follow these steps (for more details on this process, refer to the
Oracle Enterprise Manager Cloud Control Administrator’s Guide):
1. Update and patch one agent with all of the required changes.
2. Run this EMCLI command to create the custom plugin version based on this modified agent:
   $ emcli create_custom_plugin_update \
         -agent_name="<patched agent name>" \
         -plugin_id="<internal ID of the plugin>"
3. To get the list of plugins and their IDs for an Agent, use this EMCLI command:
   $ emcli list_plugins_on_agent -agent_names="<patched agent name>"
Once this custom plugin is created, any deployment of that plugin (at that version) to an Agent will
deploy the custom updated plugin.
Increased High Availability and Disaster Recovery Options
As the importance of Enterprise Manager grows, so do the availability requirements. For some
customers, it is just not enough to have a single OMS monitoring their entire database or WebLogic
infrastructure. There are additional HA configurations available to meet specific business
requirements. The table below details the different degrees of high availability that can be
implemented for Oracle Enterprise Manager. Additional information on High Availability configurations
can be found in the Enterprise Manager Cloud Control 12c Advanced Installation and Configuration
Guide.
TABLE 2: HIGH AVAILABILITY CONFIGURATIONS
Level | Description | Minimum Nodes | Recommended Nodes | Load Balancer Requirements
Level 1 | OMS and Repository database each reside on their own host, no failover | 1 | 2 | None
Level 2 | OMS installed on shared storage with VIP based failover. Database replicated with Data Guard | 2 | 4 | None
Level 3 | OMS in Active/Active configuration. Database is using RAC + Data Guard | 3 | 5 | Local load balancer
Level 4 | OMS on the primary site in Active/Active Configuration. Repository deployed using Oracle RAC. Duplicate hardware deployed at the standby site. DR for OMS and Software Library using Storage Replication between primary and standby sites. Database DR using Oracle Data Guard. | 4 | 8 | Required: Local load balancer for each site. Optional: Global load balancer
Note: Level 4 is a MAA Best Practice, achieving highest availability in the most cost effective,
simple architecture.
Notifications
To properly monitor your EM environment, you need to receive notifications on events, incidents and
problems that occur on the infrastructure components. In addition to your standard notifications for
Database, FMW and Host targets, Oracle recommends you set up notifications for the EM
infrastructure. To receive notifications on the OMS and Repository components that make up your
EM infrastructure, create an Incident Rule Set specifically for these targets. The steps to do this are
detailed in the section Setting Up Your Incident Management Environment of the Administrator’s
Guide. The best practice is to create a rule set for incoming Events on the OMS and Repository
target that creates an incident and sends a notification (via e-mail, ticket or SNMP traps) to the EM
Administrators for the categories listed below. The OMS and Repository target is an internal target
type that will contain all of the EM components such as the infrastructure hosts, repository database,
listeners, management services, etc. For the steps on how to create this rule set, refer to the My
Oracle Support (MOS) note Oracle Enterprise Manager 12.1.0.4 Configuration Best Practices
[1929586.1].
TABLE 3: INCIDENT RULE RECOMMENDATIONS
Category | Filters | Actions
Metric Alert | Severity in Critical, Warning | E-mail/Ticket EM Administrators
Metric Alert | All | If event open > 7 days, clear the event
Target Unreachable | Target Availability (Agent, Host) | E-mail/Ticket EM Administrators
Target Down | Target Availability | E-mail/Ticket EM Administrators
High Availability | Severity in Critical | E-mail/Ticket EM Administrators
Target Error | Target Availability | E-mail/Ticket EM Administrators
Out-of-Band Notifications
Out-of-Band Notifications for Enterprise Manager 12c can be configured to send an email or trigger a
script when certain fatal conditions occur. This then allows the EM administrator to receive
notifications when there is a failure in an EM component. The notification is triggered in the following
scenarios:
» single OMS environment, if the OMS is down, but the Agent is up
» multi-OMS environment, if all OMSes are down, but the Agent is up
» if Repository database is unavailable (down, archive hung, listener down, etc)
Configure Out-of-Band Notifications by following the steps in note How To Setup Out Of Bound Email
Notification In 12c [1472854.1].
Patching
As with any application, regular patch maintenance is key. The recommended patches for Enterprise
Manager Base OMS, Agent and various Plugins can be found and downloaded from My Oracle
Support. Note that when searching for patches using the Recommended Patch Advisor, make sure
you enter “Enterprise Manager 12.1.0.4.0” for the product to see the patches for the 12.1.0.4 version.
Oracle recommends setting up a planned maintenance window for the EM environment. This window
would provide time for regular patching and activities that may require downtime (e.g. plugin updates).
A good recommendation is to schedule this planned maintenance on a quarterly basis and to check for
the latest recommended patches at this same time (may vary according to the requirements of the
individual companies). Note that the patching for the different components (e.g. agent) may be
performed by different people or groups within your organization based on the roles and
responsibilities as mentioned in Table 1: Enterprise Manager 12c RACI above. For additional
information on guidelines for patching an Enterprise Manager environment, refer to these white
papers: Reducing Downtime While Patching Multi-OMS Environments and Oracle Enterprise Manager
Software Planned Maintenance.
Agent Patching
Keeping the Enterprise Manager Agent patched is critical to efficient and accurate
monitoring, as the collection scripts reside in the agent. Using the automated patching feature in
Enterprise Manager, it is possible to create a patch plan from tested and approved agent patches and
deploy it to many agents at one time or in batches. Recommended patches can be found by clicking
Enterprise / Provisioning & Patching / Patches & Updates and selecting the Recommended Patch
Advisor. Select Enterprise Manager Base Platform – Agent for the product, and the correct
Release and Platform. Version 12.1.0.4 was used while testing this. For version 12.1.0.3, Normal
Oracle Home preferred credentials must be set (or overridden during patching) for all Agent targets
that will be patched via EM. In 12.1.0.4, the Agent uses its internal credentials to patch itself, so
setting preferred credentials or specifying them at run time is no longer required. Privileged
credentials still need to be provided for any patch or upgrade that requires execution of the root.sh
script if you want EM to execute it as part of the patch application. The user performing the patching
requires the Manage Target Patch and Patch Plan privileges. Full step-by-step instructions can be
found in the Oracle Enterprise Manager Cloud Control Administrator’s Guide.
Repository Patching
The recommended Database patches can be found on My Oracle Support under Patches & Updates by
selecting the Recommended Patch Advisor and then selecting Oracle Database for the product and the
appropriate Release and Platform.
OMS Patching
For the OMS, patches must be manually applied with OPatch or OPatchauto. Some patches require
all OMS servers to be down during the application of any post-patch scripts. In multi-OMS
environments, it is possible to shorten the patching cycle by following the procedure below:
1. Shut down the first OMS.
2. Apply the patch.
3. Shut down the remaining OMSes.
4. Run the post-patch scripts.
5. Restart the first OMS to reduce downtime.
6. Patch the remaining OMS servers and then restart them.
For further details on OMS patching see the Oracle Enterprise Manager Cloud Control Administrator’s
Guide. Oracle is now creating rolling OMS patches, which provide even higher availability because the
OMSes do not all have to be shut down at once; the patch can be applied to one OMS at a time. Not all
patches can be applied in a rolling fashion, so it is important to check the individual patch README.txt file.
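The ordering of the six-step multi-OMS procedure above can be sketched as a shell script that prints the plan by default. The emctl stop oms, emctl start oms and opatchauto apply commands are the ones named in this section; the host names, the paths, the use of ssh, and the variable names DRY_RUN, OMS_HOME and PATCH_DIR are assumptions for illustration. The exact opatchauto options and the post-patch scripts are patch-specific and must be taken from the patch README.

```shell
#!/bin/sh
# Sketch only: ordering of a non-rolling patch across several OMS hosts.
# DRY_RUN, OMS_HOME and PATCH_DIR are illustrative names; paths are hypothetical.
DRY_RUN="${DRY_RUN:-1}"                                   # 1 = print the plan only
OMS_HOME="${OMS_HOME:-/u01/app/oracle/middleware/oms}"
PATCH_DIR="${PATCH_DIR:-/stage/patches/oms_patch}"

# Run a command on a host over ssh, or just print it in dry-run mode.
run_on() {
    host="$1"; shift
    if [ "$DRY_RUN" = "1" ]; then
        echo "[$host] $*"
    else
        ssh "$host" "$*"
    fi
}

patch_multi_oms() {
    first="$1"; shift    # the remaining arguments are the other OMS hosts
    run_on "$first" "$OMS_HOME/bin/emctl stop oms" || return 1                          # step 1
    run_on "$first" "cd $PATCH_DIR && $OMS_HOME/OPatch/opatchauto apply" || return 1    # step 2
    for h in "$@"; do                                                                   # step 3
        run_on "$h" "$OMS_HOME/bin/emctl stop oms" || return 1
    done
    echo "[$first] run the post-patch scripts listed in the patch README"               # step 4 (manual)
    if [ "$DRY_RUN" != "1" ]; then
        printf 'Press Enter once the post-patch scripts have completed: '
        read -r _answer
    fi
    run_on "$first" "$OMS_HOME/bin/emctl start oms" || return 1                         # step 5
    for h in "$@"; do                                                                   # step 6
        run_on "$h" "cd $PATCH_DIR && $OMS_HOME/OPatch/opatchauto apply" || return 1
        run_on "$h" "$OMS_HOME/bin/emctl start oms" || return 1
    done
}

patch_multi_oms oms1.example.com oms2.example.com
```

The point of the ordering is that the first OMS is back in service before the remaining servers are patched, so the full outage is limited to steps 3 through 5.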
Plug-ins
To make the Enterprise Manager 12c framework extensible, the plug-ins contain all the binaries
needed for specific components; therefore each plug-in has its own ORACLE_HOME on the OMS and
sometimes the Agent. For example, a database plug-in is deployed on the OMS and Agent. The
scripts that collect metrics from the database reside in the plug-in home. There will be plug-in specific
patches for these components. They can be found in My Oracle Support by looking for Enterprise
Manager for Oracle Database or Enterprise Manager for Fusion Apps, etc. These patches also
require that the OMS be shut down during patching, so it is a good idea to combine them in the same
patching window as any OMS patch requiring downtime.
Starting with 12.1.0.4, the individual OMS-side plug-in bundles are being grouped into a System Patch
each month. So for example, in June 2014 the System patch includes MOS, Cloud, DB, FA, FMW,
SMF, and Siebel plug-ins. Non-required patches will be skipped during the application of the patch.
For more details on plug-ins and how to maintain them, see the Oracle Enterprise Manager Cloud
Control Administrator’s Guide. For more information on the EM patch bundles and patching EM, see
Enterprise Manager 12.1.0.4.0 (PS3) Master Bundle Patch List (Doc ID 1900943.1).
Figure 7: Patch Advisor
Audit Log Data
Oracle always audits certain operations regardless of the database audit settings. This is referred to
as Mandatory Auditing and the audit records are written to the operating system in the destination
specified by the initialization parameter AUDIT_FILE_DEST.
Mandatory auditing includes these operations:
» Database startup
» SYSDBA and SYSOPER logins
» Database shutdown
The OMS servers have an agent that resides on each of them. This agent logs into the repository
every few minutes for self monitoring thereby causing an audit record for each login. Therefore, it is
very important that the audit records are regularly archived and purged. The steps for doing this may
vary according to a company’s security requirements but a sample setup is provided below.
Archive the audit data. Archiving of the mandatory audit records from the operating system can be
done via Oracle Audit Vault or tape/disk backups. For further details on using Oracle Audit Vault, refer
to Oracle Audit Vault Administrator's Guide.
Purge the records. This can be done manually or via a purge job that performs the purge at a
specified time interval. The recommendation is to set up a purge job, as shown in the example
below. Note that purging a large audit trail can take time to
complete so it is wise to schedule the job so that it runs during a time when the database is not too
busy. For further details on the process and an explanation for each parameter used in the example,
refer to the Oracle Database Security Guide.
1. Initialize the audit trail cleanup operation.
   SQL> BEGIN
          DBMS_AUDIT_MGMT.INIT_CLEANUP(
            AUDIT_TRAIL_TYPE         => DBMS_AUDIT_MGMT.AUDIT_TRAIL_ALL,
            DEFAULT_CLEANUP_INTERVAL => 12);
        END;
        /
2. Set up an archive timestamp for the audit records. The RAC_INSTANCE_NUMBER refers to the instance
   number when using a RAC database. This must be set for each instance in a RAC database since the
   mandatory audit records are stored on the operating system and therefore separately for each instance.
   SQL> BEGIN
          DBMS_AUDIT_MGMT.SET_LAST_ARCHIVE_TIMESTAMP(
            AUDIT_TRAIL_TYPE    => DBMS_AUDIT_MGMT.AUDIT_TRAIL_OS,
            LAST_ARCHIVE_TIME   => TO_DATE('2013-07-29 09:00:00','YYYY-MM-DD HH24:MI:SS'),
            RAC_INSTANCE_NUMBER => 1);
        END;
        /
3. Create and schedule the purge job.
   SQL> BEGIN
          DBMS_AUDIT_MGMT.CREATE_PURGE_JOB(
            AUDIT_TRAIL_TYPE           => DBMS_AUDIT_MGMT.AUDIT_TRAIL_ALL,
            AUDIT_TRAIL_PURGE_INTERVAL => 12,
            AUDIT_TRAIL_PURGE_NAME     => 'Standard_Audit_Trail_Cleanup',
            USE_LAST_ARCH_TIMESTAMP    => TRUE);
        END;
        /
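After the three steps above, it can be useful to confirm what was configured. The following shell sketch queries the DBA_AUDIT_MGMT_CLEANUP_JOBS and DBA_AUDIT_MGMT_LAST_ARCH_TS dictionary views through SQL*Plus. The variable DRY_RUN and the function names are assumptions for this example, and the script assumes it runs as the database software owner with the environment set for the repository database.

```shell
#!/bin/sh
# Sketch only: show the configured audit purge jobs and archive timestamps.
# DRY_RUN and the function names are illustrative.
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the SQL instead of running it

audit_check_sql() {
    cat <<'EOF'
SET LINESIZE 200 PAGESIZE 100
-- Purge jobs created with DBMS_AUDIT_MGMT.CREATE_PURGE_JOB
SELECT job_name, job_status, audit_trail, job_frequency
  FROM dba_audit_mgmt_cleanup_jobs;
-- Archive timestamps set with DBMS_AUDIT_MGMT.SET_LAST_ARCHIVE_TIMESTAMP
SELECT audit_trail, rac_instance, last_archive_ts
  FROM dba_audit_mgmt_last_arch_ts;
EXIT
EOF
}

check_audit_cleanup() {
    if [ "$DRY_RUN" = "1" ]; then
        audit_check_sql
    else
        audit_check_sql | sqlplus -s "/ as sysdba"
    fi
}

check_audit_cleanup
```

Because the purge job is created with USE_LAST_ARCH_TIMESTAMP set to TRUE, it only removes records older than the archive timestamp. The second query therefore shows the cut-off the job is currently working with; if that timestamp is never moved forward, the job has nothing new to purge.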
Maintaining Enterprise Manager
To ensure Enterprise Manager is configured and optimized properly, implementation planning should
take into account the sizing recommendations provided in the Oracle Enterprise Manager Cloud
Control Advanced Installation and Configuration Guide. Sizing is based on a combination of number of
agents, targets and concurrent users. After implementation, review the system sizing and usage on a
regular basis to account for system growth. Frequently review updates in Information Center:
Enterprise Manager Base Platform Release Cloud Control 12c [ID 1379818.2] to look for updates,
patches or known bugs that should be addressed.
The OMS servers process incoming and outgoing tasks. The incoming tasks are telemetry data and
alert information coming in from the agents. A problem occurs if there is more data coming in than the
network can handle. This is seen via the following:
» Network statistics (bandwidth/IO throughput/collisions)
» Loader backlog
» Job backlog (only if there is a backlog and a low number of available threads)
The outgoing tasks are created when the OMS sends requests out to the agents (config updates and
job/tasks to perform on the managed targets) and when the OMS processes and sends out the
notifications. To detect if an OMS server is having a bottleneck with outgoing tasks, look for the
following:
» A job backlog even though a significant number of jobs are being processed (sufficient throughput)
» A notification backlog even though a significant number of notifications are being sent out
An additional OMS may need to be added into an environment based on the following situations. Note
that the more incoming/outgoing stress on the system, the more likely the need for an additional OMS.
» load (number of agents and number of Admins) and whether or not that load is increasing
» backlog for incoming or outgoing tasks (as discussed above)
In addition to proper sizing and configuration, there are a few areas that should be checked on a
regular basis using the EM Cloud Control Console itself as well as EMDIAG. Both of these tools
provide a good way to make sure any issues that occur in the EM components can be identified and
resolved. Below are the recommended tasks and frequency to maintain a healthy Enterprise Manager
environment. The need to review the daily tasks should lessen as proper notifications and incidents
are set up and the EM Admin has established a good baseline and understanding of the data
components.
TABLE 4: RECOMMENDED MAINTENANCE TASKS
Task
Daily
Review critical EM component availability
X
Review events, incidents and problems for EM related infrastructure
X
Review overall health of the system including the job system, backlog, load,
notifications and task performance
X
Review Agent issues for obvious problems (i.e. large percentage of agents
with an unreachable status)
X
Biweekly
Review Agent issues (deeper /more detailed review of agents with
consistent or continual problems)
X
Review metric trending for anything out of bounds
X
Evaluate database (performance, sizing, fragmentation)
Monthly
Quarterly
X
Check for updates in Self Update (plug-ins, connectors, agents, etc.) Note
that there is an out-of-box ruleset that will provide notification for the
availability of new updates.
X
Check for recommended patches
X
Availability
When confirming the health of the EM 12c environment, the first place to start is to verify the status of
the key components that make up this environment. Enterprise Manager is dependent upon many
components for a complete working system. The Repository database, OMS, Console and PBS
services, and WebLogic servers all have to be available for EM to function properly. A key component
that is down could impact performance as well as availability. The goal is to keep the infrastructure
components in an available status and to resolve any critical errors occurring in each one.
Oracle Management Service
The Management Services page provides a more detailed status of the OMS services. In Cloud
Control, click on Setup / Manage Cloud Control / Management Services.
Figure 8: Manage Cloud Control Management Services
The figures above show the information about the Management Services running in Normal mode and
in Stand-by mode (if applicable). In version 12.1.0.4, this page will show additional details such as a
summary of the job executions, graphs showing loader performance and any open incidents. Verify
that the Management Services in Normal mode show an Up status, including the status of the Console
and Platform Background Service (PBS) for each Management Service.
Repository Database
Verify the status of the Repository database and underlying instances in the case of a RAC database.
Click on Setup / Manage Cloud Control / Repository. Under the Repository Details section, click
on the name of the Database or Cluster Database.
Figure 9: Manage Cloud Control Repository
In the case of a standalone database, the Status section will show the Up Time for the database. On
the target menu bar, click on Availability / High Availability Console. On this page, the status of the
database should show Up.
In the case of a RAC cluster database, the Status section will show the number of Instances for this
database and the status summary. Further down on this page, under the Instances section, verify that
each instance is in a “good” status. It is also possible to view the status of the cluster database by
clicking on Availability / High Availability Console. If implemented with Level 3 or Level 4 High
Availability, also validate the standby status in the High Availability Console.
Figure 10: High Availability Console
EM version 12.1.0.4 now provides more details on the OMS and Repository target page. It includes
three tabs of information called Repository, Metrics, and Schema. Each tab includes the following
data:
Repository
This page provides details about the repository database including the following:
» Configuration Details
» Initialization Parameters
» Incidents
» Repository Job Status
» Collection Performance
» Metric Rollup Performance
Figure 12: OMS and Repository - Repository
Metrics
This tab provides graphs showing the rollup of key repository performance measurements. The
information includes the following:
» Top 25 Metric Data Loading Target Types In Last 30 Days
» Top 10 Data Loading Metrics In Last 30 Days
» Metric Alerts Per Day In Last 30 Days
» Top 10 Metric Collection Errors By Target Type In Last 30 Days
Figure 13: OMS and Repository - Metrics
Schema
The Schema tab provides data pertaining to the repository database schema. The information
includes the following:
» Tablespace Growth Rate
» Top 20 Tables With Unused Space In Repository
» Purge Policies
» Partition Retention
Figure 14: OMS and Repository - Schema
Agent Availability
Prior to EM version 12.1.0.3.0, the status of the host target was derived from the status of the
agent target monitoring that particular host. EM had no way of knowing if a host had gone down.
When the agent missed a certain number of heartbeats, the OMS would run a reverse ping job to
check the agent’s upload and host communication status. Based on this outcome, the OMS would
mark the agent as “Unreachable”. Once the agent was able to communicate with the OMS again, it
would “tell” the OMS that it had been down. Therefore, the OMS only knew about the status of the
agent. In the case of agent down, it only knew about past statuses, never the present/current
severity. To further add to the problem, the OMS was not able to understand the actual status of a
host target.
Starting with the 12.1.0.3.0 release, EM is now able to more quickly determine the status of an
agent as well as the status for the host the agent is running on. This is done with a new feature
referred to as a partner agent. When an agent is pushed to a host, the OMS determines the
closest agent to that host (in the same sub-net) and pushes the monitoring details to that agent
(the EMD URL of the monitored agent). This partner agent will check the status of the agent it is
monitoring on a regular basis. If it fails to receive a response from the agent, it will then check the
status for that agent’s host and update the OMS with the proper status for both the agent and the
host.
To help explain this, consider the following scenarios:
Scenario 1
In this scenario, the agent goes down and the host reboots before the agent comes up.
TABLE 5: SCENARIO 1
Time | Agent | Host | Partner Sends | Agent Status in EM | Host Status in EM
10:00 | Goes DOWN | Is UP | Agent DOWN, Host UP | Agent Unreachable, Down | Agent Unreachable, with sub status host UP (unmonitored)
10:02 | | Crashes | Agent DOWN, Host DOWN | Agent Unreachable, Down | Agent Unreachable, Down
10:10 | | Comes UP | Agent DOWN, Host UP | Agent Unreachable, Down | Agent Down, with sub status host UP (unmonitored)
10:15 | Comes UP, uploads severities and send clean heartbeat | | | UP | UP
Scenario 2
In this scenario, there is a network issue between the agent and its partner agent with no
communication to the OMS.
TABLE 6: SCENARIO 2
Time | Agent | Host | Partner Sends | Agent Status in EM | Host Status in EM
10:00 | Network comm. break | | Agent Unreachable, Host DOWN | Unreachable, Normal | Agent Unreachable, Down
10:05 | Network comm. is UP between all players and pings OMS | | | UP | UP
Scenario 3
In this scenario, there is a network issue between the agent and its partner agent, but the agent is
able to communicate with the OMS. In this example, the unreachable status will be quickly cleared
and the history will show that the agent and host never went down.
TABLE 7: SCENARIO 3
Time | Agent | Host | Partner Sends | Agent Status in EM | Host Status in EM
10:00 | Network comm. break between agent and partner agent | | Agent Unreachable, Host DOWN | Agent Unreachable, Normal | Agent Unreachable, Normal
10:01 | Agent sends a clean heartbeat to OMS | | | UP (unreachable is cleared) | UP (unreachable is cleared)
10:01 | Network comm. issue is resolved between agent and partner agent | | | |
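The partner agent behaviour in the scenarios above can be summarized in a few lines of code. The following is only a minimal sketch: the function names and the simplified status strings are hypothetical and are not part of any Enterprise Manager API. It assumes the partner checks the agent first and the host second, and that a clean heartbeat received directly by the OMS overrides the partner's report, as in Scenario 3.

```python
from typing import Optional


def partner_report(agent_responds: bool, host_responds: bool) -> Optional[str]:
    """Return what a partner agent would send to the OMS after one check.

    Hypothetical helper: None means the monitored agent answered and
    there is nothing to report.
    """
    if agent_responds:
        return None
    # The agent did not answer, so the partner checks the agent's host next.
    return "Agent DOWN, Host UP" if host_responds else "Agent DOWN, Host DOWN"


def agent_status_in_em(report: Optional[str], clean_heartbeat: bool) -> str:
    """Combine the partner's report with what the OMS itself has seen."""
    # A clean heartbeat sent directly to the OMS clears the unreachable
    # state even if the partner could not reach the agent (Scenario 3).
    if clean_heartbeat or report is None:
        return "UP"
    return "Agent Unreachable"


if __name__ == "__main__":
    report = partner_report(agent_responds=False, host_responds=True)
    print(report, "->", agent_status_in_em(report, clean_heartbeat=False))
```

The point of the sketch is that the partner can only report what it observes from its own network position, which is why a network break between the two agents looks like a host outage until the agent's own heartbeat reaches the OMS.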
General Availability
To confirm the overall health of the complete list of EM components, navigate in the Enterprise Manager Console
to Setup / Manage Cloud Control / Health Overview, where the overall status is displayed.
To drill further into each component, click on the menu bar for OMS and Repository / Members / Show
All.
Check the status of the key components such as the EM services, application deployment, and WebLogic
Deployments as described above. The status should show Up. Clicking on the status icon will drill
down to show availability details. Each component represents a target in EM. If any components are
down, use the information provided on the target’s home page (e.g. errors/alerts) to assist in diagnosing
and resolving the availability issue. It is important to note that if the system is configured with Level 4 High
Availability using the standby domain setup, the standby OMS servers will show down. For additional
information on High Availability configurations, see the Oracle Enterprise Manager Cloud Control
Advanced Installation and Configuration Guide.
Figure 15: OMS and Repository All Members
EM Internal Subsystems
There are several internal subsystems that work in the background to process incoming data, evaluate
alerts and severities, send notifications and do internal housekeeping for EM. This section will review
four of the critical subsystems.
DBMS Scheduler
The DBMS Scheduler is a database feature and is used to execute SQL and PL/SQL at specific times.
If any of the system jobs are running behind schedule or down completely, they can cause significant
performance problems, stale and incorrect availability data, as well as missing critical alerts and
notifications. For the repository jobs to run, the DBMS_SCHEDULER must be enabled and the database
initialization parameter JOB_QUEUE_PROCESSES must be set to a non-zero value. It is common to
set JOB_QUEUE_PROCESSES to 0 during upgrades or patches, so recheck its value after each upgrade or patching cycle.
To view the Job status click on Setup / Manage Cloud Control / Repository.
Figure 16: Repository Jobs
In the Repository Scheduler Jobs Status section, check the following items:
1. Status - Make sure all jobs are Up. If there are errors, click on the error to get more details.
2. Processing Time (%) (Last Hour) - Seconds per hour for a job. If a job is consistently running at 50% or
more, there may be a resource problem in the database. The overall health and performance of the
database should be checked and any issues resolved if found to make sure the database does not start to
fall behind and thereby create a permanent backlog problem. If the processing time increases and runs
consistently as high as 75%, this is a problem and it may mean a need to increase resources for the
repository server.
3. Next Scheduled Run - If the next scheduled time is not correct or empty, the database has stopped
scheduling the job. The job that is not running can be resubmitted by selecting the job and clicking on the
“Restart Job” button at the top right of that window. It also provides an edit option for high cost
performance jobs to provide the ability to reschedule the next runtime. Only change the frequency of
runtime under guidance of Oracle.
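The three checks above lend themselves to a simple triage routine. The sketch below is illustrative only: the function names and the input shape are assumptions, and the values would have to be read from the Repository Scheduler Jobs Status section by whatever means is available. The 50% and 75% thresholds come from the text, and the processing time passed in is assumed to be a sustained value rather than a one-off reading.

```python
from typing import List, Optional


def scheduler_can_run_jobs(job_queue_processes: int) -> bool:
    """JOB_QUEUE_PROCESSES must be non-zero for the repository jobs to run."""
    return job_queue_processes > 0


def review_job(status: str, processing_pct: float, next_run: Optional[str]) -> List[str]:
    """Return a list of findings for one repository scheduler job.

    Hypothetical helper. processing_pct is the sustained
    'Processing Time (%) (Last Hour)' value for the job.
    """
    findings = []
    if status.strip().upper() != "UP":
        findings.append("job is not Up: open the error for details")
    if processing_pct >= 75:
        findings.append("processing time at 75% or more: repository may need more resources")
    elif processing_pct >= 50:
        findings.append("processing time at 50% or more: check database health")
    if not next_run:
        findings.append("no next scheduled run: job may need to be restarted")
    return findings


if __name__ == "__main__":
    for finding in review_job("Up", 62.0, None):
        print(finding)
```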
A few of the more critical system jobs are listed below with a description of the tasks that they control:
TABLE 8: KEY SCHEDULER JOBS
Job Name | Scheduler Job Name | Task
Agent Ping | EM_PING_MARK_NODE_STATUS | Keeps track of the health of the host targets in EM. A nonzero number means there are machines that are suspected to be down. As long as this number is low relative to the total number of machines in EM (considering that some may be in blackout or offline), there is not a major health issue for EM. There is a potential problem if the processing time is showing 30-40% or higher and should be diagnosed further.
Daily Maintenance | EM_DAILY_MAINTENANCE | This job does the daily repository maintenance tasks such as partition maintenance, stats updates, etc. If this job is not running, you will eventually stop receiving information into the repository.
Job Step | EM_JOBS_STEP_SCHED | This is the job that puts the work into the queues that are ready to be dispatched to the agents.
Repository Metrics | MGMT_COLLECTION.Collection Subsystem | This job shows the amount of work done for the repository metrics. This metric will have a number associated with it (e.g. Repository Metrics 71) and represent the short and long task workers. The short task workers handle tasks that should run in a minute or less and the long task workers handle the longer tasks. The best thing to look for here is that all Repository Metric jobs are within 10% of each other.
Rollup | EM_ROLLUP_SCHED_JOB | This job indicates the amount of data involved in the rollup job. This number may increase over time as more targets are added to the system but on a daily basis should remain about the same. Large spikes could indicate that agents are not communicating properly to the OMS.
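The "within 10% of each other" guideline for the Repository Metrics jobs can be checked mechanically. The sketch below is one possible reading, not an official definition: it assumes the comparison is the spread between the busiest and least busy worker, relative to the busiest one. The function name and the sample figures are hypothetical.

```python
from typing import Dict


def workers_balanced(processing_pct: Dict[str, float], tolerance: float = 0.10) -> bool:
    """Return True if all workers are within `tolerance` of each other.

    Assumption: 'within 10%' is interpreted as
    (highest - lowest) / highest <= 0.10.
    """
    values = list(processing_pct.values())
    if len(values) < 2:
        return True
    highest = max(values)
    if highest == 0:
        return True  # every worker is idle, so there is nothing to compare
    return (highest - min(values)) / highest <= tolerance


if __name__ == "__main__":
    sample = {
        "Repository Metrics 1": 20.0,
        "Repository Metrics 2": 19.0,
        "Repository Metrics 3": 12.0,  # hypothetical outlier
    }
    print(workers_balanced(sample))
```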
Database Advanced Queuing (AQ)
Both the OMS and the repository rely heavily on Advanced Queues, so the
Advanced Queues have to be ‘up’ and healthy. To confirm the status of the Advanced Queues in EM,
do the following:
1. Click on Setup / Manage Cloud Control / Health Overview
2. In the drop down list next to “OMS and Repository” select Monitoring/All Metrics
3. Look at the Metric for Management Services AQ Status as seen in the figure below.
Figure 17: Advanced Queuing
The current severity status of the underlying components can be checked by clicking on the Dequeue
Status or Enqueue Status for a particular Management Services AQ as seen in the figure below.
Figure 18: Advanced Queuing
If system performance deviates from previously experienced levels, the Advanced Queues may have
become fragmented. Refer to the MOS note on AQ performance tuning for further details:
Performance Tuning Advanced Queuing Databases and Applications [102926.1].
Details on the Advanced Queuing can also be seen via the following option in EMDIAG:
$ repvfy show aq
Name                             Enq Deq Rtn
-------------------------------- --- --- ---
EM_CNTR_QUEUE                    YES YES   0
EM_EVENT_BUS                     YES YES   0
EM_GROUP_EVENT_Q                 YES YES   0
EM_NOTIFY_Q                      YES YES   0
EM_SYSTEM_EVENT_Q                YES YES   0
MGMT_ADMINMSG_BUS                YES YES   0
MGMT_HOST_PING_Q                 YES YES   0
MGMT_LOADER_Q                    YES YES   0
MGMT_NOTIFY_INPUT_Q              YES YES   0
MGMT_NOTIFY_Q                    YES YES   0
MGMT_TASK_Q                      YES YES   0
-------------------------------- --- --- ---
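When this check is run regularly, it can be convenient to scan the output for queues that are not enabled for enqueue or dequeue. The following is a minimal sketch that assumes the four-column layout shown above (Name, Enq, Deq, Rtn) with whitespace-separated values; the parsing functions are hypothetical helpers, not part of EMDIAG, and the "NO" value in the sample is invented purely to show a failing case.

```python
from typing import Dict, List

# Hypothetical captured output; the second queue is deliberately broken.
SAMPLE = """\
Name                             Enq Deq Rtn
-------------------------------- --- --- ---
EM_CNTR_QUEUE                    YES YES   0
MGMT_NOTIFY_Q                    YES NO    0
-------------------------------- --- --- ---
"""


def parse_aq(text: str) -> List[Dict[str, object]]:
    """Parse 'repvfy show aq' style output into one record per queue."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines, the header row, and the dashed separator rows.
        if len(parts) != 4 or parts[0] == "Name" or set(parts[0]) == {"-"}:
            continue
        name, enq, deq, rtn = parts
        rows.append({"name": name, "enq": enq == "YES", "deq": deq == "YES", "rtn": int(rtn)})
    return rows


def queues_needing_attention(rows: List[Dict[str, object]]) -> List[str]:
    """Return the names of queues where enqueue or dequeue is not YES."""
    return [str(row["name"]) for row in rows if not (row["enq"] and row["deq"])]


if __name__ == "__main__":
    print(queues_needing_attention(parse_aq(SAMPLE)))
```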
Loader Subsystem
All the data collected by agents has to be loaded to the repository. The efficiency of this process can
greatly impact the performance and health of your system overall. A graph showing the Backoff
Requests can be found by doing the following:
1. Click on Setup / Manage Cloud Control / Health Overview
2. In the drop down list next to “OMS and Repository” select Monitoring/All Metrics
3. Look at the Metric for Overall Status as seen in the figure below.
Figure 19: Backoff Requests Metric
To monitor the loader process, look for a consistent increase in the Overall Backoff Requests in the
Last 10 Mins and the Overall Upload Backlog (Files)/(MB). These are good indicators of whether or
not the loader threads are keeping up with incoming data. Higher values for these metrics indicate the
system is backlogged and not keeping up; lower values indicate the loader throughput is efficient. For
additional details on loader metrics and throughput see the Sizing guide.
A loader backlog can cause delays in receiving critical information and notifications. It can also cause
the Agent to stop collecting data once it reaches its maximum threshold, to avoid filling up the file
system it is installed on. Backlogs can also cause poor console performance and OMS restarts if not
resolved quickly.
Some of the key metrics to watch are:
» Overall Backoff Requests in the Last 10 Mins
» Overall Rows Processed by Loader in the Last Hour
» Overall Upload Backlog (files)
» Overall Upload Backlog (MB)
» Overall Upload Rate (MB/sec)
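A "consistent increase" in a metric such as Overall Upload Backlog (files) can be detected from a short series of readings. The sketch below is a deliberately simple illustration: the function name is hypothetical, the readings are assumed to be ordered oldest to newest, and requiring every reading to exceed the previous one is only one possible definition of a steady climb.

```python
from typing import Sequence


def is_steadily_increasing(samples: Sequence[float], min_points: int = 4) -> bool:
    """Return True if every reading is higher than the one before it.

    Fewer than `min_points` readings are treated as not enough evidence.
    """
    if len(samples) < min_points:
        return False
    return all(later > earlier for earlier, later in zip(samples, samples[1:]))


if __name__ == "__main__":
    backlog_files = [120, 180, 260, 410]  # hypothetical readings
    print(is_steadily_increasing(backlog_files))
```

A temporary spike that falls back on the next reading does not trigger the check, which matches the guidance that a backlog is only a concern when it keeps growing.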
EM provides a graph showing the Upload Rate and the Upload Backlog as seen below. This graph is
found by clicking on Setup / Manage Cloud Control / Health Overview.
Figure 20: Upload Graph
Loader report
If an OMS is busy processing the uploaded XML files, it may send a backoff request to an agent,
asking the agent to back off sending the XML files for a specified period of time. EM provides a graph
showing the overall backoff requests for a 24 hour period. A sample of this graph is shown below and
can be found by clicking on Setup / Manage Cloud Control / Health Overview.
Figure 21: Backoff Requests Graph
EM also provides an out-of-the-box report showing loader statistics including the configured loader
resource allocation, loader performance and the agent count broken down by agent priority level. The
available values are None/Mission Critical/Production/Staging/Test/Development. This report is found
under Enterprise / Reports / Information Publisher / Loader Statistics. If the Loader Performance
(3 hours) chart shows a high number of backoff requests and there has not been a recent downtime, it
is an indication that the OMS cannot keep up with the load from the agent. This report will also provide
the priority level of the agents that can be used by the EMDIAG loader_health report as mentioned
below.
EMDIAG also provides a report for the health of the loader subsystem. By using repvfy dump
loader_health you can generate a report of loader health and statistics. The loader_health report will
break down the backoff requests based on the priority level (the lifecycle stage of the agent target) of the
agents. It is important to watch for backoff requests for mission critical and production agents. If there
are issues with these agents, contact Oracle Support for help in diagnosing the issue.
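Once the backoff counts per priority level have been read from the report, singling out the levels that matter most is straightforward. This sketch assumes the counts have already been transcribed into a dictionary keyed by the priority names listed earlier; it does not parse the loader_health report itself, whose layout is not shown here, and the function name is hypothetical.

```python
from typing import Dict

# Priority levels the text singles out as most important to watch.
WATCHED_LEVELS = ("Mission Critical", "Production")


def backoffs_needing_attention(by_priority: Dict[str, int]) -> Dict[str, int]:
    """Return only the watched priority levels that received backoff requests."""
    return {
        level: count
        for level, count in by_priority.items()
        if level in WATCHED_LEVELS and count > 0
    }


if __name__ == "__main__":
    counts = {"Mission Critical": 0, "Production": 14, "Test": 230}  # hypothetical
    print(backoffs_needing_attention(counts))
```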
Notification Subsystem
The notification system controls all e-mail, ticket connectors and custom notification methods. For
each event, the notification job checks to see if there is a required action and submits the task for
processing. A backlog in notifications can cause a delay in alerts being sent, or a missing alert
altogether.
In the Console go to Setup / Manage Cloud Control / Health Overview. Check the Notification
Performance section for a notification backlog. A steady increase needs to be evaluated further using
the guidelines below.
Figure 22: Notification Performance Graph
Select OMS and Repository / Monitoring / All Metrics. From here, validate Notification Status
metric is Up.
Figure 23: Notification Status
To determine if a specific notification queue is having a problem, select Pending Notifications Count
metric as seen below.
Figure 24: Pending Notification Count
There are 4 performance metrics for Notification delivery. By default, there are no Warning/Critical
thresholds. Once your system is running, evaluate the trend in these metrics and set a
Warning/Critical threshold based on this baseline. The metrics below can be found by selecting Setup
/ Manage Cloud Control / Health Overview. Under the drop down list next to the OMS and Repository target, select
Monitoring / All Metrics / Notification Delivery Performance.
Average Notification Time (seconds) / Notification Processing Time (% of last hour) - Average
time for notification delivery and the total amount of processing time for notification delivery. If the
average delivery time and notification processing time are both steadily increasing, you have a
performance or capacity problem which will create a risk of not receiving notifications in a timely manner.
If the system is not experiencing a general performance problem, examine the notification queue detail
to look for an issue with a specific queue. If a specific issue is not found, contact Oracle Support.
Notifications Processed (Last Hour) - The total number of notifications delivered by the
Management Service over the previous 10 minutes. The metric is collected every 10 mins and no
alerts will be generated. If the number of notifications processed is continually increasing over several
days, consider adding another Management Service.
Pending Notifications Count - Notifications waiting to be delivered. If this number is continually
increasing there is a notification backlog. Look at the view to determine which queue has an issue and
use this to further diagnose the issue.
In addition, you can use the repvfy dump notif_health command to generate a detailed report to
identify Notification statistics and backlogs.
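Because the notification delivery metrics ship without Warning/Critical thresholds, the values have to be derived from an observed baseline. The sketch below shows one common way to do that, placing the thresholds a number of standard deviations above the baseline mean. This method is an assumption chosen for illustration; the source does not prescribe how the baseline should be turned into thresholds, and the function name and multipliers are hypothetical.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def baseline_thresholds(
    samples: Sequence[float], warn_sigma: float = 2.0, crit_sigma: float = 3.0
) -> Tuple[float, float]:
    """Derive (warning, critical) thresholds from baseline readings.

    Assumption: thresholds sit warn_sigma and crit_sigma population
    standard deviations above the mean of the baseline period.
    """
    if len(samples) < 2:
        raise ValueError("need at least two baseline readings")
    centre = mean(samples)
    spread = pstdev(samples)
    return centre + warn_sigma * spread, centre + crit_sigma * spread


if __name__ == "__main__":
    avg_notification_seconds = [4.0, 5.5, 4.8, 6.1, 5.0]  # hypothetical baseline
    warning, critical = baseline_thresholds(avg_notification_seconds)
    print(round(warning, 2), round(critical, 2))
```

Whatever method is used, the thresholds should be revisited when the number of targets or notification rules changes, since the baseline itself will shift.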
Task Subsystem
EM provides a chart to display the backlog performance of the repository collections as seen in the
example below. This chart can be found by clicking on Setup / Manage Cloud Control / Repository.
A steady increase in backlog indicates a problem that needs to be evaluated.
Figure 25: Repository Collections Graph
Many of the repository collection jobs are divided between short running tasks and long running tasks.
Each EM environment should be configured with a minimum of 3 short running task workers and 2
long running task workers. The performance of these task workers can be monitored via the details in
the Jobs Status chart above. Click on the drop down list to select the Long Running workers. The
graph for 12.1.0.4 now shows more information about the Workers such as the number of collections
in backlog, throughput per second, and average collection duration (seconds) for both short running
and long running workers. The job names are Repository Metrics xx (where xx is a number). The
lower numbers are the short running task workers and the higher numbers are the long running task
workers. Look for any large spikes in processing time or throughput as this could indicate some
occurrence that is generating more work for the repository (e.g. many server outages). If the
throughput for these Repository Metric jobs is consistently high and the backlog is continuous or
grows, then consider adding another task worker.
EM 12.1.0.4 has a new feature here called the Collection Manager. It is found by clicking on the
“Configure” push button as seen in the figure above. The figure below shows the options available
when configuring the Collection Manager. It is recommended to turn this option on if high spikes are
seen in the backlog of tasks at specific times. The Collection Manager will check at specific
frequencies (30 mins) and if the backlog is climbing, a task worker will be added up to the specified
maximum number of workers. When the backlog decreases, the Collection Manager will remove task
workers. It is recommended that the maximum workers not be set higher than 5. If the backlog is not
going down when using up to 5 workers, then contact Oracle Support for further assistance.
Figure 26: Repository Collections Graph
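The Collection Manager behaviour described above amounts to a small control loop. The following sketch simulates it under stated assumptions: one worker is added or removed per check, the floor is the recommended minimum of 3 short running workers, and the ceiling is the recommended maximum of 5. The function name and the backlog figures are hypothetical, and the real feature may adjust workers differently.

```python
from typing import List


def adjust_workers(
    current: int, previous_backlog: int, backlog: int,
    min_workers: int = 3, max_workers: int = 5,
) -> int:
    """Return the worker count after one periodic check (assumed every 30 mins)."""
    if backlog > previous_backlog and current < max_workers:
        return current + 1  # backlog is climbing: add a task worker
    if backlog < previous_backlog and current > min_workers:
        return current - 1  # backlog is shrinking: remove a task worker
    return current


def simulate(backlogs: List[int], start_workers: int = 3) -> List[int]:
    """Replay a series of backlog readings and record the worker count after each."""
    workers = start_workers
    history = []
    for previous, latest in zip(backlogs, backlogs[1:]):
        workers = adjust_workers(workers, previous, latest)
        history.append(workers)
    return history


if __name__ == "__main__":
    print(simulate([100, 150, 220, 300, 310, 200, 90]))
```

In the simulated run the worker count stops at the ceiling while the backlog keeps growing, which is the situation in which the text recommends contacting Oracle Support rather than raising the maximum.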
To get a health report of the Task sub-system, run this EMDIAG report:
$ repvfy dump task_health
If you suspect a performance problem with the tasks the workers are executing, execute the following
EMDIAG tests to look for ‘expensive’ tasks:
$ repvfy verify repository -test 6013 (short running tasks)
$ repvfy verify repository -test 6014 (long running tasks)
If a ‘rogue/expensive’ task is found, it can be further debugged using the following EMDIAG
commands:
$ repvfy send run_task -id <task id>
$ repvfy dump trace
EM Job System
The EM job system is crucial to Enterprise Manager’s health. The majority of background processes
and tasks are run via a series of jobs. Included in these jobs are loading metric data, calculating
availability of composite targets, rollup and purge of metric data and notifications. This Job System is
an OMS subsystem and includes a Job Scheduler and Job Workers. The Job Scheduler in turn
consists of two components, the Job Step Scheduler and the Job Dispatcher. Each of these
components is described in further detail below.
Job Step Scheduler – The Job Step Scheduler is a global component so there is only one per EM
environment. It is scheduled to run by the DBMS Scheduler. The primary purpose of this component
is to look for jobs that need to be executed. Make sure that this job is up. This can be seen by
clicking on Setup / Manage Cloud Control / Repository and looking for the status of the Job Step
Scheduler in the Repository Scheduler Jobs Status section as seen below:
Figure 27: Job Step Scheduler
Job Dispatcher - The EM Job system also has a notion of a 'short' and 'long' job (execution time
wise) and has separate worker pools in the OMS (not in the database as with the job workers) to
handle those requests. The Job Dispatcher runs locally on each OMS and its purpose is to dispatch
the jobs found by the Job Step Scheduler to the job workers. If the dispatcher cannot keep up with the
work in the queue, a backlog is created. This is not a problem as long as the backlog is temporary. If it
is not, then either the dispatcher is not able to keep up with the amount of work, which could mean
adding another OMS server, or there is a problem with the job workers and they are not able to accept
the work from the dispatcher (see the next section below for details on how to diagnose a job worker
problem).
Job Workers – The Job Workers take work from the Job Dispatcher and send it to the appropriate
agent and they also receive information from the agents. If Job Workers are always busy and never
free, then capacity needs to be added either via another OMS server or by increasing the number of
job workers and potentially increasing the number of db connections (each job worker takes a
connection to the database). EM provides a way to tell if the Job Workers are keeping up with the
dispatched work. If the amount of work the dispatcher is able to give to the job workers approaches
zero, then the workers are not keeping up.
To see the Job Worker details for each OMS server, select Setup / Manage Cloud Control /
Management Services. The top right quarter of the window is titled “Job System”. Under the “Recent
Job Executions Summary” table, click on the link called “More Details…”. This will open a new window
showing the Job Dispatcher details for each OMS server. In that table, the Configured Threads
column is the number of threads configured for each thread pool. The Avg. Threads Available is the
number of threads that are waiting to take work from the dispatcher. See screenshot below (note the
configured threads shown below are the defaults).
Figure 28: Free Threads
» The number of Configured Threads should be the same for each OMS server. The values in the Avg. Steps
Dispatched/Min and Avg. Threads Available columns should be approximately the same for each OMS while
EM is running. If the values are consistently different then one OMS is working harder than the others. At
this point, it is best to contact Oracle Support for further diagnosing.
» If the number for Avg. Threads Available is getting close to zero then it means the dispatcher CANNOT
dispatch to ALL the workers in a timely fashion.
» If the Avg. Steps Dispatched/Min is HIGH, there is a resource problem, and the environment could probably
benefit from more worker threads; however, do not go beyond 'doubling' the size of the threads. If doubling
the number of threads does not seem high enough, contact Oracle as it might be better to add an additional
OMS.
» If the Avg. Steps Dispatched/Min is LOW, but the number of available threads per cycle is also low, this
typically means that either a thread is stuck, or is 'busy for too long'. If this persists, refer to the section
“Omsvfy Commands” in the Use of the emctl dump Options to Collect OMS Log Files [ID 1369918.1] for
steps on how to take a thread dump of the OMS processes. It is also possible to use EMDIAG for this
information with this command:
$ omsvfy snapshot oms
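The guidance in the bullets above can be expressed as a small decision function. This is only a sketch: the text does not define numeric values for HIGH, LOW or "close to zero", so the thresholds here are hypothetical parameters that each site would have to choose from its own baseline, and the function is not part of any EM tool.

```python
def assess_dispatcher(
    avg_steps_per_min: float,
    avg_threads_available: float,
    high_steps: float,
    low_threads: float = 1.0,
) -> str:
    """Classify one row of the Job Dispatcher details table.

    high_steps and low_threads are site-specific assumptions.
    """
    if avg_steps_per_min >= high_steps:
        # High dispatch rate: the text treats this as a resource problem.
        return "resource problem: add worker threads (at most double) or consider another OMS"
    if avg_threads_available <= low_threads:
        # Low dispatch rate and few free threads: a thread may be stuck or busy too long.
        return "possible stuck thread: take a thread dump of the OMS processes"
    return "workers are keeping up"


if __name__ == "__main__":
    print(assess_dispatcher(avg_steps_per_min=50, avg_threads_available=0.5, high_steps=400))
```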
Agent Health
The overall health of the environment can also be seen by the status of the Agents. The central view
for all agents can be seen from Setup / Manage Cloud Control / Agents. From here you can
evaluate agents that are blacked out, unreachable, pending or blocked.
Figure 29: Manage Cloud Control Agents Page
This is a very powerful page for EM Administrators as you can issue various agent control commands
from this page, including: startup, shutdown, block, unblock, restart, secure, unsecure. It is possible to
edit agent properties (emd.properties file) or submit a job to edit properties for multiple agents at one
time. For additional details on managing and configuring Agents, see Controlling and Configuring
Management Agents in the Oracle Enterprise Manager Cloud Control Administrator’s Guide.
A significant percentage of agents down or not responding indicates an unhealthy environment and a
lack of proper monitoring. The goal is to have 100% agent availability. Spot check the agent health
daily watching for a significant increase in the percentage of problem agents and checking the alerts
for the problem agents, correcting those that are creating issues (pinging, etc). On a bi-weekly basis,
take the time to fix those agents that have shown problems for several days.
Starting with version 12.1.0.3.0, EM now has more details for the status of the targets. For example,
the status for a target that has recently been discovered may be “Diagnose for Status Pending (Target
Addition in Progress)”, and the status for a host that is up but whose agent is down or unreachable would be
“Diagnose for Up (Unmonitored)”. These more detailed statuses can be seen in several new locations.
They are reflected in the All Targets tab via the new icons. Note: clicking on the icon on this page will
open the Symptom Analysis page which will provide details on the possible root cause and resolution.
The new target status details can also be seen on the Status Graph found on the Enterprise Summary
page. To see the breakdown of the different sub-statuses, click on the Unknown Status for this graph
and a pop-up window will open detailing the sub-status breakdown as seen in the figure below.
Figure 30: Enterprise Summary Target Status - Unknown
The new statuses are also represented on the individual target’s home page and at the top of the
agent’s home page. The Incidents and Problems section will have an incident for this status. Clicking
on the incident will open the incident details page containing recommendations/documentation in the
Guided Resolution section for addressing the particular target status.
A large number of agents in the “Agent Unreachable”, “Status Pending” and/or
“Blocked/Misconfigured” status indicates that these targets are not being properly monitored/managed.
Click on the status type in the summary line with the most problematic agents to get a list of these
agents and begin diagnosing to resolve the issues. Basic agent troubleshooting steps to be followed:
TABLE 9: AGENT TROUBLESHOOTING
Check | Notes
Host Up | Check to verify if the host is up. If not, is the host still valid? Many times hosts are decommissioned but not removed from monitoring.
Agent Up | Check to verify if the agent is up: emctl status agent. Start the agent if necessary.
Agent Uploading | In the emctl status agent output, check for messages about heartbeat/upload. Attempt an upload with emctl upload.
OMS Reachable | Ping the OMS from the agent, and the agent from the OMS; ensure ports are not blocked by firewalls.
Check Logs | $EMSTATE/agent_inst/sysman/log/ (Ex: /u01/app/oracle/em/agent_inst/sysman/log)
» gcagent.log - contains trace, debug, information, error or warning messages from the agent.
» gcagent_sdk.trc - logging about fetchlets and receivelets
» gcagent_mdu.log - tracks the metadata updates to the agent
» emctl.log - information from the execution of the emctl commands.
Agent Dump | If the agent is still not uploading or reachable, run a target and availability dump on the agent target from repvfy:
repvfy dump target -name <agent:port>
repvfy dump availability -name <agent:port>
REPVFY can also provide an overview of agent health through the repvfy dump agent_health report.
The report provides details about the agent such as agent ping statistics, agent down statistics
and system errors.
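The OMS Reachable check in Table 9 asks you to confirm that firewalls are not blocking the ports between the agent and the OMS. A minimal Python sketch of that check follows. It is an illustration rather than part of Enterprise Manager: the function name port_open is hypothetical, and you must supply the host names and port numbers of your own environment.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

Run it from the agent host against the OMS upload port and from the OMS host against the agent port, for example port_open("oms.example.com", 4903), where both the host name and the port are placeholders. A successful connection only shows that the port is open; it does not prove that the agent is uploading.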
Events and Incidents
It is also necessary to review Critical and Warning events, which could indicate an underlying issue
that may lead to an outage. In addition, a large volume of alerts has a performance impact on the EM
system. Metric errors indicate that data is not being collected or monitored properly, and these
should be resolved to have an accurate picture of the current system status. For a detailed look at
using Incident Manager, see the Oracle Enterprise Manager 12c Cloud Control Administrator’s Guide.
Below is a list of some of the places to check for events and/or incidents.
1. OMS and Repository Events and Incidents – Click on Setup / Manage Cloud Control / Health
Overview. Then from the target menu select OMS and Repository / Monitoring / Incident Manager.
This will filter the events and incidents to those related to the OMS and Repository targets. The default
view is All open incidents.
Figure 31: Open Incidents Page
Click on Events without incidents to see additional events. Depending on your incident rules, you
may not be receiving an incident for each event. For details on how to create the recommended rule
set to ensure notifications are sent to the EM administrator, refer to Oracle Enterprise Manager
12.1.0.4 Configuration Best Practices [1929586.1].
Figure 32: Open Events Page
Clicking on an individual message will provide more details for that particular alert. Look for
repeating messages and address these first. Some alerts must be manually closed, such as TNS
errors or alert log errors. These will have an additional action of Close as seen below. Clearing
these errors regularly helps maintain a clean environment. This can also be done with the EM CLI
utility using the clear_stateless_alerts verb.
Figure 33: Event Detail
• Note: You may see BEA-337 [WebLogicServer] errors coming from WebLogic Server. By default,
WLS pings applications and waits up to 600 seconds for a response. EM keeps threads running as
long as there is work in the queue, so those threads do not respond to the heartbeat, causing WLS
to time out and raise the error. To work around this, increase the stuck thread timeout. Log on to
the WLS Administration Console, click Environment in the Domain Structure menu and expand
Servers. For each server, click the server name and then the Tuning tab in the middle window.
Change the value for Stuck Thread Max Time to 1800. Save and activate the change. This requires
a restart of the OMS.
Figure 34: Timeout Error
2. Target Incidents – Click on Enterprise / Monitoring / Incident Manager. The list of incidents can be
sorted by clicking on the column heading. To find the highest number of repeating error messages to
address first, click on the Summary column to sort by error message.
3. System Errors – EM 12c provides a system error log page. This page details the errors found on the
repository and/or the management services. The URL to this page is
http://your_em_link/em/console/health/healthSystemError. This page will provide information such as the
component type, the agent monitoring that component, date and time of the error, level, and the error
message text. It is used for advanced fault research and should only need to be reviewed to help resolve
a problem that has not been resolved through any of the other event and incident management tools. It is
best to work with Oracle Support for help in resolving these issues.
Figure 35: Health System Error Page
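The advice above to look for repeating messages and address them first can be sketched in a few lines of Python. This is an illustration only: it assumes you have already exported the incident or event summaries to a plain list of strings by some means of your own, and the function name rank_repeating is hypothetical.

```python
from collections import Counter


def rank_repeating(summaries, minimum=2):
    """Return (summary, count) pairs seen at least `minimum` times, most frequent first."""
    # Ignore blank entries and surrounding whitespace so identical messages group together.
    counts = Counter(s.strip() for s in summaries if s.strip())
    return [(text, n) for text, n in counts.most_common() if n >= minimum]
```

The result mirrors what sorting by the Summary column shows in Incident Manager: the messages at the top of the list are the ones to resolve first.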
Log & Trace Files
As part of diagnosing problems with the different EM components, it is important to review the log and
trace files for these components. The table below details the standard location for log and trace files
broken down by the different components. For more details on managing log files, refer to Oracle
Enterprise Manager 12c Cloud Control Administrator’s Guide.
TABLE 10: LOG/TRACE FILES
EM Component: Oracle Management Agent
Log Files: $EMSTATE/sysman/log (“emctl getemhome” will return the location for $EMSTATE)
Ex: /u01/app/oracle/em/agent_inst/sysman/log
Trace Files: $EMSTATE/diag/ofm/emagent/emagent/trace
EM Component: Oracle Management Service
Log Files: $MWARE/gc_inst/em/<OMSNAME>/sysman/log (where $MWARE is the middleware home and
OMSNAME is the name of the OMS instance, ex: EMGC_OMS1)
Trace Files: $MWARE/gc_inst/em/<OMSNAME>/sysman/log (the same directory as the log files)
EM Component: Oracle HTTP Server (OHS)
Log Files: <EM_INSTANCE_BASE>/<webtier_instance_name>/diagnostics/logs/OHS/<ohs_name>
Ex: /u01/app/oracle/MWare/gc_inst/WebTierIH1/diagnostics/logs/OHS/ohs1
EM Component: OPMN
Log Files: <EM_INSTANCE_BASE>/<webtier_instance_name>/diagnostics/logs/OPMN/<opmn_name>
Ex: /u01/app/oracle/MWare/gc_inst/WebTierIH1/diagnostics/logs/OPMN/opmn
EM Component: Oracle WebLogic
Log Files: <EM_INSTANCE_BASE>/user_projects/domains/<domain_name>/servers/<SERVER_NAME>/logs/<SERVER_NAME>.log
Ex: /u01/app/oracle/MWare/gc_inst/user_projects/domains/GCDomain/servers/EMGC_OMS1/logs
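The locations in Table 10 are templates with placeholders. As a sketch of how they expand, the Python helpers below build the directories from their parts. The function names are hypothetical and the values passed in the usage example are simply the example values shown in the table, so substitute the names used in your own installation.

```python
from pathlib import PurePosixPath


def agent_log_dir(emstate):
    """$EMSTATE/sysman/log"""
    return PurePosixPath(emstate, "sysman", "log")


def ohs_log_dir(em_instance_base, webtier_instance_name, ohs_name):
    """<EM_INSTANCE_BASE>/<webtier_instance_name>/diagnostics/logs/OHS/<ohs_name>"""
    return PurePosixPath(em_instance_base, webtier_instance_name,
                         "diagnostics", "logs", "OHS", ohs_name)


def opmn_log_dir(em_instance_base, webtier_instance_name, opmn_name):
    """<EM_INSTANCE_BASE>/<webtier_instance_name>/diagnostics/logs/OPMN/<opmn_name>"""
    return PurePosixPath(em_instance_base, webtier_instance_name,
                         "diagnostics", "logs", "OPMN", opmn_name)


def weblogic_log_dir(em_instance_base, domain_name, server_name):
    """<EM_INSTANCE_BASE>/user_projects/domains/<domain_name>/servers/<SERVER_NAME>/logs"""
    return PurePosixPath(em_instance_base, "user_projects", "domains",
                         domain_name, "servers", server_name, "logs")
```

For example, ohs_log_dir("/u01/app/oracle/MWare/gc_inst", "WebTierIH1", "ohs1") yields the OHS example path given in the table.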
Incident Files
OMS Incident Files
Any errors in these log files indicate product defects (bugs). Open an SR with Oracle Support for
these issues. There are two different locations for the Automatic Diagnostic Repository (ADR)
incidents created on the OMS servers. These are as follows:
WebLogic Server incidents:
<EM_INSTANCE_BASE>/user_projects/domains/<domain_name>/servers/<SERVER_NAME>/adr/diag/ofm/EMGC_DOMAIN/EMOMS/incident
Ex:
/u01/app/oracle/MWare/gc_inst/user_projects/domains/GCDomain/servers/EMGC_OMS1/adr/diag/ofm/EMGC_DOMAIN/EMOMS/incident
EMS incidents:
<EM_INSTANCE_BASE>/user_projects/domains/<domain_name>/servers/<SERVER_NAME>/adr/diag/ofm/<domain_name>/<SERVER_NAME>/incident
Ex:
/u01/app/oracle/MWare/gc_inst/user_projects/domains/GCDomain/servers/EMGC_OMS1/adr/diag/ofm/GCDomain/EMGC_OMS1/incident
Agent Incident Files
The ADR incidents created for the Agent are found here:
$EMSTATE/diag/ofm/emagent/emagent/incident
Ex:
/u01/app/oracle/em/agent_inst/diag/ofm/emagent/emagent/incident
NOTE: For more details on gathering incident information, refer to 12c Cloud Control: How to Invoke
ADR Command Interpreter (adrci) in OMS or Agent Home? [1512905.1]
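Before opening an SR it can help to see which incidents are the most recent. The Python sketch below lists the newest subdirectories of one of the incident locations given above. It assumes that each incident occupies its own subdirectory under the incident path; the function name recent_incidents is hypothetical, and adrci remains the supported way to inspect and package incidents.

```python
from pathlib import Path


def recent_incidents(incident_dir, limit=5):
    """Return up to `limit` incident subdirectories, newest modification time first."""
    root = Path(incident_dir)
    if not root.is_dir():
        # Nothing to report if the location does not exist on this host.
        return []
    dirs = [p for p in root.iterdir() if p.is_dir()]
    dirs.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return dirs[:limit]
```

For the agent, pass the $EMSTATE/diag/ofm/emagent/emagent/incident location with $EMSTATE expanded to its actual value.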
Troubleshooting
The following table lists high-level process flows for troubleshooting various issues with Enterprise
Manager.
TABLE 11: TROUBLESHOOTING
Issue / Component
Performance
» Run repvfy execute optimize (for further details on this repvfy command, refer to Oracle
Enterprise Manager 12.1.0.4 Configuration Best Practices [1929586.1])
» Evaluate DB performance, locks, waits, etc.
» Look for ADDM recommendations
» Validate SYSMAN statistics
» Run repvfy dump performance
» Run repvfy dump errors
Jobs
» Check DBMS_SCHEDULER status
» Check value of job_queue_processes
» Run repvfy dump job_health
» Check for errors relating to a specific job failure (see MOS note 744645.1 to identify the job)
» Refer to MOS notes 783357.1 and 1520580.1 for further help in diagnosing an issue with jobs
Notifications – if a notification is missing or late
» Check event/incident details to see if Notification was triggered
» Check EM Jobs Service – Notification Job
» Run repvfy dump notif_health
Events – missing event or incident
» Check for loader backlog (repvfy dump loader_health)
» Check agent status (not blocked, uploading?)
» Check target thresholds
» Check incident rules
OMS Availability – see MOS note 1432335.1 for details on OMS Process Control
» Verify that the repository database and listener are up
» Verify that the sysman, sysman_opss, sysman_mds user accounts in the repository database are open
» Check log files (see MOS note 1448308.1)
  » emctl - <EM_INSTANCE_BASE>/em/EMGC_OMSn/sysman/log
  » OPMN - <EM_INSTANCE_BASE>/WebTierIH1/diagnostics/logs/OPMN/opmn
  » HTTP_SERVER - <EM_INSTANCE_BASE>/WebTierIH1/diagnostics/logs/OHS/ohs1
  » EM Node Manager - <EM_INSTANCE_BASE>/NodeManager/emnodemanager
  » Admin Server - <EM_INSTANCE_BASE>/user_projects/domains/GCDomain/servers/EMGC_ADMINSERVER/logs
  » EM Managed Server - <EM_INSTANCE_BASE>/user_projects/domains/GCDomain/servers/EMGC_OMS1/logs
» For diagnosing issues with connectivity between OMS and the Repository, refer to MOS note 1448007.1
Target Availability
» Check gcagent.log for ERROR messages
» Run repvfy dump target
» Run repvfy dump availability
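The first Target Availability step, checking gcagent.log for ERROR messages, can be sketched in Python as follows. The sketch rests on two assumptions that you should verify against your own files: that the severity appears in each line as the space-delimited token ERROR, and that rolled-over logs sit in the same directory with names matching gcagent*.log*. The function name error_counts is hypothetical.

```python
from pathlib import Path


def error_counts(log_dir, pattern="gcagent*.log*", marker=" ERROR "):
    """Map each matching log file name to the number of lines containing marker."""
    counts = {}
    for path in sorted(Path(log_dir).glob(pattern)):
        if path.is_file():
            # errors="replace" keeps the scan going if a line holds undecodable bytes.
            with path.open(errors="replace") as fh:
                counts[path.name] = sum(1 for line in fh if marker in line)
    return counts
```

Point it at the agent log directory ($EMSTATE/sysman/log with $EMSTATE expanded). A file with a sharply higher count than the others is a reasonable place to start reading.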
The table below shows the different target availability states and the guided resolution
recommendations.
TABLE 12: TARGET AVAILABILITY STATES
Availability State: NA
Guided Resolution: N/A
Availability State: Down
Guided Resolution: If the target was brought down as part of planned maintenance, consider creating
a blackout on the target. If the target was brought down in error, restart it by going to the target
homepage, target menu -> Control -> Start up. If the target status is not correct, refer to My Oracle
Support article Enterprise Manager 12c: How to run the "Targets Status Diagnostics Report" to
Troubleshoot Target Status Availability Issues (up, down, metric collection error, pending,
unreachable) for all Targets (Doc ID 1546575.1).
Availability State: Up
Guided Resolution: (none)
Availability State: Error
Guided Resolution: To troubleshoot, refer to My Oracle Support Doc ID 1546575.1.
Availability State: Agent Down (Agent Down Target Up Unmonitored)
Guided Resolution: Target is up but Agent is down. Start the Agent.
Availability State: Unreachable (Unreachable Target Down)
Guided Resolution: If the agent was brought down in error, restart it by going to the agent homepage,
menu "Agent -> Control -> Start up...". If the agent was brought down as part of planned
maintenance, consider creating a blackout on the agent.
Availability State: Unreachable (Agent Down)
Guided Resolution: If the agent was brought down in error, restart it by going to the agent homepage,
menu "Agent -> Control -> Start up...". If the agent was brought down as part of planned
maintenance, consider creating a blackout on the agent.
Availability State: Agent Unreachable Target Up Unmonitored
Guided Resolution: To troubleshoot, go to the agent homepage and run the Symptom Analysis tool
located next to the Status field. Also, refer to My Oracle Support Doc ID 1546575.1.
Availability State: Under Migration
Guided Resolution: Agent is unreachable as it is under migration.
Availability State: Unreachable (Readonly Filesystem)
Guided Resolution: Agent cannot write to file system. Check Agent file system. To troubleshoot, go to
the agent homepage and run the Symptom Analysis tool located next to the Status field. Also, refer to
My Oracle Support Doc ID 1546575.1.
Availability State: Unreachable (Collection Disabled)
Guided Resolution: Agent Collections have been disabled. Check that Agent can upload to OMS. To
troubleshoot, go to the agent homepage and run the Symptom Analysis tool located next to the Status
field. Also, refer to My Oracle Support Doc ID 1546575.1.
Availability State: Unreachable (Disk Full)
Guided Resolution: Agent file system is full. Check available space. To troubleshoot, go to the agent
homepage and run the Symptom Analysis tool located next to the Status field. Also, refer to My Oracle
Support Doc ID 1546575.1.
Availability State: Unreachable (Blackout)
Guided Resolution: Agent is unreachable because its first severity has not yet arrived after the
blackout ended.
Availability State: Unreachable (Agent Block Manual)
Guided Resolution: Agent has been blocked manually. Unblock the Agent.
Availability State: Unreachable (Agent Block Plugin Mismatch)
Guided Resolution: Agent has been blocked due to Plug-in mismatch. If Agent has been restored from a
backup, perform an Agent Resync.
Availability State: Unreachable (Agent Block Counter)
Guided Resolution: Agent has been blocked due to Bounce Counter mismatch. If Agent has been restored
from a backup, perform an Agent Resync.
Availability State: Unreachable (Agent Misconfigured)
Guided Resolution: Agent is configured for communication with another OMS. Check Agent configuration.
Availability State: Unreachable (Agent Communication Broken)
Guided Resolution: Agent is unreachable due to communication break between agent and the OMS.
Availability State: Blackout
Guided Resolution: (none)
Availability State: Unknown
Guided Resolution: To troubleshoot, refer to My Oracle Support Doc ID 1546575.1.
Availability State: Status Pending (Add Target)
Guided Resolution: Target addition is in progress. To troubleshoot, refer to My Oracle Support Doc ID
1546575.1.
Availability State: Status Pending (Blackout Ended)
Guided Resolution: Blackout has recently ended on this Target and Availability Status is pending. To
troubleshoot, refer to My Oracle Support Doc ID 1546575.1.
Availability State: Status Pending (Error)
Guided Resolution: Metric error has recently ended on this Target and Availability Status is pending.
To troubleshoot, refer to My Oracle Support Doc ID 1546575.1.
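When triaging many unreachable agents at once, the first action from Table 12 can be looked up by sub-status. The Python sketch below is a convenience built from the recommendations in the table, not an Enterprise Manager API. The dictionary and function names are hypothetical, the sub-status labels are the ones printed in the table, and the fallback text for unlisted states is an assumption based on the advice that recurs throughout the table.

```python
# First action per Unreachable sub-status, paraphrased from Table 12.
UNREACHABLE_ACTIONS = {
    "readonly filesystem": "Check Agent file system.",
    "collection disabled": "Check that Agent can upload to OMS.",
    "disk full": "Check available space.",
    "agent block manual": "Unblock the Agent.",
    "agent block plugin mismatch": "If restored from a backup, perform an Agent Resync.",
    "agent block counter": "If restored from a backup, perform an Agent Resync.",
    "agent misconfigured": "Check Agent configuration.",
}

# Assumed fallback for any sub-status not listed above.
DEFAULT_ACTION = "Run the Symptom Analysis tool and see Doc ID 1546575.1."


def first_action(substatus):
    """Return the first recommended action for an Unreachable sub-status."""
    # Normalize case and whitespace so "Disk  Full" and "disk full" match.
    key = " ".join(substatus.split()).lower()
    return UNREACHABLE_ACTIONS.get(key, DEFAULT_ACTION)
```

Grouping agents by the returned action gives a short work list, for example all agents that only need free disk space.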
Conclusion
As an enterprise’s environment grows, its dependency on Oracle Enterprise Manager 12c to help
monitor and administer that environment becomes very important. The EM environment itself must
therefore be supported, maintained and treated as being as highly available as the most highly
available target it manages. EM must be properly configured, monitored, maintained and high
performing to provide the daily monitoring and administration capabilities that an enterprise
requires to maintain its environment.
Oracle Corporation, World Headquarters
500 Oracle Parkway
Redwood Shores, CA 94065, USA
Worldwide Inquiries
Phone: +1.650.506.7000
Fax: +1.650.506.7200
CONNECT WITH US
blogs.oracle.com/oracle
facebook.com/oracle
twitter.com/oracle
oracle.com
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. This document is provided for information purposes only, and the
contents hereof are subject to change without notice. This document is not warranted to be error-free, nor subject to any other
warranties or conditions, whether expressed orally or implied in law, including implied warranties and conditions of merchantability or
fitness for a particular purpose. We specifically disclaim any liability with respect to this document, and no contractual obligations are
formed either directly or indirectly by this document. This document may not be reproduced or transmitted in any form or by any
means, electronic or mechanical, for any purpose, without our prior written permission.
Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners.
Intel and Intel Xeon are trademarks or registered trademarks of Intel Corporation. All SPARC trademarks are used under license and
are trademarks or registered trademarks of SPARC International, Inc. AMD, Opteron, the AMD logo, and the AMD Opteron logo are
trademarks or registered trademarks of Advanced Micro Devices. UNIX is a registered trademark of The Open Group. 1014