Every Oracle 8i Parallel Server database contains the following components:
- A cluster manager. This is supplied by the OS vendor (except on Windows, where Oracle supplies it) and includes a node-monitoring facility and a failure-detection mechanism.
- A distributed lock manager (DLM). The DLM includes deadlock detection and resource mastering.
- Cluster interconnect.
- A shared disk array.
Vendor-specific cluster management software is not discussed here and is mentioned only for the sake of completeness. The vendor product is the basic mandatory cluster software needed before installing and configuring OPS and some variants of earlier versions of RAC. It is generally called the cluster manager (CM) and has its own group membership services, node monitor, and other core layers.
Cluster Group Services (CGS)
One of the key “hidden” or lesser-known components of OPS is Cluster Group Services (CGS). CGS has some OSD components (such as the node monitor interface); the rest is the GMS (Group Membership Services) part built into the Oracle kernel. CGS holds a key repository used by the DLM for communication and network-related activities. This layer in the Oracle 8i kernel (and beyond) provides the following key facilities, without which an OPS database cannot operate:
- Internode messaging
- Group membership consistency
- Cluster synchronization
- Process grouping, registration, and deregistration
A whole set of cluster communication interfaces and APIs became an internal part of the Oracle code in OPS 8i. GMS provided many services (such as member status and node evictions) that were external in Oracle 8 but were made internal in Oracle 8i.
Distributed Lock Manager (DLM)
The DLM keeps an inventory of all the locks and global enqueues held by all the instances in the OPS database. Its job is to track every lock granted on a resource. Requests from the various instances to acquire and release locks are coordinated by the DLM. The memory structures the DLM needs are allocated out of the shared pool: the lock resources, message buffers, and so on all live in the shared pool of each instance. The DLM is designed to survive node failures, continuing to operate as long as at least one node of the cluster remains up.
The DLM always knows the current holders and requestors of locks and which requests have been granted. If a lock is not available, the DLM queues the lock request and informs the requestor when the lock resource becomes available. Among the resources the DLM manages are data blocks and rollback segments. Oracle resources are associated with DLM locks by instance, using a complex hashing algorithm. The lock and enqueue functionality in OPS is the same as in a single-instance RDBMS server, except that OPS takes a global view.
The DLM relies on the core RDBMS kernel for local locking and enqueue services; what the DLM adds is coordination of locking at the global level, a service the core layers do not provide.
Locking Concepts in Oracle Parallel Server
In an OPS database, users must acquire a lock before they can operate on any resource. This is also applicable to a single-instance scenario. In pure DLM terminology, a resource is any object accessed by users, and a lock is a client operational request of a certain type or mode on that resource.
Parallel Cache Management (PCM) means that the coordination and maintenance of data blocks occur within each data buffer cache (of an instance) so that the data viewed or requested by users is never inconsistent or incoherent. The access to data is controlled via the PCM framework using data blocks with global coordinated locks. In simple terms, PCM ensures that only one instance in a cluster can modify a block at any given time. Other instances have to wait.
Broadly speaking, locks in OPS are either PCM locks or non-PCM locks. PCM locks almost exclusively protect data blocks, whereas non-PCM locks control access to data files, control files, the data dictionary, and so on. PCM locks are statically allocated in OPS via init.ora parameters (such as GC_FILES_TO_LOCKS), whereas non-PCM locks are allocated dynamically, with their numbers controlled by other init.ora parameters.
Locks on PCM resources are referred to as “lock elements” and non-PCM locks are called “enqueues.” DLM locks are acquired on a resource and are typically granted to a process. PCM locks and row-level locks operate independently.
PCM Lock and Row Lock Independence
PCM locks and row locks operate independently. An instance can disown a PCM lock without affecting row locks held in the set of blocks covered by the PCM lock. A row lock is acquired during a transaction. A database resource such as a data block acquires a PCM lock when it is read for update by an instance. During a transaction, a PCM lock can therefore be disowned and owned many times if the blocks are needed by other instances.
In contrast, transactions do not release row locks until changes to the rows are either committed or rolled back. Oracle uses internal mechanisms for concurrency control to isolate transactions so modifications to data made by one transaction are not visible to other transactions until the transaction modifying the data commits. The row lock concurrency control mechanisms are independent of parallel cache management: concurrency control does not require PCM locks, and PCM lock operations do not depend on individual transactions committing or rolling back.
IDLM lock modes and Oracle lock modes are not identical, although they are similar. In OPS, locks can be local or global, depending on the type of request and operations. Just as in a single instance, locks take the form
<Type, ID1, ID2>
where Type consists of two characters, and ID1 and ID2 are values dependent on the lock type. The ID is a 4-byte, positive integer.
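This naming scheme can be modeled as a simple tuple. The sketch below is purely illustrative (the `LockName` class is not an Oracle structure); the TX example uses the commonly documented encoding in which ID1 packs the undo segment and slot numbers and ID2 carries the wrap (sequence) number:

```python
# Illustrative model of an Oracle lock name <Type, ID1, ID2>.
# Type is a two-character code; ID1/ID2 are 4-byte positive integers
# whose meaning depends on the lock type.
from typing import NamedTuple

class LockName(NamedTuple):
    lock_type: str  # two-character type code, e.g. "TX", "CF"
    id1: int        # 4-byte positive integer, type-dependent
    id2: int        # 4-byte positive integer, type-dependent

    def __str__(self) -> str:
        return f"<{self.lock_type}, {self.id1}, {self.id2}>"

# A TX enqueue name: ID1 packs the undo segment number (high 16 bits)
# and slot number (low 16 bits); ID2 is the wrap number.
tx = LockName("TX", (5 << 16) | 3, 1042)
print(tx)  # <TX, 327683, 1042>
```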
Local locks can be divided into latches and enqueues, both of which serve instance-local operations. A shared pool latch is a simple example of a local latch, irrespective of whether OPS is present. Enqueues can be local or global: they take on a global role in an OPS environment and remain local in a single instance. A TX (transaction) enqueue, a control file (CF) enqueue, a DFS (Distributed File System) enqueue, and a DML/table lock are examples of enqueues that are global in an OPS database but local in a single-instance database. Similarly, data dictionary and library cache locks are global in an OPS environment.
Local locks provide transaction isolation or row-level locking. Instance locks provide for cache coherency while accessing shared resources. GV$LOCK and GV$LOCK_ELEMENT are important views that provide information on global enqueues and instance locks.
Originally, two background processes—Lock Manager Daemon (LMD) and Lock Monitor (LMON)—implemented the DLM (refer to Figure 2-6). Each instance had its own set of these two processes. The DLM database stored information on resources, locks, and processes. In Oracle 9i, the DLM was renamed Global Cache Services (GCS) and Global Enqueue Services (GES).
DLM Lock Compatibility Matrix
Every resource in an Oracle RAC environment is identified by its unique resource name. Each resource can potentially have a list of locks currently granted to users. This list is called the “Grant Q.” Locks that are in the process of converting or waiting to be converted from one mode to another are placed on the “Convert Q” of that resource. For each lock, a resource structure exists in memory that maintains a list of owners and converters. Each owner, waiter, and converter has a lock structure, as shown in Table 1.
TABLE 1. DLM Lock Compatibility Matrix
Every node has directory information for a set of resources it manages. To locate a resource, the DLM uses a hashing algorithm based on the name of the resource to find out which node holds the directory information for that resource. Once this is done, a lock request is given directly to this “master” node. The directory area is nothing but a DLM memory structure that stores information about which node masters which blocks.
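The directory lookup described above can be sketched as follows. This is a minimal illustration, assuming a stable hash over the resource name and a fixed node count; CRC32 is a stand-in, as the DLM's actual hashing algorithm is internal and more elaborate:

```python
import zlib

def directory_node(resource_name: str, num_nodes: int) -> int:
    """Return the node holding directory information for a resource.

    CRC32 is used here as a stand-in for the DLM's internal hash; any
    stable hash that every node computes identically would serve, since
    all nodes must agree on which node masters a given resource.
    """
    return zlib.crc32(resource_name.encode()) % num_nodes

# Every node computes the same master for the same resource name,
# so a lock request can be sent directly to that master node.
print(directory_node("<BL, 4, 130>", 2))
```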
The traditional lock-naming conventions (such as SS, SX, X) are provided in Table 2 along with the corresponding DLM mode.
TABLE 2. Conventional Naming vs. DLM Naming
| Conventional name | DLM name | Meaning |
|---|---|---|
| NL | NL | null mode |
| SS | CR | concurrent read mode |
| SX | CW | concurrent write mode |
| S | PR | protected read mode |
| SSX | PW | protected write mode |
| X | EX | exclusive mode |
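Expressed in these DLM mode names, the compatibility matrix of Table 1 follows the classic VMS-style DLM rules. A sketch in Python, with each mode mapped to the set of modes it can coexist with:

```python
# DLM lock mode compatibility (classic VMS-style matrix).
# A requested mode can be granted alongside a held mode only if the
# pair appears here as compatible.
MODES = ["NL", "CR", "CW", "PR", "PW", "EX"]

COMPATIBLE = {
    "NL": {"NL", "CR", "CW", "PR", "PW", "EX"},  # null: compatible with all
    "CR": {"NL", "CR", "CW", "PR", "PW"},        # concurrent read
    "CW": {"NL", "CR", "CW"},                    # concurrent write
    "PR": {"NL", "CR", "PR"},                    # protected read (shared)
    "PW": {"NL", "CR"},                          # protected write
    "EX": {"NL"},                                # exclusive: only null
}

def is_compatible(held: str, requested: str) -> bool:
    return requested in COMPATIBLE[held]

print(is_compatible("PR", "PR"))  # True: shared readers coexist
print(is_compatible("PW", "EX"))  # False: exclusive excludes a writer
```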
Lock Acquisition and Conversion
As discussed previously, locks granted on a resource reside on its Grant Q; a process owns a lock on a resource in a compatible mode only once that lock has been placed on the Grant Q. A lock can be acquired if there are no converters waiting and the mode the Oracle kernel requires is compatible with the modes already held by others. Otherwise, the request waits on the Convert Q until the resource becomes available. When a lock is released or converted, the waiting converters are rechecked to see whether they can now be granted.
Converting a lock from one mode to another occurs when a new request arrives for a resource that already has a lock on it. Conversion is the process of changing a lock from the mode currently held to a different mode. Even if the mode is NULL, it is considered as holding a lock. Conversion takes place only if the mode required is a subset of the mode held or the lock mode is compatible with the modes already held by others, according to a conversion matrix within the IDLM.
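The grant test described above can be sketched as follows. The queue handling is heavily simplified and the names are illustrative, not the kernel's; the compatibility sets are the classic VMS-style DLM matrix:

```python
from collections import deque

# Classic VMS-style DLM mode compatibility, used for the grant test.
COMPATIBLE = {
    "NL": {"NL", "CR", "CW", "PR", "PW", "EX"},
    "CR": {"NL", "CR", "CW", "PR", "PW"},
    "CW": {"NL", "CR", "CW"},
    "PR": {"NL", "CR", "PR"},
    "PW": {"NL", "CR"},
    "EX": {"NL"},
}

class Resource:
    """One DLM resource with its Grant Q and Convert Q (simplified)."""
    def __init__(self, name):
        self.name = name
        self.grant_q = {}         # requester -> granted mode
        self.convert_q = deque()  # (requester, wanted mode), FIFO

    def request(self, requester, mode):
        # Grant only if no converters are waiting and the mode is
        # compatible with every mode already held by others.
        if not self.convert_q and all(
            mode in COMPATIBLE[held]
            for owner, held in self.grant_q.items() if owner != requester
        ):
            self.grant_q[requester] = mode
            return "granted"
        self.convert_q.append((requester, mode))
        return "queued on Convert Q"

res = Resource("<BL, 4, 130>")
print(res.request("p1", "PR"))  # granted
print(res.request("p2", "PR"))  # granted (PR is compatible with PR)
print(res.request("p3", "EX"))  # queued on Convert Q
```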
Processes and Group-Based Locking
When lock structures are allocated in the DLM memory area, the operating system process ID (PID) of the requesting process is the key identifier for the requestor of the lock. Mapping a process to a session is easy inside Oracle, and the information is available in V$SESSION. However, for certain clients, such as Oracle Multithreaded Server (MTS) and Oracle XA, a single process may own many transactions, and the sessions making up a single transaction may migrate across many processes. The PID alone can then no longer identify a transaction and its origin. Hence, lock identifiers were designed to carry session-based information, with the transaction ID (XID) provided by the client when lock requests are made to the DLM.
Group-based locking is used instead of process-based locking. It is preferred particularly when MTS is involved: under process-based locking, a shared server would effectively be unavailable to other sessions while it held locks. From Oracle 9i onward, process-based locking no longer exists; Oracle 8i OPS (and later) uses group-based locking irrespective of the kind of transaction. As mentioned, a process within a group identifies itself with the XID before asking Oracle for any transaction locks.
The DLM maintains information about the locks on all nodes that are interested in a given resource. The DLM nominates one node to manage all relevant lock information for a resource; this node is referred to as the “master node.” Lock mastering is distributed among all nodes.
Using the Interprocess Communications (IPC) layer enables the distributed component of the DLM to share the load of mastering (administering) resources. As a result, a user can lock a resource on one node but actually end up communicating with the LMD processes on another node. Fault tolerance requires that no vital information about locked resources be lost, regardless of how many DLM instances fail.
Communication between the DLM processes (LMON, LMD) across instances is implemented using the IPC layer across the high-speed interconnect. To convey the status of a lock resource, the DLM uses asynchronous traps (ASTs), which are implemented as interrupts in the OS handler routines. Purists may differ on the exact meaning of AST and the way it is implemented (using interrupts or other blocking mechanism), but as far as OPS and Oracle RAC are concerned, it is an interrupt. AST can be a blocking AST or an acquisition AST.
When a process requests a lock on a resource, the DLM sends a blocking asynchronous trap (BAST) to all processes that currently own a lock on that same resource. If possible and necessary, the holders of the lock may relinquish it and allow the requester to access the resource. An acquisition AST (AAST) is then sent by the DLM to the requestor to inform it that it now owns the resource (and the lock). An AAST is generally regarded as a “wakeup call” for a process.
How Locks Are Granted in a DLM
To illustrate how locking works in OPS’s DLM, consider a simple two-node cluster with a shared disk array:
- Process p1 needs to modify a data block on instance 1. Before the block can be read into the buffer cache on instance 1, p1 needs to check whether a lock exists on that block.
- A lock may or may not exist on this data block, and hence the LCK process checks the SGA structures to validate the buffer lock status. If a lock exists, LCK has to request that the DLM downgrade the lock.
- If a lock does not exist, a lock element (LE) has to be created by LCK in the local instance, and the role is local.
- LCK must request the DLM for the LE in exclusive mode. If the resource is mastered by instance 1, DLM continues processing. Otherwise, the request must be sent to the master DLM in the cluster.
- Assuming the lock is mastered on instance 1, the DLM on this instance does a local cache lookup in its DLM database and finds that a process on instance 2 already has an exclusive (EX) lock on the same data block.
- DLM on instance 1 sends out a BAST to DLM on instance 2 requesting a downgrade of the lock. DLM on instance 2 sends another BAST to LCK on the same instance to downgrade the lock from EX to NULL.
- The process on instance 2 may have updated the block and may not have committed the changes. The database writer (DBWR) is signaled to write the block out to disk. After the write is confirmed, the LCK on instance 2 downgrades the lock to NULL and sends an AAST to the DLM on the same instance.
- DLM on instance 2 updates its local DLM database about the change in lock status and sends an AAST to DLM on instance 1.
- The master DLM on instance 1 updates the master DLM database about the new status of the lock (EX), which can now be granted to the process on its instance. DLM itself upgrades the lock to EX.
- DLM on instance 1 now sends another AAST to the local LCK process informing it about the lock grant and that the block can be read from disk.
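The message sequence in the steps above can be laid out as a simple trace. The participant names and message payloads below are illustrative shorthand for the walkthrough, not actual Oracle message formats:

```python
# Simplified message trace for the two-node lock-downgrade walkthrough.
# Each tuple: (sender, receiver, message). "@1"/"@2" denote instances.
trace = [
    ("LCK@1",  "DLM@1",  "request EX on block"),
    ("DLM@1",  "DLM@2",  "BAST: downgrade EX -> NULL"),
    ("DLM@2",  "LCK@2",  "BAST: downgrade EX -> NULL"),
    ("LCK@2",  "DBWR@2", "write dirty block to disk"),
    ("LCK@2",  "DLM@2",  "AAST: lock now NULL"),
    ("DLM@2",  "DLM@1",  "AAST: downgrade complete"),
    ("DLM@1",  "LCK@1",  "AAST: EX granted, read block from disk"),
]

for sender, receiver, message in trace:
    print(f"{sender:8s} -> {receiver:8s} {message}")
```

Note that the block itself still travels through the shared disk; only the lock traffic crosses the interconnect, which is precisely what Cache Fusion later改 improved on.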
Cache Fusion Stage 1, CR Server
OPS 8i introduced Cache Fusion Stage 1. Before 8i (version 8.1), cache coherency was maintained using the disk (the ping mechanism). Cache Fusion introduced a new background process called the Block Server Process (BSP). The main responsibility of BSP was to ship a consistent read (CR) version of a block (or blocks) across instances in a read/write contention scenario. The shipping was done over the high-speed interconnect rather than through the disk. This was called Cache Fusion Stage 1 because it was not yet possible to transfer all types of blocks to the requesting instance, especially in the write/write contention scenario.
Cache Fusion Stage 1 laid the foundation for Oracle RAC Cache Fusion Stage 2, in which both types of blocks (CR and CUR) can be transferred using the interconnect, although a disk ping is still required in some circumstances.
Oracle 8i also introduced the GV$ views, or “global views.” With the help of GV$ views, DBAs could see cluster-wide database and other statistics from any node/instance of the cluster. This was of enormous help, because previously DBAs had to piece together data collected on multiple nodes to analyze all the statistics. GV$ views carry the INST_ID column to support this functionality.
Block contention occurs when processes on different instances need access to the same block. If a block is being read by instance 1 (or is in instance 1's buffer cache in read mode) and a process on instance 2 requests the same block in read mode, read/read contention results. This is the simplest case and is easily handled, because the block has not been modified: a copy is shipped by BSP from instance 1 to instance 2, or read from disk by instance 2, without any need to apply undo to obtain a consistent version. In fact, PCM coordination is not required in this situation.
Read/write contention occurs when instance 1 has modified a block in its local cache, and instance 2 requests the same block for a read. In Oracle 8i, using Cache Fusion Stage 1, instance locks are downgraded, and the BSP process builds a CR copy of the block using the undo data stored in its own cache and ships the CR copy across to the requesting instance. This is done in coordination with DLM processes (LMD and LMON).
If the requesting instance (instance 2) needs to modify the block that instance 1 has already modified, instance 1 has to downgrade the lock, flush the log entries (if not already done), and then write the data block to disk. This is called a “ping.” Data blocks are pinged only when more than one instance needs to modify the same block, forcing the holding instance to write the block to disk before the requesting instance can read it into its own cache for modification. Disk pings can be expensive for applications in terms of performance.
A false ping occurs when a block is written to disk even though that block itself is not being requested by another instance; instead, a different block covered by the same lock element is being requested by another instance.
A soft ping occurs when a lock element needs to be down-converted due to a request of the lock element by another instance, and the blocks covered by the lock are already written to disk.
Write/write contention occurs when both instances have to modify the same block. As explained, the disk ping mechanism kicks in: the lock is downgraded on instance 1 and the block is written to disk, after which instance 2 acquires the exclusive lock on the buffer and modifies it.
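The three ping variants can be summarized in a small classifier. This is a sketch under the definitions above; the function name and boolean inputs are illustrative, not Oracle terminology:

```python
def classify_ping(requested_block_held, covered_blocks_dirty):
    """Classify a lock-element down-convert (simplified).

    requested_block_held -- the remote instance asked for the very
                            block this instance holds
    covered_blocks_dirty -- blocks under this lock element still need
                            to be written to disk
    """
    if not covered_blocks_dirty:
        return "soft ping"         # down-convert only, no disk write needed
    if requested_block_held:
        return "true (hard) ping"  # the contested block is written to disk
    return "false ping"            # write forced by a sibling block under
                                   # the same lock element

print(classify_ping(requested_block_held=True,  covered_blocks_dirty=True))   # true (hard) ping
print(classify_ping(requested_block_held=False, covered_blocks_dirty=True))   # false ping
print(classify_ping(requested_block_held=True,  covered_blocks_dirty=False))  # soft ping
```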