Wednesday, November 3, 2021

Oracle Exadata x8m and the PMEMCache and PMEMLog

Oracle Exadata X8M provides Optane persistent memory (PMEM) in two ways, PMEMCache and PMEMLog, both of which are configured automatically when the Exadata software is installed.  Keep in mind that to take full advantage of PMEMCache and PMEMLog you need to be running Oracle Database 19c or above on an Exadata X8M with the RoCE network.   For databases below 19c, even with the RoCE network, Exadata accesses the persistent memory via the pre-existing Exadata I/O path to the storage cells, which can provide some improvement; the maximum performance advantage comes with 19c and above databases, where Exadata with RoCE uses Remote Direct Memory Access (RDMA) to reach the persistent memory on the storage cells.  The good news is there is no additional configuration, application change, or other action you need to take to get the advantages of the new PMEMCache and PMEMLog on the Exadata.
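
If you want to verify the configuration on a storage server, the PMEM cache and PMEM log objects can be listed with CellCLI.  The cell name and sizes below are illustrative, not from a real system:

CellCLI> LIST PMEMCACHE DETAIL
         name:                   exa01celadm01_PMEMCACHE
         size:                   1.5T
         status:                 normal

CellCLI> LIST PMEMLOG DETAIL
         name:                   exa01celadm01_PMEMLOG
         size:                   960M
         status:                 normal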

 

For persistent memory the data access speed is slower than DRAM and faster than SSD.  As a rough comparison (the figures are approximate and meant to illustrate the scale of the gains): DRAM is about 80-100 nanoseconds, Flash SSD storage is around 200 microseconds, and spinning hard disk is between 1-9 milliseconds, while persistent memory access time sits at about 300 nanoseconds. This puts persistent memory at roughly 3x slower than DRAM, but about 600x faster than Flash SSD and over 3,000x faster than spinning hard disk.  This allows Exadata to serve many reads at faster-than-flash speed by adding a tier after DRAM and before the Flash Cache.

 

The persistent memory is automatically replicated across storage servers; this adds multi-path access to all data in the persistent memory, and also adds a large layer of resiliency.  The persistent memory is also only accessible via Oracle databases, so database access controls apply; using the persistent memory via the OS or local access is not possible.  By only allowing the database to access the PMEM you can be assured the data is secure, as the database controls access and maintains the consistency of the data.  Exadata hardware monitoring and fault management is performed via ILOM and includes the persistent memory hardware modules.  When the time comes to remove the Exadata or reinstall the storage servers, secure erase runs automatically on the persistent memory modules, ensuring that no data remains after a deinstall or reinstall.

 

The persistent memory (PMEM) PMEMCache adds a storage tier between DRAM (local server memory) and Flash (flashcache). The Exadata X8M adds 1.5 TB of persistent memory to each High Capacity and Extreme Flash Storage Server. The persistent memory enables reads at near local memory speed, ensures writes survive any power failure that may occur, and can be accessed by a 19c database from all database servers in the Exadata rack.

 

The Exadata X8M Storage Servers transparently manage the persistent memory in front of flash.  Exadata Database Servers running Oracle Database 19c or above access the Optane persistent memory (PMEM) in the Exadata Storage Servers directly, which is made possible by the 100 Gb/s RDMA over Converged Ethernet (RoCE) switches; the RDMA path bypasses the network and I/O software stack, the storage controller, interrupts, and context switches, delivering ≤ 19µs latency and as much as 16 million 8K IOPS per rack.  Many database functions and all storage functions are handled by the Exadata Storage Servers, freeing up resources on the Exadata Database Servers and improving performance.

 

Database Server (Database 19c or Above)
                 |
               PMEM   -> Really Hot
                 |
              FLASH   -> Hot
                 |
               DISK   -> Colder

 

In Oracle Database 19c the database can put redo directly on the persistent memory (PMEM) PMEMLog on multiple storage servers, using RDMA over the 100 Gb/s converged Ethernet (RoCE) network.  Keep in mind this is not the database’s entire redo log; it only contains the most recently written records.  Since the database uses RDMA to write the redo, redo log writes are up to 8x faster. And since the redo goes to PMEM, and PMEM is duplicated on multiple storage servers, it provides resilience for the redo.  So, for example, if your database sees high log file sync waits at times, this could help with that issue.  When on Exadata X8M with Oracle Database 19c, it is not recommended to have the storage cells in write-back mode, because with PMEMLog in use the redo would be written to both.
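
As a quick way to check whether log file sync is a significant wait for your database, you could query the standard V$SYSTEM_EVENT view before and after a busy period (output will obviously vary by system):

SQL> SELECT event, total_waits, time_waited_micro
       FROM v$system_event
      WHERE event = 'log file sync';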

 

Oracle Database 19c and above has AWR statistics from the various Exadata storage server components, which include the persistent memory (PMEM) PMEMCache and PMEMLog in addition to Smart I/O, Smart Flash Cache, and Smart Flash Log. It includes I/O statistics from both the operating system and the cell server software, and it will perform outlier analysis using those I/O statistics. Statistics for the PMEM cache are different because the database issues RDMA I/O directly to the PMEM cache and does not go through cellsrv; the storage server does not see the RDMA I/Os over RoCE, so there are no cell metrics for PMEM cache I/O. Instead, Oracle Database statistics account for the I/O that is performed using RDMA.
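
Because the RDMA I/O is only visible from the database side, one place to look (in addition to AWR) is V$SYSSTAT; a wildcard search avoids depending on exact statistic names, which can vary by version:

SQL> SELECT name, value
       FROM v$sysstat
      WHERE lower(name) LIKE '%pmem%'
      ORDER BY name;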

 

** AWR Report Examples are from Oracle documentation found here:

https://docs.oracle.com/en/engineered-systems/exadata-database-machine/sagug/exadata-storage-server-monitoring.html#GUID-0A5BBB1E-1C0D-4D6A-BFF9-AD2931B913CB

 

The AWR report will have a section that shows the PMEM cache configuration; the example shows the PMEM cache configured in write-through mode.

 


The AWR report will report on PMEM Cache space usage as a summary as well as in detail per storage cell.


The section on PMEM Cache Internal Writes covers the RDMA writes made to the PMEM that populate the PMEM Cache.


We can also see PMEM information from each storage cell using the CellCLI command-line utility. For example:

 


CellCLI> LIST METRICDEFINITION ATTRIBUTES NAME,DESCRIPTION  WHERE OBJECTTYPE = "PMEMCACHE";

         PC_BY_ALLOCATED        "Number of megabytes allocated in PMEM cache"

 

CellCLI> list metriccurrent where name = 'PC_BY_ALLOCATED' ;

         PC_BY_ALLOCATED         PMEMCACHE       1,541,436 MB

 

 

CellCLI> list metriccurrent where name = 'DB_PC_BY_ALLOCATED' ;

         DB_PC_BY_ALLOCATED      ASM                    0.000 MB

         DB_PC_BY_ALLOCATED      DT4DB1               802,271 MB

         DB_PC_BY_ALLOCATED      DT4DB2                14,096 MB

         DB_PC_BY_ALLOCATED      DT4DB3               141,912 MB

         DB_PC_BY_ALLOCATED      DT4DB4               154,608 MB

         DB_PC_BY_ALLOCATED      DT4DB5               426,958 MB

         DB_PC_BY_ALLOCATED      DT4DB6                 1,506 MB

         DB_PC_BY_ALLOCATED      _OTHER_DATABASE_      76.125 MB

 

CellCLI> list metriccurrent where name like '.*PC.*';

         DB_PC_BY_ALLOCATED      ASM                     0.000 MB

         DB_PC_BY_ALLOCATED      DT4ARIES                802,465 MB

         DB_PC_BY_ALLOCATED      DT4ETLSTG               14,114 MB

         DB_PC_BY_ALLOCATED      DT4MPI                  141,867 MB