Root Cause Analysis for z Systems Performance – Down the Rabbit Hole

By Morgan Oats

Finding the root cause of z Systems performance issues may often feel like falling down a dark and endless rabbit hole. There are many paths you can take, each leading to further possibilities, but clear indicators as to where you should really be heading to resolve the problem are typically lacking. Performance experts tend to rely on experience to judge where the problem most likely is, but this may not always be adequate, and in the case of disruptions, time is money.

Performance experts with years of experience are typically able to resolve problems faster than newer members of the performance team. But with the performance and capacity skills gap the industry is experiencing, an approach is needed that doesn’t require decades of experience.

Rather than aimlessly meandering through mountains of static reports, charts, and alerts that do more to overwhelm our senses than assist in root cause analysis, performance experts need a better approach. An approach that not only shines a light down the rabbit hole, but tells us which path will lead us to our destination. Fortunately, IntelliMagic Vision can be your guide.

Continue reading

5 Reasons IBM z/OS Infrastructure Performance & Capacity Problems are Hard to Predict and Prevent

By Brent Phillips

Solving z/OS infrastructure performance and capacity problems is difficult. Getting ahead of performance and capacity problems before they occur and preventing them is more difficult still. This is why it takes years, and decades even, for performance analysts and capacity planners to become experts.

This difficulty in becoming an expert, together with the rapid retirement of the current experts, is why the performance and capacity discipline for the mainframe is experiencing a significant skills gap. It is simply too difficult and time consuming to understand what the data means for availability, let alone derive predictive intelligence about upcoming production problems within the complex IBM z Systems infrastructure.

The primary root causes of this performance and capacity management problem are:

Continue reading

Platform-Specific Views: Multi-Vendor SAN Infrastructure Part 2

By Brett Allison

Each distributed system platform has unique nuances. In Part 1 of this blog, I demonstrated how a single view for managing your multi-vendor SAN infrastructure helps ensure performance and provides an understanding of overall health and capacity. Equally important to these common views is a solution that can gather the detailed performance data needed to support vendor-specific architectures.

New storage system platforms are popping up every year, and it’s impossible to stay ahead of all of them and provide the detailed, intelligent performance views necessary to manage your SAN infrastructure and prevent incidents. However, IntelliMagic Vision supports a wide variety of SAN platforms for which we provide our end-to-end capabilities.

Continue reading

A Single View: Multi-Vendor SAN Infrastructure Part 1

By Brett Allison

One of the benefits of a SAN system is the fact that it is an open system. It’s always ready to communicate with other systems, and you can add storage and infrastructure from many different vendors as it suits your business and performance needs. However, just like a calculated job interview response, this strength can also be a weakness. Even if your distributed systems can communicate with each other, it’s likely that your performance management solution is less “open” in this regard.

To properly manage the performance, connections, and capacity of your distributed system, you need something better than a bunch of vendor point solutions. You need to be able to manage your entire SAN infrastructure in a single view – otherwise the cost and hassle of having different performance solutions is not worth the benefits.

Continue reading

The Roots and Evolution of the RMF and SMF for Mainframe Performance Data (Part 2)

By George Dodson

This is Part 2 of this blog. If you haven’t read Part 1, you can read it here.

After being announced as a product in 1974, RMF was further expanded to provide more capabilities such as RMF Monitor 2 and RMF Monitor 3. These provided real-time insight into the internal workings of z/OS to help understand and manage the performance of the z/OS infrastructure. The value of the RMF performance measurement data has been proven over the decades, as it, or a compatible product from BMC named CMF, is used in every mainframe shop today. Many new record types have been added in recent years as the z/OS infrastructure capabilities continue to evolve.

A related product – Systems Management Facility or SMF – was originally created to provide resource usage information for chargeback purposes. SMF captured application usage statistics, but was not always able to capture the entire associated system overhead. Eventually, SMF and RMF were expanded to capture detailed statistics about all parts of the mainframe workloads and infrastructure operation, including details about third party vendor devices such as storage arrays. RMF and SMF now generate what is likely the most robust and detailed performance and configuration data of any commercial computing environment in the data center.

As the data sources to report on the performance of the workloads and the computer infrastructure grew, different performance tools were created to display and analyze the data. The information in the data was very complex and the total amount of data captured is overwhelming, creating challenges to identify performance problems. Typically, this requires analysts who have extensive insight into the specific infrastructure areas being analyzed, and an understanding of how they respond to different applications workloads. As applications have grown more complex, more real-time, with more platforms and components involved, the performance analysis task also has become more difficult.

Continue reading

The Roots and Evolution of the RMF and SMF for Mainframe Performance Data (Part 1)

George Dodson

By George Dodson

This blog originally appeared as an article in Enterprise Executive.

Computer professionals have been interested in determining how to make computer applications run faster and determine the causes of slow running applications for more than 50 years. In the early days, computer performance was in some ways easy because electronic components were soldered in place. To understand what was happening at any point in the circuitry, we simply attached a probe and examined the electronic wave information on an oscilloscope.

Eventually, we were able to measure activity at key points in the computer circuitry to determine things like CPU utilization, channel utilization, and input/output response times. However, this method still had many shortcomings. First, the number of probes was very small, usually fewer than 40. Second, this method gave no insight into operating system functions or application operations that might be causing tremendous overhead. And of course, when integrated circuits were developed, the probe points went away.

In 1966 I joined an IBM team that was focusing on a better way to conduct benchmarks in what was then named an IBM Systems Center. Customers considering computer upgrades would come to our data center to determine how their programs would operate on newly released hardware. But it was simply not possible to host every customer in this way.

Continue reading

RPO Replication for TS7700 Disaster Recovery

By Merle Sadler

This blog discusses the impact of a zero Recovery Point Objective (RPO) on Mainframe Virtual Tape replication, focusing on the IBM TS7700 replication capability.

Have you ever thought about how much money you will need to save for retirement? I was talking with my financial advisor the other day and decided that whatever you think you need, you should double it. You can plan on having Social Security, but if Social Security fails, then retirement plans start to look not so rosy.

Budget tradeoffs for RTO and RPO

The same thing applies to computer systems. Customers spend a lot of time and money on disk replication, reducing both RPO and RTO. But what if an application corrupts the data or a virus is uploaded? Corrupted or infected data is replicated just as easily as good data. This leads to making offline backup copies of disk files, which also need to be replicated.

Continue reading

6 Signs You Already Have a Skills Gap for z/OS Performance and Capacity Planning

By Brent Phillips

The mainframe skills gap is a well-known issue, but most of the focus is on mainframe application development. A large z/OS mainframe organization may have thousands of application developers but only 20 or fewer performance & capacity planning staff. Even though fewer in number, these IT staff have an outsized impact on the organization.

The problem, however, is not just about recruiting new IT staff members to the team. The road to becoming a true z/OS performance and capacity (perf/cap) expert is far longer and more difficult than what is necessary for a programmer to learn to code in a mainframe programming language like COBOL. Consequently, it is not feasible to fill the performance and capacity planning gap with new recruits, and recruiting experienced staff from the short supply is difficult. Even teams that have all the headcount positions filled very often exhibit at least some of the signs that they are being negatively impacted by insufficient levels of expert staff.

A primary contributor to the problem is the antiquated way of understanding the RMF and SMF performance data that most sites still use. The way this data is processed and interpreted not only makes it difficult for new IT staff to learn the job, but it also makes the job for the existing experts more difficult and time consuming.

Here are six signs that indicate your z/OS performance and capacity team would benefit by modernizing analytics for your infrastructure performance and configuration data.

Continue reading

z/OS Performance Monitors – Why Real-Time is Too Late

By Morgan Oats

Real-time z/OS performance monitors are often advertised as the top tier of performance management. Real-time monitoring means just that: system and storage administrators can view performance data and/or alerts indicating service disruptions continuously as they happen. In theory, this enables administrators to quickly fix the problem. For some companies, service disruptions may not be too serious if they are resolved quickly enough. Even though those disruptions could be costing them a lot more than they think, they believe a real-time monitor is the best they can do to meet their business needs.

For leading companies, optimal z/OS performance is essential for day-to-day operations: banks with billions of transactions per day; global retailers, especially on Black Friday or Cyber Monday; government agencies and insurance companies that need to support millions of customers at any given time; transportation companies with 24/7 online delivery tracking; the list goes on and on. For these organizations and many others, real-time performance information is, in fact, too late. They need information that enables them to prevent disruptions – not simply tell them when something is already broken.

Continue reading

Dragging the Right Information Out of SMF/RMF/CMF for z/OS Disk Performance Analysis

By Dave Heggen

Internal processing in IntelliMagic Vision is performed on a Sysplex boundary. We want the SMF data from all LPARs in a Sysplex, and if multiple Sysplexes attach to the same hardware, then we want those Sysplexes together in the same interest group. By processing the data in this manner, an interest group provides an accurate representation of the hardware’s perspective of activity and allows an evaluation of whether this activity is below, equal to, or above the hardware’s capability. It’s also true that the shorter the interval, the more accurately the data will show peaks and lulls. The shortest interval you can define is 1 minute, which would typically be the average of 60 samples (1 cycle per second). It’s always a balancing act between the accuracy of the data and the size/cost of storing and processing the data.
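
To make that balancing act concrete, here is a small hypothetical sketch (the utilization numbers and interval lengths are invented, and this is not IntelliMagic Vision code) showing how averaging per-second samples into longer intervals flattens short peaks:

    # Hypothetical sketch of the interval trade-off described above.
    import random

    random.seed(42)
    samples = [30 + random.random() * 20 for _ in range(900)]  # 15 minutes of 1/sec samples
    samples[400:430] = [95.0] * 30                             # a 30-second burst of activity

    def interval_averages(per_second, interval_seconds):
        """Average consecutive per-second samples into fixed-length intervals."""
        return [
            sum(per_second[i:i + interval_seconds]) / interval_seconds
            for i in range(0, len(per_second), interval_seconds)
        ]

    print(f"  1s samples -> highest value {max(samples):5.1f}%")
    for length in (60, 300, 900):  # 1-minute, 5-minute and 15-minute intervals
        peak = max(interval_averages(samples, length))
        print(f"{length:>4}s intervals -> highest reported value {peak:5.1f}%")
    # The 30-second burst to 95% busy is already smoothed at 1-minute granularity
    # and nearly disappears at 15 minutes: accuracy versus data volume.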

Continue reading

Finding Hidden Time Bombs in Your VMware Connectivity

By Brett Allison

Do you have any VMware connectivity risks? Chances are you do. Unfortunately, there is no easy way to see them. That’s because seeing the real end-to-end risks from the VMware guest through the SAN fabric to the Storage LUN is difficult in practice, as it requires combining many relationships from a variety of sources.

A complete end-to-end picture requires:

  • VMware guests to ESX hosts
  • ESX host initiators to targets
  • ESX hosts and datastores, VM guests and datastores, and ESX datastores to LUNs
  • Zone sets
  • Target ports to host adapters, LUNs, and storage ports

For seasoned SAN professionals, none of this information is very difficult to comprehend. The trick is tying it all together in a cohesive way so you can visualize these relationships and quickly identify any asymmetry.
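
As a minimal sketch of what tying these sources together can look like, assuming a handful of invented guests, initiators, and zones (this is not how IntelliMagic Vision models the data), the mappings can be joined and any host with fewer storage paths than its peers flagged:

    # Minimal sketch with hypothetical data: join the relationships listed above
    # and flag ESX hosts (and the guests on them) with fewer paths than their peers.
    guest_to_host = {"vm01": "esx-a", "vm02": "esx-a", "vm03": "esx-b"}
    host_initiators = {"esx-a": ["wwpn-a1", "wwpn-a2"], "esx-b": ["wwpn-b1"]}
    zones = {  # initiator WWPN -> storage target ports it is zoned to
        "wwpn-a1": ["tgt-1"], "wwpn-a2": ["tgt-2"], "wwpn-b1": ["tgt-1"],
    }

    def paths_per_host():
        """Count initiator->target paths each ESX host has to storage."""
        return {
            host: sum(len(zones.get(ini, [])) for ini in inits)
            for host, inits in host_initiators.items()
        }

    counts = paths_per_host()
    expected = max(counts.values())
    for host, n in sorted(counts.items()):
        if n < expected:
            exposed = [g for g, h in guest_to_host.items() if h == host]
            print(f"Asymmetry: {host} has {n} path(s) vs {expected}; exposed guests: {exposed}")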

Why is asymmetry important? Let’s look at an actual example:

Continue reading

No Budget for an ITOA Performance Management Solution

By Morgan Oats

Every department in every industry has the same problem: how can I stretch my budget to get the necessary work done, make my team more effective, reduce costs, and stay ahead of the curve? This is equally true for performance and capacity planning teams. In many cases, it’s difficult to get budget approval to purchase the right software solution to help accomplish these goals. Management wants to stay under budget while IT is concerned with getting a solution that solves their problems. When trying to get approval for the right solution, it’s important to be able to show how you will get a good return on investment.

Continue reading

Bridging the z/OS Mainframe Performance & Capacity Skills Gap

By Brent Phillips

Many, if not most organizations that depend on mainframes are experiencing the effects of the mainframe skills gap, or shortage. This gap is a result of the largely baby-boomer workforce that is now retiring without a new generation of experts in place who have the same capabilities. At the same time, the scale, complexity, and change in the mainframe environment continues to accelerate. Performance and capacity teams are a mission-critical function, and this performance skills gap represents a great risk to ongoing operations. It demands both immediate attention and a new, more effective approach to bridging the gap.

Continue reading

How Much Flash Do I Need Part 2: Proving the Configuration

By Jim Sedgwick

Before making a costly Flash purchase, it’s always a good idea to use some science to forecast whether the new storage hardware configuration, and especially the costly Flash, will be able to handle your workload. Does your planned purchase provide more performance capacity than you need, so that you aren’t getting your money’s worth? Or, even worse, is your planned hardware purchase too little?

In Part 1 of this blog, we discovered that our customer just might be planning to purchase more Flash capacity than their unique workload requires. In Part 2, we will demonstrate how we were able to use modeling techniques to further understand how the proposed new storage configuration will handle their current workload. We will also project how response times will be affected as the workload increases into the future, as workloads tend to do.
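
For intuition only, a generic single-server queueing approximation illustrates why response times climb sharply as a fixed configuration approaches saturation. The service time, server count, and workload figures below are assumptions, and IntelliMagic’s modeling is far more detailed than this sketch:

    # Back-of-the-envelope sketch only: a crude M/M/1-style projection of
    # response time as the workload grows on a fixed configuration.
    # All numbers are invented for illustration.
    SERVICE_TIME_MS = 0.3     # assumed average back-end service time
    CURRENT_IOPS = 120_000    # assumed current workload
    SERVERS = 64              # treat the array as 64 independent servers

    def projected_response_ms(iops, service_time_ms=SERVICE_TIME_MS, servers=SERVERS):
        """R = S / (1 - rho), applied per server; returns infinity past saturation."""
        s = service_time_ms / 1000.0
        rho = (iops / servers) * s
        return float("inf") if rho >= 1.0 else (s / (1.0 - rho)) * 1000.0

    for growth in (1.00, 1.25, 1.50, 1.75):
        print(f"{growth:.2f}x workload -> ~{projected_response_ms(CURRENT_IOPS * growth):6.2f} ms")
    # Output climbs from well under 1 ms toward roughly 19 ms at 1.75x,
    # showing how quickly headroom disappears near the knee of the curve.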

Continue reading

How Much Flash Do I Need? Part 1

By Jim Sedgwick

Flash, Flash, Flash. It seems that every storage manager has a new favorite question to ask about Flash storage. Do we need to move to Flash? How much of our workload can we move to Flash? Can we afford to move to Flash? Can we afford NOT to move to Flash?

Whether or not Flash is going to magically solve all our problems (it’s not), it’s here to stay. We know Flash has super-fast response times as well as other benefits, but for a little while yet, it’s still going to end up costing you more money. If you subscribe to the notion that it’s good to make sure you only purchase as much Flash as your unique workload needs, read on.

Continue reading

How to Measure the Impact of a Zero RPO Strategy

By Merle Sadler

Have you ever wondered about the impact of zero RPO on Mainframe Virtual Tape for business continuity or disaster recovery? This blog focuses on the impact of jobs using the Oracle/STK VSM Enhanced Synchronous Replication capability while delivering an RPO of 0.

A recovery point objective, or “RPO”, is defined by business continuity planning. It is the maximum targeted time period in which data might be lost from an IT service due to a major incident.
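
As a toy illustration of that definition (the timestamps and target below are invented), the potential data loss for an incident is the gap between the last replicated copy and the incident itself, compared against the RPO target:

    # Toy example of checking an RPO target; timestamps are hypothetical.
    from datetime import datetime, timedelta

    rpo_target = timedelta(minutes=5)                      # assumed business target
    last_replicated_copy = datetime(2017, 6, 1, 9, 52, 0)  # last consistent copy at the DR site
    incident_time = datetime(2017, 6, 1, 10, 0, 0)         # when the outage struck

    data_loss_window = incident_time - last_replicated_copy
    print(f"Potential data loss: {data_loss_window.total_seconds() / 60:.0f} minutes")
    print("RPO met" if data_loss_window <= rpo_target else "RPO missed")
    # With zero RPO (rpo_target = timedelta(0)), only synchronous replication
    # with no lag at the moment of the incident can satisfy the objective.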

Continue reading

The High Cost of “Unpredictable” IT Outages and Disruptions

By Curtis Ryan

It is no secret that IT service outages and disruptions can cost companies anywhere from thousands up to millions of dollars per incident – plus significant damage to company reputation and customer satisfaction. In the most high-profile cases, such as recent IT outages at Delta and Southwest Airlines, the costs can soar to over $150 million per incident (Delta Cancels 280 Flights Due to IT Outage). Quite suddenly, IT infrastructure performance can become a CEO-level issue (Unions Want Southwest CEO Removed After IT Outage).

While those kinds of major incidents make the headlines, there are thousands of lesser-known service level disruptions and outages, still just as disruptive to the business, happening daily in just about every sizeable enterprise.

These incidents, often occurring daily, like an unexpected slowdown in the response time of a key business application during prime shift, can have a significant cumulative financial impact that may not be readily visible in the company’s accounting system.

Continue reading

What’s Using Up All My Tapes? – Using Tape Management Catalog Data

By Dave Heggen

Most of the data processed by IntelliMagic Vision for z/OS Tape is performance, event, or activity driven, obtained from SMF and the virtual tape hardware. Did you know that in addition to the SMF and TS7700 BVIR data, IntelliMagic Vision can also process information from a Tape Management Catalog (TMC)? Having this type of data available and processing it correctly is critical to answering the question “What’s using up all my tapes?”

We’ve all set up and distributed scratch lists. This is a necessary (and generally manual) part of maintaining a current tape library, and it does require participation for compliance. Expiration dates, catalog, and cycle management also have their place in automating the expiration end of the tape volume cycle. This blog is intended to address issues that neither compliance nor automation address.

Continue reading

Clogged Device Drain? Use Your Data Snake!

By Lee LaFrese

Have you ever run into high I/O response times that simply defy explanation? You can’t find anything wrong with your storage to explain why performance is degraded. It could be a classic “slow drain device” condition. Unfortunately, you can’t just call the data plumbers to clean it out! What is a storage handyman to do?

Continue reading

SRM: The “Next” As-a-Service

By Brett Allison

You may have seen this article published by Forbes, stating that Storage Resource Management (SRM) is the “Next as-a-Service.” The benefits cited include the simplicity and visibility provided by as-a-service dashboards and the increasing sophistication through predictive analytics.

IntelliMagic Vision is used as-a-Service for some of the world’s largest companies, and has been since 2013. Although we do much more than your standard SRM by embedding deep expert knowledge into our software, SRM, SPM, and ITOA all fall under our umbrella of capabilities. So, while we couldn’t agree more with the benefits of as-a-service offerings for SRM software, the word “Next” in the article seems less applicable. We might even say: “We’ve been doing that for years!”

IntelliMagic Software as a Service (or Cloud Delivery)

Continue reading

Noisy Neighbors: Finding Root Cause of Performance Issues in IBM SVC Environments

By Jim Sedgwick

At some point or another, we have probably all experienced noisy neighbors, either at home, at work, or at school. There are just some people who don’t seem to understand the negative effect their loudness has on everyone around them.

Our storage environments also have these “noisy neighbors” whose presence or actions disrupt the performance of the rest of the storage environment. In this case, we’re going to take a look at an SVC all-flash storage pool called EP-FLASH_3, where just a few bad LUNs have a profound effect on the I/O experience of the entire IBM Spectrum Virtualize (SVC) environment.
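
As a simplified sketch of the noisy neighbor idea (the LUN names and I/O rates below are invented, not taken from the EP-FLASH_3 pool), ranking the LUNs in a pool by their share of total activity quickly exposes the few that dominate:

    # Hypothetical sketch: rank LUNs in a pool by share of total activity
    # and flag the dominant ones. Names and rates are invented.
    pool_iops = {
        "lun-001": 220, "lun-002": 180, "lun-003": 9500,
        "lun-004": 240, "lun-005": 8700, "lun-006": 160,
    }

    total = sum(pool_iops.values())
    for lun, iops in sorted(pool_iops.items(), key=lambda kv: kv[1], reverse=True):
        share = 100.0 * iops / total
        flag = "  <-- noisy neighbor" if share > 25 else ""
        print(f"{lun}: {iops:6d} IOPS ({share:4.1f}% of pool){flag}")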

Continue reading

Getting the Most out of zEDC Hardware Compression

By Todd Havekost

One of the challenges our customers tell us they face with their existing SMF reporting is keeping up with emerging z/OS technologies. Whenever a new element is introduced in the z infrastructure, IBM adds raw instrumentation for it to SMF. This is of course very valuable, but the existing SMF reporting toolset, often a custom SAS-based program, subsequently needs to be enhanced to support these new SMF metrics in order to properly manage the new technology.

z Enterprise Data Compression (zEDC) is one of those emerging technologies that is rapidly gaining traction with many of our customers, and for good reasons:

  • It is relatively straightforward and inexpensive to implement.
  • It can be leveraged by numerous widely used access methods and products.
  • It reduces disk storage requirements and I/O elapsed times by delivering good compression ratios.
  • The CPU cost is very minimal since almost all the processing is offloaded to the hardware.
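
Because zEDC implements the industry-standard DEFLATE format, software zlib can give a rough first feel for how compressible a candidate data set is before the work is offloaded to the hardware. The sketch below runs Python’s zlib on made-up, record-style sample data purely as an approximation; it does not exercise a zEDC card or predict its exact ratios:

    # Rough compressibility check using zlib as a software stand-in for the
    # DEFLATE format that zEDC accelerates. Sample data is artificial; real
    # ratios depend entirely on your own data.
    import zlib

    rows = [f"{i:010d};{(i * 37) % 100000:012.2f};USD\n".encode() for i in range(5000)]
    sample = b"ACCOUNT;BALANCE;CURRENCY\n" + b"".join(rows)

    compressed = zlib.compress(sample, level=6)
    ratio = len(sample) / len(compressed)
    print(f"Original {len(sample):,} bytes, compressed {len(compressed):,} bytes "
          f"-> about {ratio:.1f}:1")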

Continue reading

Game Changer for z/OS Transaction Reporting

By Todd Havekost

Periodically, a change comes to an industry that introduces a completely new and improved way to accomplish an existing task that had previously been difficult, if not daunting. Netflix transformed the home movie viewing industry by offering video streaming that was convenient, affordable, and technically feasible – a change so far-reaching that it ultimately led to the closing of thousands of Blockbuster stores. We feel that IBM recently introduced a similar “game changer” for transaction reporting for CICS, IMS and DB2.

Continue reading

How to Prevent an “Epic” EMR System Outage

By Curtis Ryan

Protecting the availability of your IT storage is vital for performance, but it can also be critical for life. No one knows this better than the infrastructure departments of major healthcare providers. Application slowdowns or outages in Electronic Medical Record (EMR) systems or Electronic Health Record (EHR) systems – such as Epic, Meditech, or Cerner – can risk patient care, open hospitals up to lawsuits, and cost hundreds of thousands of dollars.

Nobody working in IT storage in any industry wants to get a call about a storage or SAN service outage, but even minor service disruptions can halt business operations until the root cause of the issue is diagnosed and resolved. This kind of time cannot always be spared in the ‘life and death’ environment in which healthcare providers use EMR systems.

Continue reading

The Circle of (Storage) Life

By Lee LaFrese

Remember the Lion King? Simba starts off as a little cub, and his father, Mufasa, is king. Over time, Simba goes through a lot of growing pains but eventually matures to take over his father’s role despite the best efforts of his Uncle Scar to prevent it. This is the circle of life. It kind of reminds me of the storage life cycle only without the Elton John score!

Hardware Will Eventually Fail and Software Will Eventually Work

New storage technologies are quickly maturing and replacing legacy platforms. But will they be mature enough to meet your high availability, high performance IT infrastructure needs?

Continue reading

Compressing Wisely with IBM Spectrum Virtualize

By Brett Allison

 

Compression of data in an IBM SVC/Spectrum Virtualize environment may be a good way to gain back capacity, but there can be hidden performance problems if compressible workloads are not first identified. Visualizing these workloads is key to determining when and where to successfully use compression. In this blog, we help you identify the right workloads so that you can achieve capacity savings in your IBM Spectrum Virtualize environments without compromising performance.

Today, all vendors have compression capabilities built into their hardware. The advantage of compression is that you need less real capacity to service the needs of your users. Compression reduces your managed capacity, directly reducing your storage costs.
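
As a hypothetical back-of-the-envelope sketch (the volume names, ratios, and thresholds are invented), the basic arithmetic of choosing candidates is to skip poorly compressible or otherwise unsuitable workloads and add up the savings from the rest:

    # Illustrative only: estimate capacity savings from compressing candidate
    # volumes, skipping poor fits such as already-compressed or write-heavy data.
    volumes = [
        # (name, allocated GiB, estimated compression ratio, write-intensive?)
        ("db_logs",   2048, 3.10, True),
        ("home_dirs", 4096, 2.40, False),
        ("media",     8192, 1.05, False),   # already-compressed content
    ]

    total_saved = 0.0
    for name, gib, ratio, write_heavy in volumes:
        if ratio < 1.3 or write_heavy:
            print(f"skip {name}: poor candidate (ratio {ratio}, write-heavy={write_heavy})")
            continue
        saved = gib - gib / ratio
        total_saved += saved
        print(f"compress {name}: save ~{saved:.0f} GiB")
    print(f"Estimated total savings: ~{total_saved:.0f} GiB")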

Continue reading

What Good is a zEDC Card?

By Dave Heggen

Informatics Inc.: You Need Our Shrink!

Technologies involving compression have been looking for a home on z/OS for many years. There have been numerous implementations of compression, all with the desired goal of reducing the number of bits needed to store or transmit data. Host-based implementations ultimately trade MIPS for MB; outboard hardware implementations avoid this issue.

Examples of Compression Implementations

The first commercial product I remember was from Informatics, named Shrink, sold in the late 1970s and early 1980s. It used host cycles to perform compression, could generally get about a 2:1 reduction in file size and, in the case of the IMS product, worked through exits so programs didn’t require modification. Sharing data compressed in this manner required accessing the data with the same software that compressed the data to expand it.

Continue reading

How’s Your Flash Doing?

By Joe Hyde

Assessing Flash Effectiveness

How’s your Flash doing? Admittedly, this is a bit of a loaded question. It could come from your boss, a colleague, or someone trying to sell you the next storage widget. Since most customers are letting the vendors’ proprietary storage management algorithms optimize their enterprise storage automatically, you may not have had the time or tools to quantify how your Flash is performing.

The Back-end Activity

First, let’s use the percentage of back-end activity to Flash as the metric to answer this question. Digging a little deeper, we can look at back-end response times for Flash and spinning disks (let’s call these HDD, for Hard Disk Drives). I’ll also look at the amount of sequential activity over the day to help explain the back-end behavior.
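
As a toy illustration of that metric (the interval numbers below are invented, not the study data that follows), the Flash share of back-end activity and the per-tier response times can be computed per interval like this:

    # Made-up interval data: what share of back-end I/O lands on Flash versus
    # HDD, alongside the response time of each tier.
    intervals = [
        # (flash_ops, hdd_ops, flash_resp_ms, hdd_resp_ms)
        (42_000, 18_000, 0.4, 6.5),
        (55_000, 15_000, 0.5, 7.1),
        (30_000, 25_000, 0.4, 5.9),
    ]

    for flash_ops, hdd_ops, flash_ms, hdd_ms in intervals:
        pct_flash = 100.0 * flash_ops / (flash_ops + hdd_ops)
        print(f"Flash share {pct_flash:4.1f}%  |  Flash {flash_ms} ms vs HDD {hdd_ms} ms")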

Below is five weekdays’ worth of data from an IBM DS8870 installed at a Fortune 500 company. Although it’s possible to place data statically on Flash storage in the IBM DS8870, in this case IBM’s Easy Tier is used for the automatic placement of data across Flash and HDD storage tiers. Let’s refer to this scheme generically as auto-tiering. For this IBM DS8870, Flash capacity was roughly 10% of the total storage capacity.

Continue reading

Flash Performance in High-End Storage

By Dr. Cor Meenderinck

This is a summary of the white paper of the same title, which won the Best Paper Award at the 2016 CMG imPACt conference. It is a great example of the research we do that leads to the expert knowledge we embed in our products.

Flash-based storage is revolutionizing the storage world. Flash drives can sustain a very large number of operations and are extremely fast. It is for those reasons that manufacturers eagerly embraced this technology for inclusion in high-end storage systems. As the price per gigabyte of flash storage rapidly decreases, experts predict that flash will soon be the dominant medium in high-end storage.

But how well are they really performing inside your high-end storage systems? Do the actual performance metrics, when Flash is deployed within a storage array, live up to the advertised latencies of around 0.1 milliseconds?

Continue reading