Common Tools for Automated Configuration & Change Management

Notice: The views and opinions expressed here are collectively derived from the members of this ONUG working group and are not the express opinion of any individual or company. This group is the combination of the Common Management Tools Across Network, Storage and Compute and the Change Management/Automated Initial Configuration use case working groups.

SPRING 2016

ONUG Use Case Working Groups Schedule: Common Tools for Automated Configuration & Change Management
Date Time
December 16, 2015 1:00pm-2:00pm ET Conference Call  
January 13, 2016 1:00pm-2:00pm ET Conference Call  
January 27, 2016 1:00pm-2:00pm ET Conference Call  
February 10, 2016 1:00pm-2:00pm ET Conference Call  
February 24, 2016 1:00pm-2:00pm ET Conference Call  
March 9, 2016 1:00pm-2:00pm ET Conference Call  
March 23, 2016 1:00pm-2:00pm ET Conference Call  
April 6, 2016 1:00pm-2:00pm ET Conference Call  
April 20, 2016 1:00pm-2:00pm ET Conference Call  
May 5, 2016 1:00pm-2:00pm ET Conference Call  
May 9-11, 2016 TBD ONUG Spring 2016 – Face-to-Face
Working Group Chairmen
Carlos Matos – Fidelity Investments
Ted Turner – Intuit
Members
Sanjai Narain Applied Communication Sciences
Mengzhu Peng Bank of America
Ravi Malhotra BNY Mellon
Brian Hedstrom DataVision
Maxime Bugat Deloitte
David Stern DISA
Amiri Gonzalez Express Scripts
Kent Brown FedEx
Kevin Irwin Fidelity Investments
James Noble Gaikai
Thom Schoeffling General Dynamics IT
Piyush Gupta JPMorgan Chase
Bob Natale MITRE
Hailu Meng Monsanto
Gideon Tam Memorial Sloan Kettering Cancer Center
Brian Ong Navy Federal Credit Union
Shradha Sharma PwC Advisory Services
Azharul Mannan Standard Chartered Bank
Scott Kline State Farm Insurance
Scott Williams State Farm Insurance
Mahesh Desai Tesla Motors
Sumanth Bopanna Thomson Reuters
Chris Parchinski Unilever
Sean Wang University of British Columbia
Matthew Caesar University of Illinois
Rahul Godha University of Wisconsin-Madison
Phil Hattwick Wellington Management
Zahid Kalwar XL Global Services
Mansour Karam Apstra
Kurt Bales Brocade
Hadi Nahari Brocade
Aziz Abdul Cisco
Ravi Rajamani Cisco
Jeff Gray Glue Networks
Stefan Dietrich Glue Networks
Mike Haugh Glue Networks

Meeting Notes

December 16, 2015

January 13, 2016


FALL 2015

ONUG Use Case Working Groups Schedule: Common Tools for Automated Configuration & Change Management
Date Time
October 1, 2015 2:00pm-3:00pm ET Conference Call  
October 15, 2015 2:00pm-3:00pm ET Conference Call  
November 3-5, 2015 TBD ONUG Fall 2015 – Face-to-Face
Working Group Chairmen
Carlos Matos – Fidelity Investments
Ted Turner – Intuit
Conference Bridge # ONUG Webex
Members

Daldy Rustichel Youbou Biagha Association Congolaise pour le Developpement Agricole

Mengzhu Peng Bank of America
Brian Hedstrom DataVision
Maxime Bugat Deloitte
Kevin Irwin Fidelity Investments
Thom Schoeffling General Dynamics IT
David White Institute of Technology Tallaght Dublin
Piyush Gupta JPMorgan Chase
Bob Natale MITRE
Gideon Tam Memorial Sloan Kettering Cancer Center
Shradha Sharma PwC Advisory Services
Sumanth Bopanna Thomson Reuters
Sean Wang University of British Columbia
Matthew Caesar University of Illinois
Phil Hattwick Wellington Management
Jeff Gray Glue Networks
Stefan Dietrich Glue Networks
Michael Githens Ixia
Pierre Lynch Ixia

Minutes from the Change Management/Automated Initial Configuration Working Group (Fall 2015)

August 13, 2015

September 10, 2015

October 15, 2015 Use Cases (link)

October 1, 2015

Notes from last meeting

  • Previous work identifying metrics and telemetry for all 3 domains (Network / Storage / Compute) overlaps with the network state group, so we will drop the idea
  • Possible integration with controller federation framework
  • Open Network Discovery standard (Stephan)
  • Based on comments from Brian regarding the differences between the two groups in terms of timing and phases, a possible alternative is to identify gaps in the entire management process to define possible use cases, and for the November time frame present recommendations for open systems or solutions in these different gap areas

Some other use cases or ideas (please bring as many as you can think of, and we can vote):

  • Open IPAM and IP Addressing broker solution
  • Open Asset Management / CMDB and asset configuration integration
  • Open switch Configuration and Open switch API standard
  • Shared Policy management configuration
  • Open Event management

August 13, 2015

September 9, 2015

Minutes of the Common Management Tools across Network, Storage, and Compute Working Group (Last Updated Spring 2015)

Carlos Matos – Chairman

This kick-off meeting of the Common Management Tools across Network, Storage, and Compute Working Group clarified its source: issues that arose from fireside discussions and informal gatherings at the Fall 2014 ONUG conference.

In particular, the initial catalyst was the lack of statistics and metrics across different networking practices, and the inability to consolidate information from the different silos.

The ONUG user- and vendor-members wanted to take it further, and find a formal way to present and share their views about common management tools that ran across a gamut of networking dimensions. That was the genesis of this group.

Carlos Matos, the chair of the group, encouraged the members of the group to write use cases, which could find dependencies with use cases generated by other working groups.

The following topics were discussed:

–        Defining the specific tools that need to be considered;

–        Agreeing on the scope and goal of the working group; and,

–        Building the use cases

The group’s first order of business was defining the scope of the Working Group. This brought about an energetic discussion amongst the members.

Some of the questions asked were:

1)     Is the scope of the group limited to monitoring issues or does it encompass orchestration and management as well?

2)     What role would orchestration play in the scope?

3)     Does it include monitoring of private and public clouds?

4)     Does it include statistics for problem management, quality management, usage for billing, and/or traffic capacity planning?

The group decided to look initially at metrics and monitoring, and take on orchestration and management issues later if it seemed feasible.

Specifically, the first iteration of the use case could get into the next generation of collecting metrics, information, and telemetry from all the devices – compute, storage, server farms, and networking equipment – in a given network in different areas. It also involves avoiding current middleware and getting into a big data solution.

Carlos said he would have a well-defined scope following discussions with other chairs and board members, and post it on the ONUG wiki page for comment from other members.

The group’s next meeting is on Feb. 26th.

However, Carlos said he could call for a meeting before that in mid-February for feedback to create a rough draft of the use case and related content.

Additional feedback from vendor-members could result in a final version of the white paper by end of March.

The members can access the wiki page after registering at the ONUG site, and provide feedback.

The plan is to present the white paper at the upcoming Spring ONUG conference at Columbia University in New York City on May 13-14, 2015.

————————————————————————————————————————————————————————-

General Overview

The following document has been created to identify the ONUG use case for Common Management / Monitoring Tools across Network, Compute, and Storage, defining requirements, guidelines, and an overall framework that could help the user community obtain the basis for requesting open, interoperable solutions and information from vendors, and help the vendor and open source communities understand the requirements so they can provide answers.

ONUG’s goal is to facilitate the exchange between communities, identifying opportunities for the creation of new open environments of interoperability and reusability of solutions. As with other ONUG use cases, in this particular one the bar is set high, as we do not look at only one specific vertical, but multiple, trying to understand requirements across multiple solution elements from the monitoring and management perspective.

Methodology, Scope, and Reach

As a level setting, we feel the need to clarify the intent and scope of this effort. As the name of the use case suggests, the opportunities and reach could be very broad, and in the hope of not becoming too generic or overlapping with other important initiatives, we wanted to establish some basis for the scope of this effort:

  1. We could have focused the scope on a specific OS, hypervisor, or embedded system, but as we see multiple, we want to remain agnostic of the OS and concentrate on tools that can openly be used for all of them. In addition, while ONUG’s name may suggest a network focus, we are also including Storage and Compute, as our scope covers all of these elements equally.
  2. The goal is to use technologies and solutions based on open development concepts / licenses to identify tools to monitor, manage, and collect metrics, analytics, forensics, and other insightful information from physical or virtual devices, to help perform event management, troubleshooting, and the new hope to create self-remediation across Network, Compute, and Storage, keeping the key attributes of openness and transparency as the vehicle to realize interoperability.
  3. As for the Management part of the use case title, we want to leverage as much as possible, and not get confused with, other open orchestration efforts like OpenStack, OpenFlow, and OpenDaylight, trying not to overlap on the orchestration / management front.
  4. The concept applies not only to physical bare metal but also virtual, and not only to switches or routers but to all kinds of P+V (Physical + Virtual) appliances, servers, and devices, which have many elements in common, although the challenge may be the commonalities on the software side.
  5. Require the use of open standards and open tools, including the possible adoption and standardization of North Bound APIs.
  6. Evaluate, as part of the management capabilities, configuration management functions that are common for Network, Compute, and Storage.
  7. Require open interoperability and programmatic standards for both the control plane and the data plane across solutions.

We will list and gather as many requirements as possible from our ONUG community, with the idea that the same community votes for the ones that have the highest priority / importance to them.

This paper’s goal is to address the function or capability needed from monitoring and management tools, and it remains agnostic on the product, brand, or make that satisfies those needs.

Executive Summary

The adoption of open, generic, and even open source solutions has become a general practice. With the explosion of cloud solutions, compute, virtualization, and orchestration, we have seen the adoption of the virtualization model not only in compute environments, but also extending the concept to storage and network functions, where multiple tenants co-exist requiring common services that could be either centralized or distributed, and are not only consumed by tenants, but made available to perform a number of actions, including automatically improving the process to capture and process data and the way we analyze it.

More and more, general purpose open operating system environments have been used as a way to replace the custom-made solutions of the past. By the same token, certain areas that were really closed and proprietary environments, for example the network switching and routing segments, have been experiencing a drastic change with the adoption of commodity hardware chipsets or ASICs, which have allowed a model where multiple hardware vendors can be used interchangeably, and now the open OSs and hypervisors are also being ported onto these platforms, bringing similarities and commonalities with storage and compute models.

Regarding these common services, they can be a number of different tools and other peripherals necessary for these distributed or local functions that are used for management, monitoring, security, event management, orchestration, patching, analytics, metrics / state collection, passive and active health-checking, etc., which run as modular applications (we define each module and function on this link) on these platforms and can be commonly used by Compute, Network, and Storage.

Use Cases

1-         Generally available, transparent, and simplified monitoring of the entire infrastructure

  • Collection of metrics and time series data (querying, polling, trapping) across any type of asset on the infrastructure, whether compute, storage, or network systems
  • Use of standard OS metrics rather than case-by-case isolated MIBs
  • Standardized and simplified correlation engines using similar signatures
  • Use of specialized multiplatform telemetry agents
  • Elimination of middleware monitoring agents by having the possibility to report to distributed big data footprints directly (a minimal sketch follows this list)
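
As a minimal sketch of the last bullet above, the following Python fragment reports a single time series data point directly to an OpenTSDB-style HTTP collector, with no middleware agent in between. The collector URL, metric name, and tag values are illustrative assumptions, not part of any ONUG recommendation.

    # Minimal sketch: push one data point straight to a distributed big data
    # footprint (OpenTSDB-style HTTP API), bypassing middleware agents.
    # The collector URL and tag values are assumptions for illustration.
    import json
    import time
    import urllib.request

    datapoint = {
        "metric": "os.cpu.utilization",   # a standard OS metric, not a per-vendor MIB
        "timestamp": int(time.time()),    # epoch seconds
        "value": 42.5,
        "tags": {"host": "leaf-switch-01", "domain": "network"},
    }

    req = urllib.request.Request(
        "http://tsdb.example.com:4242/api/put",   # hypothetical collector endpoint
        data=json.dumps(datapoint).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        print("collector responded:", resp.status)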

2-         Unified management and orchestration of the complete infrastructure

  • Common orchestration and automation for entire Infrastructure management and not only the Private Cloud
  • Standard Service delivery process and improvement for service and catalog management
  • Ubiquitous tools for management, automation and control of the entire infrastructure
  • Control plane management and programmatic capability to customize the data plane of every device
  • Capability to synchronize systems and applications state among multiple regions
  • Standard and unified security enforcement monitoring throughout the entire infrastructure
  • Capability to segment and isolate infrastructure and control it on every asset
  • Unified and standard AAA across all platforms
  • Role based access control across the entire infrastructure
  • Standard auditing capabilities
  • Authorization as a Service

3-         Ubiquitous support and troubleshooting for all environments

  • Common operational tools for the new Infrastructure administrator role rather than the Network/Compute/Storage administrators
  • Elimination of silos between tools used for Compute/Storage/Network
  • Capability to run forensics and self-remediation
  • Simplification of reporting and data capture by using simplified metadata of any flow or metric within any system

Conventions

The following conventions are used throughout this document. The requirements that apply to the functionality of this document are specified using the following convention. Items that are REQUIRED (contain the words MUST or MUST NOT) will be labeled as [Rx]. Items that are RECOMMENDED (contain the words SHOULD or SHOULD NOT) will be labeled as [Dy]. Items that are OPTIONAL (contain the words MAY or OPTIONAL) will be labeled as [Oz].  In addition, a priority value of High (H), Medium (M) or Low (L) may be assigned to each item.

The priority will be labeled as [RHx], [DHy] or [OHz] for High priority, [RMx], [DMy] or [OMz] for Medium priority or [RLx], [DLy] or [OLz] for Low priority.  The integer values {x, y, z} shall be unique across the document but are not required to be unique across the 3-tuple set {x, y, z}.  For example, RM10 and DM10 is allowed whereas RM10 and RL10 is prohibited. Requirements in this document are numbered using increments of 10.

Where needed, related sub-requirements are numbered using increments of 1. The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.  All key words use upper case, bold text to distinguish them from other uses of the words. Any use of these key words (e.g., may and optional) without [Rx], [Dy] or [Oz] is not normative. The priority assignments are defined as follows.

  • High (H):  Functionality that must be supported at day one and is critical for baseline deployment.
  • Medium (M):   Functionality that must be supported, but is not mandatory for initial baseline deployment.
  • Low (L): Desired functionality, which should be supported, but can be phased in as part of longer-term solution evolution.

Problem Statement

The initial issue between these different verticals (Storage, Compute, Network) has been, in most cases, that each discipline is seen as a completely different practice, and even between professionals that work in each area there is little collaboration or communication, with a tendency to create silos. The new environments and challenges derived from the “Everything as a Service” models require an unprecedented level of integration, which leads us to predict the future integration of the engineering roles working all 3 disciplines (Storage / Compute / Network) into the integrated role of the Infrastructure Engineer or administrator, who would also need to heavily integrate a more intimate knowledge of a 4th discipline, “Automation and Orchestration”. Therefore, common tools and unified solutions could be the glue that helps the new Infrastructure Engineer or Operations personnel bridge the gap between them.

Another challenge is to define a common reference architecture for all 3 environments and practices in order to recommend or produce common utilities, services, and modules that can be used interchangeably and that would offer the depth and reach that each discipline requires. Based on this, we have taken the initiative to define a reference architecture that is extensible enough to define all of the interfaces and interrelations between systems and how the common tools and services can be adapted for all of them.

Other issues that present challenges and would also benefit from a common strategy for management and monitoring are:

  • In spite of all the new advances in technology over the years, we still struggle to find the right mix of monitoring for devices and servers. Whether polling, trapping, or collecting the data, correlating, reporting, and performing event management, there are many gaps in how we monitor, what we monitor, and what we really detect
  • The efficiency and effectiveness of monitoring tools offer many challenges. There are many different monitoring products and technologies being used in an overlapping fashion throughout enterprises and networks in general, but there are still elements not covered. The amount of traffic generated by monitoring solutions may in some cases be 30% or more of the total network traffic, and in many cases there are duplicates or triplicates of the same monitoring, either because we resend it to multiple recipients or because multiple solutions monitor and report on the same devices.
  • Legacy devices or technologies do not offer the same disaggregated or open model as in Figure 1, so they will need to be dealt with separately or may not be integrated
  • In some cases, embedded OS platforms, even when using an open OS, may not allow the installation or addition of more applications or complementary tools, which could limit the usefulness of the model. As more powerful processors make their way into bare metal compute, network switches, and other devices, we see vendors offering full distributions of the OS being ported, giving the capability to add a large array of tools and functionality
  • Even where we have the modules and capability to add tools or enable available apps, 3rd party or complementary tools that have not been completely certified or validated by vendors for use on their OS or devices may bring issues, so there could be a need for a certification program, adding more overhead to the process.
  • The lack of standardized Northbound API solutions makes the adoption of specific tools difficult.
  • More and more we see the benefits of automation and orchestration; however, we still see little self-remediation or self-healing capability
  • A number of complementary or 3rd party tools that could be implemented are open source, and in the past several disciplines considered open source to be unreliable. On the contrary, open source has demonstrated not only that it can be a very fast way to develop solutions and solve very complex problems, but also that it can produce rock solid products, though it may require that in some areas people change their precepts or preconceived opinions.

General Reference Architecture for Storage, Compute, Network Monitoring and Management tools

Looking to be as extensive and open as possible, we realized that in order to define solutions across multiple types of systems, we needed to define a reference architecture that could be used generically for all types of systems and that is extensible enough to represent the modules and connectivity points for subsystems and general interfaces. The figure below depicts an alternative view, or model, of modern IT technology. This model is comprised of three aggregated layers that generically represent:

  • Hardware in the form of a physical or virtual chassis;
  • An operating environment that can represent traditional operating systems, virtual machine monitors (hypervisors), or embedded system software environments, along with their requisite system services, runtime libraries, system utilities and related software interface (API) definitions;
  • An application environment that describes software systems layered on the operating environment that define the system’s functional purpose or nature (server, workstation, network switch/router, storage appliance), as facilitated through multiple levels of APIs.

These aggregated layers and their respective constituent layers are intended to be both iterative and recursive; that is, elements of the model may be replicated horizontally, or multiply instantiated hierarchically to construct more complex system-of-systems, or decomposed hierarchically into constituent system elements. In all cases, this model does not rely on specific examples of commercial or open systems solution architectures, designs or implementations to describe the features, capabilities or systemic qualities of the system of interest.

 

The assertion being made here is that all modern IT systems providing compute, network and storage services can be described using the model elements depicted in the figure above and described below. In doing so, the relevant features, capabilities, interfaces and systemic qualities of the desired system will emerge at a sufficient level of detail so as to develop properly formed requirements definitions (statements of need) that are both measurable and testable.

Physical/Virtual Chassis

The Physical/Virtual Chassis aggregation layer models either a physical hardware system or a virtual system that mimics a physical hardware system. The model is intended to describe the constituent elements of system classes comprised of processing systems, storage systems, or networking systems; that is, servers and workstations, network switches and routers, storage arrays and appliances, or the aggregation of these elements into special-purpose devices or the systems-of-systems that define a modern IT environment.

 

Hardware – Represents the physical or virtual hardware expressed in the form of a “chassis”. All modern IT systems share some or all of the constituent system elements shown in the figure to some varying degree. Every IT system has some form of processor; be it a traditional microprocessor, a scalable multi-core embedded processor, a graphical processing unit (GPU) or a specialized field-programmable gate array (FPGA), or application-specific integrated circuit (ASIC), or other processing architecture. Every IT system has some form of memory or temporary/volatile storage (although this is now converging with non-volatile storage); currently implemented using semi-conductor technologies. Every IT system has some form of non-volatile storage (data that persists across system invocations); be it magnetic, flash, or other semi-conductor based technology.

In modern systems, every device has some form of network connectivity, or an interface that facilitates communications with external systems. Most modern systems have some form of low-level interface that enables direct communication with the underlying hardware. And many modern systems have some form of hardware security implant (embedded subsystem) designed to increase the work-factor or level-of-difficulty on the part of an attacker attempting to compromise the system. All of these elements exist to some degree in physical systems and are emulated, exposed or otherwise abstracted to some degree in virtualized systems. As such, this model can be used to represent various classes of servers, storage and network systems, in either the physical or virtual world, by varying the number and type of processors, memory, storage, network, peripherals and/or security devices that characterize each discrete system, or expressing those objects as parameterized software abstractions.

 

Processor – Represents all forms of single-instruction, multiple data (SIMD) and multiple-instruction, multiple-data (MIMD) processing architectures and derived variants. This includes conventional workload CPUs (e.g., x86, SPARC, ARM), as well as highly integrated system-on-a-chip (SoC), GPUs, GPGPUs and special-purpose processing devices (FPGAs, DSPs, custom ASICs). This also includes multi-processor systems (SMP, ccNUMA, NUMA). In the physical world, characteristics of interest include specifics about processor capabilities, processing units (cores), cache levels and sizes, clock speeds, workload efficiency, memory management, integrated subsystems (MMU, PCI, USB, network), hardware instruction sets (x86_64, SPARCv9), instruction set extensions for important operations such as virtualization, message switching/routing, encryption and floating-point compute (AES, VT-x, VT-d, AVX2, AMD-V), power and thermals. In the virtual realm, important characteristics include virtual machine instruction sets (byte-code), processor and socket affinity, peripheral pass-thru (GPU, PCI, USB, storage) and execution caps.

 

Memory – Represents the system’s primary runtime storage subsystem holding both data and instructions. Memory is closely coupled to processor subsystems and includes data storage technologies/devices that may or may not persist data across system invocations. In the physical world, characteristics such as addressable space, density, access latency, throughput, data/address bus width, error correction and power consumption are of interest. In the virtual domain, addressable space, latency, isolation and oversubscription are of interest.

 

Storage – Represents the system’s secondary storage subsystems, typically holding system and application software and data in a persistent state that is maintained across system invocations. In the physical world, storage density, throughput, IOPS, interfaces, protocols, encryption, dedup, and compression are key operational parameters, along with systemic qualities related to redundancy, resiliency, recoverability, security, and manageability, and form-factor and power/thermal characteristics. In the virtual scope, disk image formats (VMDK, VDI, HDD, VHD, QED, QCOW), image write modes (dynamic allocation, differential, immutable, write-thru, sharable, multi-attach) and interface emulation are the focus.

 

Network – Represents all forms of system-to-system communication, including classical network stacks and topologies (OSI model) and fabrics that implement communication protocols. In the physical domain, this represents infrastructure devices (switches, routers, fabric concentrators), telecom service provider demarcs, converged and non-converged device interfaces, wired transmission media, wireless transceivers and their management functions. In the virtual realm, this represents software defined network (SDN) elements that facilitate connectivity between objects through network function virtualization (NFV), their protocols and related management functions. With the exception of L1 and L2 protocols, almost all higher layer protocols are implemented in software (one exception being encryption); thereby enabling NFV that facilitates SDN in virtualized systems.

 

Peripheral Interfaces – Represents all input/output device interfaces to the system, including human interface devices (keyboards, pointing devices, displays, haptic devices, PCI, USB, SAS/SATA, SCSI, HDMI, DP).

 

Security Modules – Represents purpose-built embedded devices (implants) integrated within the system for ensuring the runtime confidentiality, integrity, and availability of the system with respect to identification, authentication, and authorization.

Firmware – Represents low-level software subsystems specifically designed to manage embedded hardware functions; often used to initiate runtime of higher order software systems.

 

Low-Level Interfaces – Represents minimal communication methods between a system or device and an external system or human; typically using a serial protocol (serial console, JTAG).

P/V Chassis Characteristics – Represents the characteristics of a physical or virtual chassis. In the case of a physical chassis, this would include characteristics or specifications related to size, mass, construction materials, cooling fans, power supplies and other supporting physical attributes of the chassis. In the case of a virtual chassis, this may include characteristics related to how a hypervisor or other virtual machine monitor emulates devices, arbitrates access, and allocates the physical resources of the virtualized host on behalf of the guest operating environment, such as device emulation, processing execution caps, memory oversubscription, peripheral passthru, and the like.

 

Operating Environment

The Operating Environment aggregation layer models the system software (at various layers) that facilitates management and use of the underlying physical or virtual hardware. This layer represents the software system elements that enable the physical or virtual hardware platform to fulfill its function. The degree of simplicity or complexity within each software element described in this layer is boundless. In all cases, the primary function of the operating environment is to manage and arbitrate access to physical resources, or the virtualized abstractions of those resources.

 

Kernel – At the heart of the operating environment is some form of kernel; a low-level software system with exclusive access to the underlying platform that manages and arbitrates access to physical or virtual hardware resources (processor(s), memory and devices). The kernel typically implements some form of a hardware abstraction layer to facilitate portability across different device architectures. Kernels schedule work to be done with platform resources through process management (multi-tasking, multi-processing, inter-process communications, device reservations) where access to resources is arbitrated on behalf of higher-level system services and applications. Kernels are designed and implemented using various architectures (monolithic, microkernels, modular kernels, others).

 

Kernel Modules – Represents loadable/pluggable software components used to extend kernel functionality to support some specific feature, capability or device not part of the baseline or core kernel. Not every operating environment employs kernel modules. For classical operating systems, kernel modules are developed to support features and functionality (file systems, encryption, communications) requiring privileged access to resources (processor, memory, peripherals). Hypervisors employ kernel modules to perform memory address allocations, processor context switches or implement advanced peripheral (device emulation) features on behalf of a guest system.

 

Drivers – Represents a special class of kernel modules that may or may not be loadable, but can be built (compiled in) with the core kernel software elements. Drivers typically implement interfaces for new hardware or software systems or features (cut-through switching) developed after the baseline kernel.

 

Kernel APIs – Represents software interfaces that allow system software, utilities or other higher-level systems to programmatically access kernel managed system resources and services in a controlled and coordinated manner (system calls). These include in-kernel APIs used to manage interaction between kernel subsystems.

 

System Services – Represents system-level run-time processes integrated within the operating environment that perform higher level functions than those provided by the kernel. These services often run in a privileged mode and are typically responsible for the run-time configuration and management of the system (init systems, monitors, load balancing).

 

System APIs – Represents programmatic software interfaces that facilitate communications between system services and other system services, as well as applications needing run-time support from the system.

 

Runtime Libraries – Represent loadable (memory mapped) object code that provides some generic functionality to system or application processes, callable during execution as a library routine facilitated through dynamic linking of code segments.

System Utilities – Represents software facilities integrated within an operating environment that perform operational control and management functions to configure and/or maintain the environment according to a defined strategy or policy for use by application processes.

Application Environment

The Application Environment aggregation layer models the software elements that are deployed on a system to support end-users. Typically, this aggregation layer is closely associated with traditional processing, end-user oriented systems (desktops, servers, mobile devices), but can be extended to describe the adaptation of network- and storage-centric systems that permit end-user customization (appliances, cloud services).

 

Applications – Represent a class of software processes that perform specific functions or tasks on behalf of an application consumer (end-user). Applications may include software processes that run in unprivileged contexts, or those that integrate application functions with system functions through application APIs and system APIs (Type-II hypervisors).

 

Application APIs – Represents programmatic interfaces to a software application process.

Definition of Common Monitoring and Management Services

The use of closed, proprietary system solutions and closed control planes in the past made the integration or combination of tools and utilities across different types of systems very difficult. As the use of common operating systems, open solutions, architectures, and interoperability between platforms has been made available across Compute / Storage and Network equipment, that tendency has started to reverse itself.

With the possibility to include standard core utilities, automation / orchestration, operationally rich system services, and even complementary or third party tools, the new platforms can offer a truly open environment that could be characterized as disaggregated services, where the power comes not only from the functionality of the main platform’s purpose but from the sum of the parts.

The picture below shows, in general terms, the adaptation of multiple different groups of services to the system structure.

 

As seen in the reference above, there are a number of modular entities in which specific services can be provided or offered to achieve those functions. The vision is not limiting but inclusive: as we provide ideas and examples of what the tools could be or how they can integrate with the different modules of the solution, many more ideas will come on how to adapt and integrate all of the functions and features. In general terms, the modules identified above can be categorized into the following groups:

A)     Management and Orchestration Services, to offer integration with the more popular orchestration systems, and also to create flexible references that can be used to adapt to multiple automation options and integration with well-known devops trends.

  1. System Utilities (native or original functionality defined by the solution creator, covering the general software utilities integrated within the operating environment that perform the main operational control and management functions)
  2. Complementary / Collateral application Services (a.k.a. 3rd party apps or additional applications) that can be integrated or added into the solution as a way to adapt existing tools that can help improve functionality, adapt to existing requirements, or even reuse / integrate tools and products that have demonstrated their usefulness or compliance with standards and open solutions
  3. System Services (system-level run-time processes integrated within the operating environment, often running in a privileged mode). In this case the big difference is that specific functions, products, or services can truly be ported with a higher level of control over how the resources of the system are shared or given access to, making it possible to create service virtualization and a great platform for combined infrastructure-as-a-service integration. As an example, you could envision a network TOR switch that could also offer firewall and load balancing capabilities among others, and many other examples applicable to Storage and Compute as well.
  4. The Automated Provisioning Service, a.k.a. Zero Touch Provisioning (ZTP), for Compute, Storage, or Network, which allows systems to automatically boot, get an address, install the right OS, and obtain the right configuration via automation with minimal intervention (a minimal sketch follows this list)
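
To make item 4 more concrete, below is a minimal, hypothetical sketch of a ZTP bootstrap as it might run on a freshly booted system. The provisioning URL (e.g., learned via DHCP) and the response fields are assumptions for illustration only.

    # Hypothetical ZTP bootstrap: a freshly booted system asks a provisioning
    # server what it should become, fetches its configuration, and stages it,
    # all with minimal human intervention. URLs and paths are assumptions.
    import json
    import urllib.request

    PROVISIONING_URL = "http://ztp.example.com/bootstrap"  # e.g., handed out via DHCP

    def ztp_bootstrap(serial_number):
        # 1. Ask the provisioning server for this device's intended role.
        with urllib.request.urlopen(f"{PROVISIONING_URL}?serial={serial_number}") as resp:
            plan = json.load(resp)  # assumed shape: {"os_image": ..., "config_url": ...}

        # 2. Fetch the rendered configuration for this specific device.
        with urllib.request.urlopen(plan["config_url"]) as resp:
            config = resp.read()

        # 3. Stage it where the local install/apply hook expects it; a real
        #    implementation would also verify the OS image and then apply/reboot.
        with open("/tmp/startup-config", "wb") as fh:
            fh.write(config)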

In the following sections we will go into detail on each of these areas, showing not only the potential and use cases, but also the requirements that have been defined for each of them.

General Model Requirements (See Appendix for definitions and disambiguation)

R-xx: Each system element within the system-of-interest shall be capable of reporting its current state to an external subject, within the context and control of an integrated or cooperating access control system implementing a defined access control policy.

Priority: High

 

R-xx: Each system element within the system-of-interest shall implement an interface and protocol capable of receiving subject requests (queries) for object state data and reporting such data to a subject authorized to receive it.  A subject shall be able to request multiple object state elements within a single request.  The reporting of state data may be managed and delivered through an intermediary system operating under the same access control policy as the system element originating the state data.

Priority: High

 

R-xx: The interface and protocol shall include a method to report an unsuccessful attempt to request object state that provides human interpretable diagnostic messages describing the relevant reasons or causes leading to an unsuccessful attempt, at a sufficient level of detail so as to be actionable by a human subject.

Priority: High

 

R-xx: Data for each object state element reported by a system element within the system-of-interest shall consist of at least the following data items:

unique object identifier – a token or other identifier that uniquely identifies the object within the scope of the system-of-interest;

object state element label – an unambiguous label, or tag, identifying the object state element being reported;

object state element value – the value assigned to the object state element being reported, at the time it is requested;

object state element datatype – the data type of the object state element value being reported;

object state element units – the value units of the object state element being reported, or an indication of a value that is unit-less;

object state element mutability – an indication as to whether or not the object state element value is mutable or immutable by a subject;

object state element permissions – an indication of the actions the requesting subject is permitted to perform on an object state element that is mutable;

object state element timestamp – a date/time stamp as described by ISO 8601/RFC 3339, derived from the system’s internal clock, with resolution of at least fractions of a second, indicating when the object state data was retrieved by the object.

Priority: High
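
For illustration only, a single reported state element satisfying the data items above might look like the following Python structure; the identifier, label, and values are hypothetical.

    # Hypothetical object state element report carrying all required data items.
    state_report = {
        "unique_object_identifier": "chassis-7/psu-1/inlet-sensor",
        "object_state_element_label": "inlet.temperature",
        "object_state_element_value": 38.5,
        "object_state_element_datatype": "float",
        "object_state_element_units": "degrees Celsius",   # SI units by default
        "object_state_element_mutability": "immutable",    # a sensor reading
        "object_state_element_permissions": ["read"],
        "object_state_element_timestamp": "2016-02-10T14:03:22.174Z",  # ISO 8601/RFC 3339
    }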

 

R-xx: By default, data for applicable state elements reported from an object within the system-of-interest shall be reported using International System of Units (SI Units).  If the object or parent system provides a conversion method to other measurement reference systems, then the object or parent system shall provide an interface for subjects to set or select the desired measurement reference system.

Priority: High

 

R-xx: Each system element within the system-of-interest shall be capable of receiving requests from external subjects to alter its current state, within the context and control of an integrated or cooperating access control system implementing a defined access control policy.

Priority: High

 

R-xx: Each system element within the system-of-interest shall implement an interface and protocol capable of facilitating subject requests to alter an object’s state by means of a method provided by the object to the subject through the interface.  The subject shall specify through the object method what object state element is to be altered by providing the subject’s desired transition values for the object state element.

Priority: High

 

R-xx: Each system element, upon receiving a request from a subject to alter its state shall validate the syntax and semantics of the request.  If the alter request is deemed valid, then the object will attempt to satisfy or execute the request.  If the alter request is deemed invalid, then the request is rejected and the subject is notified of the rejection in a manner that facilitates an actionable response to the rejected request.

Priority: High

 

R-xx: Each system element, upon receiving a validated request from a subject to alter its state shall attempt to satisfy or execute the request on behalf of the subject.  If the alter request is successful, then the object shall notify the subject of successful state change and report (as described in R-?? above) to the subject the new current values for each object state element that has been altered.  If the alter request is unsuccessful (fails), then the subject is notified of the unsuccessful attempt in a manner that facilitates an actionable response to the unsuccessful request.

Priority: High
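
A minimal sketch of the validate / execute / notify flow described by the two requirements above; the element names, object methods, and response shapes are hypothetical.

    # Hypothetical alter-state handler: validate the request, attempt it, and
    # notify the subject with an actionable result in every outcome.
    MUTABLE_ELEMENTS = {"fan.speed.setpoint", "port.admin_state"}

    def handle_alter_request(obj, element, desired_value):
        # Validate syntax and semantics; reject with an actionable reason.
        if element not in MUTABLE_ELEMENTS:
            return {"status": "rejected",
                    "reason": f"'{element}' is not a known mutable state element"}
        try:
            obj.set_state(element, desired_value)   # assumed object method
        except Exception as exc:
            return {"status": "failed",
                    "reason": f"could not alter '{element}': {exc}"}
        # On success, report the new current value back to the subject.
        return {"status": "success", "new_value": obj.get_state(element)}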

 

R-xx: Each system or system element (hardware or software) within the system-of-interest capable of reporting state shall be able to report the following immutable object state elements to an authorized subject:

uid – an identifier for the system or system element that is unique within the system-of-interest;

name – a human interpretable symbol or moniker unique to the system or system element;

short description – a human readable descriptive summary of the system or system element not to exceed one sentence;

long description – a human readable descriptive statement of the system or system element not to exceed one paragraph;

supplier – the source or originator of the system or system element (manufacturer, vendor, software publisher, open-source community);

creation date – the supplier provided date/time when the system or system element was manufactured or otherwise created;

install date – the date/time when the system or system element was introduced into the system-of-interest;

Priority: High

 

R-xx: Hardware systems and system elements capable of reporting state shall report the following additional immutable object state elements to an authorized subject:

model – the supplier designation for the system or system element that identifies it as a particular product or product group;

revision – the supplier designated engineering change version for the system or system element that uniquely identifies the model variant as provided;

serial number – the supplier designated unique identifier for the system;

parent – the system for which this system is a system element;

power-on hours – the accumulated number of hours since installation that the system has been consuming power under nominal operation;

Priority: High

 

R-xx: Software systems and system elements capable of reporting state shall report the following additional immutable object state elements to an authorized subject:

version – the supplier designated software revision for the system or system element that uniquely identifies the software package;

release – the supplier designated software version variant provided to a consumer;

install location – the base install path where the software system element is installed;

software group – the family of software systems to which this software belongs;

software license – the license under which the software is released;

package filename – the container file holding all of the software elements comprising this software system;

package size – the uncompressed size in bytes of all software elements in the package;

package signature – the cryptographic hash/key, signing date, and hash algorithm used to verify integrity of the software package;

package manifest – the contents of the software package;

Priority: High

 

Reference Modules, Services Definition, and Requirements

The idea of disaggregated services is not only applicable to all system services; as can be seen in the picture, there are many elements that could be commonly used among them.

  • As a general note, the use of a common OS (Linux, Windows, Solaris, Mac OS X, AIX, FreeBSD, NetBSD, and OpenBSD, or common hypervisors such as ESXi, XEN, KVM, etc.) for Storage, Network devices, or Compute brings the capability, in each case, for all of the same tools and applications that run on those common OSs to be used by all 3 disciplines. The concept is adaptable to both physical and virtual devices/appliances/servers

 

Orchestration and Automation Access Points:

As we have kept a vendor- and product-agnostic focus, we distinguished the level of integration, interrelations, and interactions that automation and orchestration have with several open source projects, which we have tried to leverage instead of reinventing the wheel for orchestration and management, as can be seen in the image for some of the most important open system projects. On the other hand, it is important to recognize, as part of our reference, that these orchestration solutions, whether outsourced or built internally, could have multiple levels at which they interact with the reference architecture, depending on the type of system being provisioned or the function being automated, such as:

  • An own orchestration manager or automation engine containing the North Bound API where users will make requests, also containing or connecting to automation adapters and element / configuration managers that control and execute the automation internally
  • Devops orchestration solutions that use a client or agent-based model, requiring the agent to be installed on the system and connected securely to an orchestration server executing the desired flows
  • Devops orchestration solutions that use an agentless model, requiring only that the system be securely connected to an orchestration server executing the desired flows (see the sketch after this list)
  • Integration with third party “Software Defined Everything” controllers via plugins or other integration points that may control the system as a whole or a portion of its operations
  • As we have leveraged some of the common open management and orchestration solutions like OpenStack, OpenDaylight, and OpenFlow, there is an understanding that the automation capabilities will not only be able to use and interface with devops clients, but that, depending on the function, there would be support for some of the common protocols for configuration and management like Netconf, OVSDB, LISP, YANG, data plane programmability (see Core System Utilities section), etc.
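
As one hedged illustration of the agentless model above, an orchestration server might connect out to a managed system over SSH and run a flow there. This sketch assumes the third-party paramiko library; the hostname, user, and command are placeholders.

    # Agentless sketch: the orchestration server reaches the managed system
    # over SSH; nothing is installed on the device itself. Host, credentials,
    # and the command are illustrative assumptions.
    import paramiko

    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect("tor-switch-01.example.com", username="automation",
                   key_filename="/etc/orchestrator/id_rsa")
    try:
        stdin, stdout, stderr = client.exec_command("show running-config")
        print(stdout.read().decode())
    finally:
        client.close()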

Orchestration and Automation requirements

R-10: Support and integration of the most common and popular devops orchestration agent or agentless systems, in order to not only create automation scripts, but also reuse the common automation methods and reusable scripts that are available for them

Priority: High

R-20: Support and integration, depending on the case, of the most common and popular open automation solutions and stack platforms

Priority: High

R-30: Support for “Software Defined Everything” (SDX); as the most commonly used products for automation have been in reference to Software Defined Networking, other products for storage and compute may be available, but this may require a hybrid approach using SDX, devops, and stack solutions

Priority: High

R-40: Identify a complete strategic evaluation and position recommendation on how automation and orchestration will be utilized in your organization, what short and long term plans will be implemented, and what tools will be used to automate the provisioning and change management capabilities for Storage / Compute and Network, including roadmap, tools, and feasibility analysis

Priority: High

R-50: Secure connectivity between the orchestration engine or clients and the system being controlled, requiring elements of Confidentiality, Integrity, and Authorization to gain access

Priority: High

R-60: Role based access control integrated security for the Automation API, where users are controlled on what they can or cannot request; ideally this can be implemented with a security driven solution implementing Authorization as a Service.

Core System Utilities Use Cases

Core System Utilities:

As we have explained before, we have defined System Utilities as the software facilities integrated within an operating environment that perform operational control and management functions to configure and/or maintain the environment according to a defined strategy or policy for use by application processes.

  • In other words, simply put, we refer to these as the tools that the vendor or organization creating the platform includes as part of the solution to manage and control it. They may be thought of as proprietary, but they can be as open as possible. Included here are, but not limited to:
  • UI/API/CLI that allow managing the solution and accessing the system
  • Control Plane Management and a Programmatic API, in the event that the solution is a Software Defined Everything where a specific controller will be managing the Control Plane directly
  • Possible integration point with open Data Plane programmatic capabilities or standards that can program the data plane on different types of systems (as an example of possible open solutions mapping to this space, we see the Open Compute Project SAI and other initiatives that are being used to program the Data Plane of devices)
  • OS metrics, statistics, and reporting that allow us to obtain metrics and collect any possible value regarding performance of the system, including any information regarding low level functions in the reference architecture (see the sketch after this list)
  • Access management, Role Based Access Control capabilities, the capability to enforce segmentation and isolation, full audit capabilities, and authorization as a Service for permissions that are required on a per session basis
  • Performance / QoS metrics, telemetry, analytics, and statistics of the core platform
  • Possible native support for a time series reporting application for all of the system metrics that can be reported to a distributed database or big data solution.
  • Core functionality Ops and Management (Syslog, SNMP, Event MGMT, Traps, Netflow, SFlow, etc.) that could be polled, trapped, or constantly streamed as metadata
    • As an opportunity for next generation monitoring systems, a reporting mechanism that can continuously report time series or other metrics as metadata or continuous information could replace other mechanisms that have not been very efficient over time at detecting time sensitive thresholds proactively.
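
As a small sketch of the OS metrics bullets above, the same handful of standard OS-level calls can run unchanged on a Linux-based switch, storage appliance, or server. This assumes the cross-platform psutil library; the metric names are illustrative.

    # Sketch: sample a few standard OS metrics and emit them as timestamped
    # metadata, suitable for streaming to a time series store. Assumes psutil.
    import time
    import psutil

    def sample_os_metrics():
        return {
            "timestamp": int(time.time()),
            "cpu.percent": psutil.cpu_percent(interval=1),
            "mem.percent": psutil.virtual_memory().percent,
            "disk.percent": psutil.disk_usage("/").percent,
            "net.bytes_sent": psutil.net_io_counters().bytes_sent,
        }

    print(sample_os_metrics())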

Core System Utilities requirements

R-70: Capability to collect metrics and telemetry from every aspect of the OS features and properties

Priority: High

R-80: Depending on the type of device or server, multiple metrics should be available interchangeably as part of the packages, covering Storage, Compute, and Network

  • Storage: collection of metrics related to storage, like input/output (I/O) operations, response time for operations to be completed, bandwidth being moved, storage interface utilization, and space availability
  • Compute: CPU, memory consumption, load averages, number of virtual workloads per hypervisor, resource allocation, number of processes, and CPU per process
  • Network: switch port availability, utilization, QoS metrics, queue sizes and buffers, routing statistics, etc.

Priority: High
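
One hypothetical way to express R-80's interchangeable packages is a common catalog keyed by domain; the metric names below are illustrative, not a proposed standard.

    # Hypothetical catalog of per-domain metric packages, per R-80.
    METRIC_PACKAGES = {
        "storage": ["iops", "op.response_time", "bandwidth",
                    "interface.utilization", "space.available"],
        "compute": ["cpu.percent", "mem.consumption", "load.average",
                    "workloads.per_hypervisor", "cpu.per_process"],
        "network": ["port.availability", "port.utilization", "qos.queue_depth",
                    "buffer.usage", "routing.stats"],
    }

    def metrics_for(domain):
        # Return the metric package for a device's domain, empty if unknown.
        return METRIC_PACKAGES.get(domain, [])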

R-90: Availability of an API (Application Programming Interface), and the capability to request any action that can be executed, including metrics collection on the platform, in addition to CLI and GUI.

R-100: Capability to connect the Control Plane MGMT element via Controller, API, or other mechanisms, and be able to provision all new settings and make changes: add, update, or delete settings or previous configuration

Priority: High

R-110: Capability to connect the Control Plane MGMT element of the system via Controller and execute on-demand monitoring, configuration management, and archiving, and determine issues where the system health may be compromised

Priority: High

R-120: Capability to trigger verifications, forensics, data collections, and on demand auditing capabilities to keep control of service integrity and allow for time-based validation, in addition to possible traffic captures, redirections, policy based routing, etc., on-demand or programmatically

Priority: High

R-130: Capability to program the Data Plane of a device using open standards like OCP SAI, Broadcom NSL, and P4.

Priority: High

R-140: Capability to use authorization as a Service based on a Role Based access control workflow to apply policy and permissions governing who can access the different applications within the device / server

Priority: High
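
One way to picture R-140 is a per-request check against a central authorization service before an action runs on the device / server. The service endpoint and payload shape below are assumptions for illustration.

    # Hypothetical authorization-as-a-Service check: before an action runs,
    # ask a central policy service whether this subject's role permits it.
    import json
    import urllib.request

    AUTHZ_URL = "http://authz.example.com/v1/check"   # assumed policy endpoint

    def is_authorized(subject, role, action, resource):
        payload = json.dumps({"subject": subject, "role": role,
                              "action": action, "resource": resource}).encode()
        req = urllib.request.Request(AUTHZ_URL, data=payload,
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req, timeout=3) as resp:
            return json.load(resp).get("allow", False)

    if is_authorized("cmatos", "network-operator", "config.write", "tor-switch-01"):
        print("change permitted")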

R-150: Capability to synchronize systems and applications state among multiple regions, with standard and unified security enforcement monitoring throughout the entire infrastructure

Priority: High

R-160: Capability to segment and isolate the infrastructure based on security groups, and to control it for every asset

Priority: High

R-170: Capability to offer standard AAA (Authentication, Authorization, Accounting) across all platforms, and Role based access control across the entire infrastructure. (The problem today: Storage and Compute use LDAP, while Network uses TACACS / RADIUS)

Priority: High

R-180: Offer complete auditing capabilities covering actions executed by every user and any changes or updates on the system

Priority: High

R-190: Offer the capability to enforce security and act as an element of authorization as a Service, where, if required, every request to change the configuration or make modifications, provisioning, or changes will require a separate authorization that can be handled by an automated central system using enforcers

Priority: High

R-220: A reporting mechanism that can continuously report time series or other metrics as metadata or continuous information, which could replace other mechanisms that have not been very efficient over time at detecting time sensitive thresholds proactively.

Priority: High

R-230: Usage and promotion of standardized front side APIs for Compute, Storage, and Network devices / servers

Priority: High

R-240: Capability to enable event management via Syslog, SNMP Traps, NetFlow, and SFlow, to either a centralized agent or to a distributed aggregator

Priority: High
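
As a small illustration of R-240, Python's standard library alone can forward an event to a centralized syslog aggregator; the collector address is an assumption.

    # Sketch: forward an event to a central syslog receiver over UDP/514
    # using only the standard library.
    import logging
    import logging.handlers

    logger = logging.getLogger("onug.events")
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.handlers.SysLogHandler(
        address=("syslog.example.com", 514)))   # assumed aggregator

    logger.warning("link flap detected on Ethernet1/12")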

R-250: Health-checking and reporting of status for processes, hosts, storage, or operational functions, depending on the feature: UP/DOWN proactive and reactive state reporting (polling/trapping).

Priority: High
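
A minimal illustration of the proactive (polling) half of R-250: a TCP connect test reporting UP/DOWN for a service port. The host and port are placeholders; reactive trapping would complement this.

    # Sketch: simple UP/DOWN poll via a TCP connect test.
    import socket

    def is_up(host, port, timeout=3):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    print("UP" if is_up("db01.example.net", 5432) else "DOWN")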

R-260: Capability to use Security Information and Event Management (SIEM) to generate alarms and notifications for problem conditions, and to correlate those alerts with specific actions and a specific level of awareness

Priority: High

Complementary / Collateral Application Services

This module comprises the applications or application services that can be integrated or added into the solution as a way to adapt existing tools: to improve or complement functionality, operational support, and troubleshooting; to adapt to existing requirements; or to reuse and integrate tools and products that have demonstrated their usefulness or their compliance with standards and open solutions.

This category refers to any additional applications that could be added to the solution as a separate installation on the OS, or as an option if the vendor allows apps on the platform (an app-store-like capability). These could include open source or proprietary tools, and are not limited to these.

The difference between these applications/utilities and the System Services is that System Services can be set to run in privileged mode and be assigned priority and resources, whereas Complementary/Collateral Applications may run best-effort unless an SLA can be set up with the provider of the system or solution.

Some vendors creating or developing systems may choose to include an existing, well-established, proven tool for one of these functions instead of building their own, given how mature some of those features and software are. By the same token, an enterprise can choose to install the same app on all storage/compute/network elements and pull or push any information.

Note: In order to stay vendor/product agnostic, we did not include app names or vendors that may apply to this category, and we only mention a number of open applications in different categories. In reality, you can think of the app-store analogy: if you can think of a tool you want, there is an app for it, or you can create one.

  • Metric collection and reporting of time series to big data collectors. We expect many monitoring agents to become able to send data or metadata directly to Hadoop or other big data solutions; applications such as OpenTSDB can already do this today, sending time-series data for many collected metrics while maintaining a distributed database (a minimal push sketch follows after this list)
  • Forensics, verification, and self-remediation tools. With programmatic capabilities such as SDN and other automation tools, monitoring apps could gather forensics, look for a specific action or event, and trigger or execute actions to self-remediate or mitigate, among many other possible actions
  • 3rd-party agents that may execute any specific collection or task on the platform, pushing data to an external platform for collection or to an internal process or application running analytics on the collected data
  • Packet capture tools that can sniff traffic from either physical or virtual interfaces on demand and store the data for troubleshooting purposes
  • sFlow, NetFlow, IPFIX, SNMP, Syslog, and trapping clients that can run all of these protocols and send data to multiple receivers
  • Security applications and tools:
    • Software compliance and host-check validation
    • Security patches and vulnerability testing
    • Security automation for policy validation
    • Distributed firewalls and multi-tenancy policy enforcement
    • Isolation compliance validation
    • Internal auditing automation
    • Boundary and use-case compliance requiring separation between environments such as Dev, Test, QA, Prod, etc.
    • Creation of quarantine zones
  • 3rd-party monitoring and metrics collection, including Network Performance Management and Application Performance Management tools from multiple vendors in the market, delivered as apps
  • Service Virtualization / Network Function Virtualization. With the increased capacity of systems and more powerful processors, it is possible to install a complete virtualized image of a service or appliance as a complementary application serving a storage, compute, or network function; any type of virtualized function could be set up as another application and used for any purpose
  • The difference between setting up NFV or SFV at this level rather than at the System Services level is that, as a collateral or complementary app, we cannot define the priority or the resources to be used for the service or function, so it may run as a best-effort service
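
As promised above, here is a minimal sketch of the first bullet in this list: pushing one time-series datapoint to an OpenTSDB collector over its HTTP /api/put interface. The collector host is an assumption.

    # Sketch: push one datapoint to an OpenTSDB collector's HTTP API.
    import json
    import time
    import urllib.request

    def put_datapoint(metric, value, tags):
        body = json.dumps({
            "metric": metric,
            "timestamp": int(time.time()),
            "value": value,
            "tags": tags,            # OpenTSDB requires at least one tag
        }).encode()
        req = urllib.request.Request(
            "http://tsdb.example.net:4242/api/put",   # assumed collector host
            data=body,
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req, timeout=5)

    put_datapoint("sys.cpu.user", 42.5, {"host": "web01"})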

As a general comment: in the past, monitoring solutions on the compute and storage sides have shown a high dependence on syslog and other polling mechanisms, while the network side depends heavily on syslog and on SNMP agents for trapping and polling. With the proliferation of APIs and the mainstream adoption of big data technologies, we see a move toward much more advanced, distributed, and scalable metric collection from each device or server, with new capabilities to extract and post metrics, and to report any type of event directly into big data with no intermediaries or middleware, enabling self-remediation or self-healing through automated and programmatic functions.

Independent of whether this application module allows multiple different types of applications, there is an opportunity to create an open, certifiable, and testable program under which applications can be certified to work with open systems on the application module.

Complementary / Collateral Application Services Requirements

R-270: Availability of tools to collect forensics at specific times when issues or events occur. Capability to auto-detect problems based on behavior or heuristics, or on demand as other telemetry is gathered for the resource

Priority: High

R-280: Capability to store or send that forensic data to specific locations as metadata, or via some lightweight protocol from which the specific problem can easily be reconstructed

Priority: High

R-290: Capability to perform automated event management and alarming based on thresholds or policy, triggered by forensics or other collections, and to execute on-demand requests to other APIs or systems as needed

Priority: High
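
To make R-290 concrete, the sketch below evaluates a sample against a policy threshold and posts an alarm to another system's API when it is breached; the policy and receiver URL are illustrative only.

    # Sketch: threshold-based alarming against a (hypothetical) event receiver.
    import requests

    POLICY = {"metric": "disk.used.percent", "threshold": 90.0}

    def evaluate(sample):
        if sample["name"] == POLICY["metric"] and sample["value"] > POLICY["threshold"]:
            requests.post(
                "https://events.example.net/api/alarms",   # hypothetical receiver
                json={"severity": "critical", "sample": sample},
                timeout=5,
            )

    evaluate({"name": "disk.used.percent", "value": 93.2, "host": "nas01"})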

R-300: Capability to run data captures on demand with well-known open sniffers such as Wireshark or TShark, or after certain thresholds or policies are met

Priority: High
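
One way R-300 could be triggered programmatically is shown below: a bounded on-demand capture using the TShark CLI named in the requirement. The interface, duration, and output path are assumptions, and tshark must be installed on the host.

    # Sketch: trigger a bounded on-demand capture with tshark.
    import subprocess

    def capture(interface="eth0", seconds=30, outfile="/tmp/ondemand.pcap"):
        subprocess.run(
            ["tshark", "-i", interface, "-a", f"duration:{seconds}", "-w", outfile],
            check=True,
        )

    capture()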

R-310: Transaction validation for application performance across environments (Web, App, DB) and monitoring of network performance, to evaluate multiple environments and determine where problems are. A tool capable of querying the members of an application for metrics on a per-transaction basis could identify issues along the path, because all of the disparate elements speak the same language and the tool can build a map of delays and possible causes for the entire layered application flow

Priority: High
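
A hedged sketch of the per-transaction query described in R-310: because every tier is assumed to expose the same (hypothetical) metrics endpoint, a tool can ask each member for its handling time of one transaction ID and assemble a delay map for the layered flow.

    # Sketch: build a per-tier delay map for one transaction (all endpoints hypothetical).
    import requests

    TIERS = {"web": "web01", "app": "app01", "db": "db01"}   # hypothetical members

    def delay_map(txn_id):
        result = {}
        for tier, host in TIERS.items():
            r = requests.get(
                f"https://{host}/api/v1/transactions/{txn_id}",  # assumed endpoint
                timeout=5,
            )
            result[tier] = r.json().get("elapsed_ms")
        return result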

R-320: Automated collection of metrics from incident management systems, which could also have direct agents or clients on the devices to collect data and trigger automated steps to determine root cause and fix problems

Priority: Med

R-330: Enable self-remediation and self-healing tools that can use the forensics data and execute a number of steps to solve problems

Priority: High
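
The sketch below illustrates the shape of an R-330 remediation step: a diagnosed condition from forensics data is mapped to an automated action. The conditions and actions shown are examples of our own choosing, not a recommended catalog.

    # Sketch: map diagnosed conditions to automated remediation actions.
    import subprocess

    REMEDIATIONS = {
        "service_down": lambda svc: subprocess.run(
            ["systemctl", "restart", svc], check=True),
        "disk_full": lambda path: subprocess.run(
            ["find", path, "-name", "*.tmp", "-delete"], check=True),
    }

    def remediate(condition, target):
        action = REMEDIATIONS.get(condition)
        if action is None:
            raise ValueError(f"no automated remediation for {condition}")
        action(target)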

R-340: Applications and modules could be created and installed on any open network/compute/storage disaggregated server/device to validate many different security requirements:

  • Software compliance and host-check validation
  • Security patches and vulnerability testing
  • Security automation for policy validation
  • Distributed firewalls and multi-tenancy policy enforcement
  • Isolation compliance validation
  • Internal auditing automation
  • Boundary and use-case compliance requiring separation between environments such as Dev, Test, QA, Prod, etc.
  • Creation of quarantine zones

R-350: There should be a maximum amount of resources within the disaggregated device or server dedicated to additional tools or other apps run for monitoring or management purposes on top of the platform

Priority: High
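
On a Linux host, one simple way to honor R-350 is to cap the resources a collateral agent may consume. The sketch below limits a process's virtual memory with Python's standard resource module; the 1 GiB cap is an assumption, and production systems would more likely use cgroups or similar controls.

    # Sketch: cap the address space available to a monitoring agent process.
    import resource

    ONE_GIB = 1 * 1024 ** 3
    # Restrict this process (e.g., a 3rd-party agent) to 1 GiB of virtual memory.
    resource.setrlimit(resource.RLIMIT_AS, (ONE_GIB, ONE_GIB))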

R-360: Use of either a natively supported application or a 3rd-party application to constantly report statistics and metrics back to the availability-check manager or the system validating availability

Priority: High

System Services

Represents system-level run-time processes integrated within the operating environment that perform higher-level functions than those provided by the kernel. These services often run in privileged mode and are typically responsible for the run-time configuration and management of the system (init systems, monitors, load balancing, etc.).

In this case, the big difference from the Collateral/Complementary services is that specific functions, products, or services can truly be ported with a higher level of control over which system resources are shared or granted, making it possible to create service virtualization and a strong platform for combined infrastructure-as-a-service integration. As an example, you could envision a network TOR switch that also offers firewall and load-balancing capabilities, among others, with many other examples applicable to storage and compute as well.

Some use cases for System Services:

  • (NFV / SFV) Network Function Virtualization may seem oriented to networking, but service virtualization can be any function related to compute or storage as well. NFV is a good use case for demonstrating disaggregation of services: with bare-metal switches now offering more powerful processors, we could install a network function such as an L3 router, firewall, or load balancer (or several of these) on top of a TOR switch, assign the right level of resources, and run these functions in privileged mode to obtain an SLA similar to that of a typical standalone device
  • Virtual services. Similar to the previous case, other services related to compute or storage, or certain peripheral services, could be given a higher priority than if they were collateral apps
  • Service catalog repository. Depending on whether the services are implemented in a centralized or distributed way, the catalog of services can be contained in the System Services area so that the SLA of the services and the correct amount of resources can be assigned to them deterministically
  • Specialized functions. Here, vendors or open-systems creators could build, customize, or develop elements of the solution as processes, daemons, functions, or modules that execute some of the most important functions for the solution to work; examples include routing management, switching management, and storage management
  • Availability management, used by the solution to perform any type of health checking between the parts of the solution, or even synchronization between them
  • Incident management would be specific functionality or a module created to determine problems within the solution and to collect forensics, troubleshoot, gain intelligence on issues, and perhaps repair the implementation
  • Change management would be specific functionality allowing for workflow management, audits, and visualization of any changes made to the system, including how a change can be backed out, configurations backed up, or any change executed through this module
  • Analytics engine. Any of the metrics provided by the other modules and services can be made available to build analytics and business intelligence out of the system, which may include dashboards and other capabilities to mine, filter, and display state, allowing polling of the data
  • Because the service catalog repository or any application can be mapped as existing either in the Collateral Application Services or in the System Services module, the system can determine the performance of the application and map it to the established policy to understand the SLA balance

System Services Requirements

R-370: Capability to monitor application performance and state, and to enforce Service Level Agreements for multiple tenants that may have a presence on the device

Priority: Med

R-380: Support and integration of DevOps and other open orchestration and automated configuration management tools that can provision and apply changes to servers and devices using either an agent-based (client) or agentless model

Priority: High
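
As an example of the agentless model in R-380, an orchestration script could drive an Ansible run over SSH, as sketched below; the inventory and playbook paths are placeholders.

    # Sketch: drive an agentless (SSH-based) Ansible configuration run.
    import subprocess

    subprocess.run(
        ["ansible-playbook", "-i", "inventory/prod", "provision_switchports.yml"],
        check=True,
    )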

R-390: General 3rd-party applications or agents could be adapted, from specific vendors or generic ones, for the following cases:

  • Interoperability and integration with capacity-planning tools, and the capability to poll and query capacity
  • Inventory and asset management, including CMDB-specific agents
  • Integration and interoperability with helpdesk systems to perform automated verifications
  • Disaster recovery tools or agents that can be configured to execute a number of steps or checks, including disaster simulation, high-availability testing, or even live failover
  • Security validation and vulnerability testing
  • Compliance checks and validation of system requirements versus actual state
  • Security on demand, as host-based checks could be performed

Priority: High

Automated Provisioning Service

The Automated Provisioning Service, a.k.a. Zero Touch Provisioning (ZTP), for compute, storage, or network allows systems to automatically boot, get an address, install the right OS, and obtain the right configuration via automation with minimal intervention.

Automated Provisioning Service Requirements

R-400: Support for Zero Touch Provisioning and automated provisioning/configuration of servers/devices, and the capability to deploy specific configurations that build the systems using ZTP, with specific configurations and tools that allow building all of the standard applications a system needs to provide the desired service management and availability model
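
To make the flow concrete, the sketch below shows one hypothetical piece of a ZTP service: an HTTP endpoint that hands a booting device its configuration, keyed by serial number. In practice, DHCP options would point the device at this server; the serial, port, and config content are assumptions.

    # Sketch: a toy ZTP configuration server keyed by device serial number.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    CONFIGS = {"ABC123": "hostname leaf01\ninterface Ethernet1\n  no shutdown\n"}

    class ZTPHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            serial = self.path.strip("/")            # e.g. GET /ABC123
            config = CONFIGS.get(serial)
            self.send_response(200 if config else 404)
            self.end_headers()
            if config:
                self.wfile.write(config.encode())

    HTTPServer(("0.0.0.0", 8080), ZTPHandler).serve_forever()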

 

Appendix A Definitions and Clarification

The expression of general or specific requirements in this document may be articulated using object-state modeling.   Object-state modeling is a technique that enables systems architects and engineers to describe features, capabilities and systemic qualities that are desired in a system.  These needs are expressed along with any constraints that may be imposed by technology, resources, and/or policies governing the development and operational use of the system.  The technique is predicated on the notion that abstract representations of systems and their constituent system elements can be used to model system behavior and interaction with actors (humans and other systems) thereby exposing emergent features, properties and interfaces that are relevant to their function and behavior.  This can help with developing requirements that are articulated in a manner that is understandable, meaningful and testable.

For the purposes of this document, the following definitions and descriptions are introduced:

object – An abstract representation of a tangible or intangible thing.  At a base level, an object can represent a system-of-interest.  At a micro level, an object can represent a system element or constituent component comprising the system-of-interest.  At a macro level, an object can represent a system-of-systems that has relationship to the system-of-interest on a broader scale.  An object can also represent data and its derivatives (information, knowledge, intelligence).  Objects exist in both the physical (tangible) and logical/virtual (intangible) domains.

subject – An abstract representation of an actor (human, machine or other entity operating with intent) that has some relationship with an object.  For systems, this is typically characterized by an access relationship between the subject and object; where the object provides some service to, or on behalf of, the subject.  When subjects interact with objects they affect the state of the object in some way.

object state – A representation of an object’s condition at a given point in time.  Every object has state, and every system of any consequence has many states, including those systems designated as “stateless”.  Object state may change in response to inputs introduced to an object by subjects.

object state element – A data element representing some part of an object state as reported by the object.  An object’s state may be defined by one or more object state elements (a collection of state elements).  An object state element is expressed in terms of a parameter with an explicit value, or in the form of a symbol or expression that represents some collective value derived from multiple parameters (online, offline, restarting, nominal, degraded, fault, panic, overload, hot, full, saturated).

object initial state – The condition of an object representing the starting point in a continuum of conditions experienced by the object over time.  Every object has an initial state.  An object’s initial state may or may not be its default state.  The initial state may be relative to a specific context or perspective of the object that establishes a baseline starting point representing a defined condition along a continuum of conditions bounded by two end-points in time.  An object’s initial state may be preceded by a meta-state that describes the non-existence of the object (i.e., prior to creation).

object current state – The condition of an object at a given reference point in time.

object state transition – A change in an object’s current condition (current state) to another condition (transition state) experienced during a defined period of time (switchport up -> switchport down, RAID re-sync in-progress -> RAID re-sync completed, server shutdown initiated -> server halted).

object terminal state – The condition of an object representing the ending point in a continuum of conditions experienced by the object over time.  Every object has a terminal state (although the terminal state may never be reached).  The terminal state may be relative to a specific context or perspective that establishes an ending point representing a defined condition along a continuum of conditions bounded by two end-points in time.  An object’s terminal state may be followed by a meta-state that describes the non-existence of the object (i.e., after destruction).

system – A combination of interacting elements (physical and/or logical objects) organized to achieve one or more stated purposes.  Within the scope of consideration, such elements represent the system-of-interest (server, network switch, storage appliance).

system element – A constituent member (object) or component of a system (e.g., processor, switchport, disk).

system-of-systems – A system whose system elements are themselves systems (e.g., datacenter, cluster,  server farm, LAMP stack).

interface – A shared physical or logical boundary between two or more objects established to facilitate communication, cooperation and/or collaboration between objects.

protocol – An agreed upon set of rules and methods governing the exchange of data and/or information between objects in order to realize an interface.

With these definitions in mind, the following assertions can be made:

Any management framework requires that somebody (a subject) be able to effect change (a transition) against some thing (an object).

In order for the change to be deterministic, the current condition (current state) of the object needs to be known to some finite resolution by the subject.

The object needs to extend some method and interface in order for the subject to learn or measure the object’s current state such that the inputs needed to effect the desired change can be determined.

The object needs to extend some method and interface in order for the subject to introduce inputs that will induce the object to respond in a way that changes its state to the condition(s) desired by the subject.

In other words, for any given object, methods and interfaces are needed to read its current state and alter (or transition) its current state to another state.  Metaphorically, these can be thought of as the “dials” and “knobs” of a control panel that someone uses to manage a system.  As such, when requirements are developed for any system, the following questions need to be considered:

What subjects (administrators, provisioning systems, management systems) will be initiating changes against a system (an object)?

What objects (servers, networks, storage) will be considered “in scope” for change (the system-of-interest)?

What are the legal states the object(s) can have and what is needed to facilitate transitions between each legal state?

What is needed in a system to respond to attempts by a subject to transition an object to an illegal state?

What is the expected or likely current state(s) of an object at the time a change is initiated?

What are the changes (state transitions) a subject is permitted to effect against a given object or set of objects?

What information (“dials”) about the object’s state is needed by the subject prior to, and after, initiating the change?

What methods (“knobs”) will be used to introduce inputs to the object that will initiate the change?

What capabilities are needed to respond to exceptions, faults and failures experienced by an object?

 

From this, a set of primitive operations that a subject can apply against an object can be described, and from which more sophisticated operations can be derived:

create – the object is instantiated by the subject; transitioning from the meta-state of non-existence to its initial state;

read – the subject observes or interrogates (queries) the object to learn or measure its current state;

modify – the subject introduces inputs to the object that cause it to react in a way that changes its state;

destroy – the object is terminated (or deleted) by the subject; transitioning from its terminal state to the meta-state of non-existence.
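
These four primitives can be modeled directly. The sketch below is one illustrative rendering in Python, with names of our own choosing rather than anything prescribed by this document.

    # Illustrative model of the four primitives a subject applies against an object.
    class ManagedObject:
        def __init__(self):                 # create: meta-state -> initial state
            self.state = "initial"

        def read(self):                     # read: observe the current state
            return self.state

        def modify(self, new_state):        # modify: an input induces a transition
            self.state = new_state

    obj = ManagedObject()                   # create
    obj.modify("online")                    # modify
    print(obj.read())                       # read -> "online"
    del obj                                 # destroy: terminal state -> non-existence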

With the foundation described above, general and specific requirements can be articulated for systems providing compute, network and storage services within an IT infrastructure.