Chris St. John

⚡️ 150+ Solutions Architect metrics/calculations cheatsheet

150+ Solutions Architect metrics and calculations for systems design, technology comparisons, planning, and projects. If there is interest, I'll turn this into tables; I made a few but haven't had time to do them all yet.

Categories: User, Network, Reliability, Compute, Storage, Database, Queues/Events, Security, Cost.

Thanks for checking it out... if you have ideas for improvements, feel free to comment or make a PR on my github repo: https://github.com/csjcode/solutions-architect-metrics-cheatsheet

⭐️ User

  • Daily Active Users (DAU)
    • Unique active users / day
    • Performance and capacity needs, daily, projections.
  • Monthly Active Users (MAU)
    • Unique active users / month
    • Performance and capacity over longer time periods, projections.
  • Concurrent Users, Avg/Max
    • Number of users at same time, average and peaks
    • Reliability, server capacity, quotas, service bottlenecks
  • Actions Per User (APU): Average actions per user
    • Total actions performed by all unique users / number of unique users.
    • Performance, bandwidth, concurrency, cost optimization for high/low microservices.
  • Actions Per User delta (APU Δ)
    • Change in actions per user over a given time period
    • Speed of scalability, concurrency, performance.
  • Daily User Actions (DUA)
    • Total actions performed by all users in a day.
    • System load, performance, cost and capacity growth needs.
  • Requests Per Second (RPS)
    • Service requests per second
    • Reliability in high traffic, quotas, performance.
  • User Delta
    • Change in the number of users over a given time period
    • Speed of scalability, capacity planning, concurrency, performance, cost.
  • Session length
    • Average duration of user session
    • Server resources, cost, performance, concurrency planning.

⭐️ Network

  • CIDR/Subnet Calculation
    • IP address and network mask calculation
    • Number of IPs = 2^(32 - prefix)
    • 10.0.0.0/24 = 2^(32 - 24) = 2^8 = 256
    • For network sizing and planning
  • Bandwidth Consumed/Utilization
    • (Bandwidth used / Total available bandwidth) x 100%
    • Network resource usage, potential bottlenecks, cost, quotas, performance
  • Available Bandwidth
    • Total available bandwidth - bandwidth used.
    • Cost, performance, quotas, adequate resources for network traffic and avoiding slowdowns.
  • Data Transmission Rate
    • Amount of data transferred / time taken
    • Data transfer speed and network efficiency.
  • Link Capacity
    • Maximum bandwidth capacity of a link.
    • Reliability - determines maximum bandwidth available for a link to ensure it can handle traffic.
  • Network Throughput
    • Amount of data transferred / Total time
    • Performance, operations, network efficiency and data transfer speed.
  • Network Latency
    • Delay between sending and receiving data. Round-trip time (RTT)
    • Performance, delay in data transfer and network responsiveness.
  • Network I/O
    • Input/output rate of network data.
    • Warning of network bottlenecks, performance, cost.
  • Request Error rate
    • Percentage of failed requests, (Number of failed requests / Total requests) x 100%
    • Identifies issues in the network or application that need to be addressed.
  • Packet Loss Rate
    • Percentage of lost packets, (Packets lost / Packets sent) x 100%
    • Identifies potential vulnerabilities or issues in the network or application.
  • HTTP response codes
    • Partial list, see full list
    • 1xx info, processing
    • 2xx successful: 200 OK
    • 3xx redirection: 301 Moved Permanently, 302 Found, 304 Not Modified
    • 4xx client error: 400 Bad Request (parameters missing?), 401 Unauthorized (API token missing?), 403 Forbidden (permissions?), 404 Not Found (url incorrect?), 405 Method Not Allowed (eg. POST not accepted on resource), 429 Too Many Requests (exceeded rate limits?), 499 Client Closed Request
    • 5xx server errors: 500 Internal Server Error, 501 Not Implemented, 502 Bad Gateway, 503 Service Unavailable (overload?), 504 Gateway Timeout (upstream timeout?)
  • HTTP Header Fields
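
The bandwidth and error-rate formulas in this list are simple ratios, so they are easy to script. Here is a minimal Python sketch; the function names, the Mbps units, and the sample numbers are my own assumptions for illustration, not values from any provider.

```python
def bandwidth_utilization_pct(used_mbps: float, total_mbps: float) -> float:
    """(Bandwidth used / Total available bandwidth) x 100%."""
    return used_mbps / total_mbps * 100


def available_bandwidth_mbps(total_mbps: float, used_mbps: float) -> float:
    """Total available bandwidth - bandwidth used."""
    return total_mbps - used_mbps


def transmission_rate_mbps(megabits_transferred: float, seconds: float) -> float:
    """Amount of data transferred / time taken."""
    return megabits_transferred / seconds


def request_error_rate_pct(failed_requests: int, total_requests: int) -> float:
    """(Number of failed requests / Total requests) x 100%."""
    return failed_requests / total_requests * 100


# Hypothetical 1 Gbps link that currently carries 250 Mbps of traffic.
print(bandwidth_utilization_pct(250, 1000))
print(available_bandwidth_mbps(1000, 250))
```

If utilization stays high during normal traffic, the link has little headroom left for spikes, which is usually the point of tracking these numbers together.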

⭐️ Reliability

  • Recovery Time Objective (RTO)
    • Maximum acceptable time within which a service must be restored after a disaster.
  • Recovery Point Objective (RPO)
    • Maximum acceptable amount of data loss, measured as the time since the last recoverable point (for example, the last backup) before a disaster, failure, or comparable event
  • Single Point of Failure (SPOF)
    • Component that can cause system failure.
    • 1 - (redundant components/total components).
  • Cloud Services Quotas
    • Limits on usage of cloud services.
    • Possible SPOF, if quota reached, service unavailable/degraded.
  • Availability (% of time)
    • Proportion of time a system is operational
    • uptime / (uptime + downtime)
  • Availability (% of requests)
    • Proportion of valid requests that are served successfully
    • successful requests / valid requests
  • Availability in 9s ("nines")
    • Percentage of uptime in a year.
    • (total time - downtime) / total time
  • Maximum Availability with Dependencies
    • Maximum Availability estimate for multiple services in a distributed system.
    • Availability of Service 1 x Availability of Service 2 x ... x Availability of Service n
    • Availability of a single service = MTBF / (MTBF + MTTR)
  • Maximum Availability with redundant components
    • Maximum Availability estimate with duplicated components (higher reliability)
    • A = 1 - F ≈ 1 - f(1 - a)^(s+1)
    • where s = number of spare components, f = number of failure modes, F = probability of failure, a = availability of a single component (as a fraction)
    • ex: 99.5% availability, with two spares the workload’s availability is A ≈ 1 − (1)(1 − 0.995)^3 = 99.9999875% availability
  • Mean Time to Failure (MTTF)
    • Average time until a component fails.
    • total uptime / number of any failures.
  • Mean Time Between Critical Failures (MTBCF)
    • total uptime / number of critical failures.
  • Mean Time to Data Loss (MTTDL)
    • Average time before data loss.
    • MTTDL = (1 - Annualized Rate of Data Loss) / Annualized Rate of Data Loss
    • Annualized Rate of Data Loss = (Total Data Stored) x (Data Loss Rate)
  • Recovery Time (RT)
    • TTR + MTTD + MTTI + MTTRM, where TTR is Time to Respond.
  • Mean Recovery Time (MRT)
    • Average time it takes to recover from a failure.
    • MRT = ∑RT / Number of incidents
  • Mean Time to Detect (MTTD)
    • Average time it takes to detect a failure.
    • MTTD = ∑time taken to detect incidents / Number of incidents.
  • Mean Time to Identify (MTTI)
    • Average time it takes to identify a failure.
    • MTTI = ∑time taken to identify incidents / Number of incidents.
  • Mean Time to Remediate (MTTRM)
    • Average time it takes to fix a failure.
    • ∑time taken to remediate incidents / Number of incidents.
  • Mean Time to Resolve (MTTR)
    • Average time it takes to resolve a failure.
    • MTTR = ∑time taken to resolve incidents / Number of incidents.
  • Mean Time to Respond (MTTR)
    • Average time it takes to respond to a failure.
    • MTTR = ∑time taken to respond to incidents / Number of incidents.
  • Change Failure Rate (CFR)
    • Rate at which changes introduce failures.
    • CFR = Number of failed changes / Number of changes attempted.
  • Defect Escape Rate
    • Rate at which defects escape detection.
    • DER = Number of defects found after release / Total number of defects.
  • Defect Density
    • Number of defects per unit of code.
    • Number of defects / Size of the software.
  • Failure Rate
    • Rate at which components fail.
    • Number of failures / Unit of time
  • Service restoration time
    • Time taken to restore a failed system.
  • Redundancy
    • (Number of redundant (backup) components / Total number of components) x 100%
  • Resiliency
    • Ability to recover from a failure.
    • Availability x Reliability x Maintainability x Recoverability
    • Availability (% time), reliability (probability of working service), maintainability, and recoverability (MTTR)
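
The availability formulas in this section can be checked with a few lines of Python. This is a rough sketch under my own assumptions: the function names are made up for the example, availabilities are passed as fractions (0.995, not 99.5), and a year is treated as 365 days.

```python
import math
from typing import Sequence

MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def serial_availability(availabilities: Sequence[float]) -> float:
    """Maximum availability when every dependency must be up: the product."""
    return math.prod(availabilities)


def redundant_availability(a: float, spares: int, failure_modes: int = 1) -> float:
    """A ~= 1 - f * (1 - a)^(s + 1) for a component with s spares."""
    return 1 - failure_modes * (1 - a) ** (spares + 1)


def downtime_minutes_per_year(availability: float) -> float:
    """Yearly downtime implied by an availability fraction."""
    return (1 - availability) * MINUTES_PER_YEAR


# Two 99.9% services in series, and the 99.5% example with two spares.
print(serial_availability([0.999, 0.999]))
print(redundant_availability(0.995, spares=2))
print(downtime_minutes_per_year(0.999))
```

Chaining dependencies lowers the ceiling (two 99.9% services give roughly 99.8%), while adding spares raises it, which is why both formulas are worth running before committing to an availability target.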

⭐️ Compute

Many of the following metrics are available in cloud provider analytics services such as AWS CloudWatch, or in service dashboards. They are included here for awareness and as a reminder when evaluating Compute resources.

  • CPU utilization
    • CPU usage rate. Monitor performance & efficiency. Optimize performance.
    • Total CPU time / Elapsed time
  • Disk I/O
    • Disk input/output. Measure read/write speeds for the Compute device. Optimize throughput.
    • Total bytes read/written / Elapsed time
  • Network I/O
    • Network input/output. Measure bandwidth usage on the Compute device. Optimize connectivity.
    • Total bytes sent/received / Elapsed time
  • IOPS
    • Input/Output operations/second. Assess data access performance. Optimize throughput.
    • Total operations / Elapsed time
  • Memory utilization
    • RAM usage rate. Assess RAM usage for performance.
    • Total RAM usage / Total RAM available
  • Caching
    • Data storage/retrieval. Improve data performance.
    • Hit ratio = Hits / (Hits + Misses)
  • Cost
    • Expense management. Estimate resource expenses.
    • Actual cost/Estimated cost
  • Container density
    • Resource utilization. Optimize resource use.
    • Used resources / Total resources
  • Function duration vs. limits
    • Execution time. Gauge execution time.
    • Analytics or (End time - Start time) vs. quota
  • Function concurrency
    • Simultaneous operations. Measure concurrency.
    • Analytics or Number of operations per lambda / Elapsed time
  • Function response time
    • Execution time. Evaluate speed. Gauge execution time.

⭐️ Load Balancing

  • Load Balancing Algorithm
    • Algorithm used by the load balancer to distribute traffic
    • Round Robin, Least Connections, Weighted Round Robin, Weighted Least Connections, Dynamic Least Connections, Source IP Hash, Least Time, Least Packets, Agent-Based Load Balancing, URL Hash, Server Affinity (Sticky Sessions)
  • Request Success Rate
    • Number of successful requests/Total number of requests
  • Latency
    • Time taken to serve a request
  • Error Rate
    • Number of failed requests/Total number of requests
  • Connection Count
    • Number of connections between clients and servers
  • Active Connections
    • Number of active connections between clients and servers
  • Backend Server Health
    • Availability and response time of backend servers
  • SSL Handshake Time
    • Time taken to establish a secure connection
  • Connection Rebalancing Time
    • Time taken to rebalance connections across servers

⭐️ Autoscaling

  • Scaling metric
    • A metric that determines when autoscaling should occur, such as CPU utilization or request count
  • Scaling policy
    • A set of rules that define how autoscaling should occur, such as increasing or decreasing the number of instances based on the scaling metric
    • Target Tracking Scaling
      • Adjusts capacity based on target metrics.
    • Step Scaling
      • Adds or removes capacity based on specific thresholds.
    • Simple Scaling
      • Adds or removes capacity based on logging or CloudWatch alarms.
    • Scheduled Scaling
      • Changes capacity at specific times or dates.
    • Predictive Scaling
      • Uses ML to forecast demand and adjust capacity.
    • Dynamic Scaling
      • Resizes based on changing demand and traffic patterns.
    • Capacity Optimized Scaling
      • Provisions instances for optimal cost and performance.
  • Scale-out threshold
    • The threshold value for the scaling metric that triggers scaling out (adding instances)
  • Scale-in threshold
    • The threshold value for the scaling metric that triggers scaling in (removing instances)
  • Cool-down period
    • The period of time after scaling has occurred during which autoscaling is suspended to prevent rapid scaling up and down

⭐️ Elasticity

  • Resource utilization
    • The percentage of available resources (such as CPU or memory) that are currently in use
  • Capacity planning
    • The process of estimating future resource needs based on historical usage patterns and growth projections
  • Time to Scale
    • The time it takes to add or remove resources to meet demand changes
  • Cost Optimization
    • The process of minimizing costs while maintaining the necessary level of elasticity and performance.

⭐️ Database

  • Throughput
    • The amount of data transferred per unit of time
    • Data transferred / time
  • Latency
    • The time it takes to process a request
    • Time to first byte + time to last byte
  • Response time
    • The time it takes to respond to a request
    • Time to last byte - time to first byte
  • Concurrency
    • The number of simultaneous users or connections
    • Simultaneous requests / time
  • Read-to-Write Ratio
    • The ratio of read requests to write requests
    • Read requests / write requests
  • Cache Hit Rate
    • The percentage of data that is retrieved from cache
    • cache hit rate = cache hits / (cache hits + cache misses)
  • Database Connections
    • The number of active database connections
    • Measured using database monitoring tools.
  • Query performance
    • Time to execute a database query
    • Execution time = end time - start time
  • Index usage
    • How frequently an index is used to retrieve data
    • Index usage = (number of times index is used) / (total number of queries)
  • Lock waits
    • Time spent waiting for a locked database object
    • Lock wait time = total time spent waiting for a lock
  • Deadlocks
    • Occurrences of simultaneous locking conflicts in transactions
    • Deadlocks = number of occurrences of simultaneous locking conflicts
  • Data consistency
    • Degree of uniformity and accuracy in data across systems
    • Data consistency = (number of errors detected / total number of checks) * 100
  • Backup and Recovery
    • Time taken to backup and recover data in case of failure
    • Time taken to backup or recover / number of backups or recoveries
  • Database Size and Growth
    • Total size of the database and its growth rate
    • Current size of database + (growth rate * time interval)
  • CAP Theorem
    • A distributed system can guarantee at most two of the following three properties at the same time:
    • Consistency: Each read request receives the most recent write or an error when consistency can’t be guaranteed.
    • Availability: Each request receives a non-error response, even when nodes are down or unavailable.
    • Partition tolerance: The system operates despite the loss of messages between nodes.
  • ACID
    • Atomicity: All or nothing. Either all operations succeed or all operations fail.
    • Consistency: Data is consistent before and after the transaction.
    • Isolation: Transactions are isolated from each other.
    • Durability: Once a transaction has been committed, it will remain so, even in the event of power loss or system crash.
  • BASE
    • Basically Available
    • Soft state (may be inconsistent for brief periods)
    • Eventually consistent

⭐️ Storage

General Storage metrics

Applies to most storage mediums, including block, file, and object storage.

  • Data Durability
    • Probability of data remaining intact over time
    • (1 - Annual Failure Rate) ^ Years
  • Latency
    • Time for data to be accessed
    • Total time to read or write data
  • Replication Latency
    • Time for replica data to be transferred or accessed
    • Total time to read or write data after transfer
  • IOPS (Input/Output Operations Per Second)
    • Number of read/write operations per second
    • Total number of operations / time interval
  • Throughput
    • Amount of data transferred per unit of time
    • Total data transferred / time interval
  • Maximum Throughput
    • Peak amount of data that can be transferred per unit of time
    • Highest value of (Total data transferred / time interval) that the storage sustains
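
Here is a small Python sketch of the general storage formulas. The durability and IOPS functions follow the list above; the throughput estimate (IOPS times average I/O size) is a common approximation that I am adding as an assumption, and all names and numbers are hypothetical.

```python
def durability(annual_failure_rate: float, years: float) -> float:
    """(1 - Annual Failure Rate) ^ Years."""
    return (1 - annual_failure_rate) ** years


def iops(operations: int, seconds: float) -> float:
    """Total number of operations / time interval."""
    return operations / seconds


def throughput_mb_per_s(iops_value: float, io_size_kb: float) -> float:
    """Assumed approximation: throughput = IOPS x average I/O size."""
    return iops_value * io_size_kb / 1024


# 60,000 operations observed in one minute, with 16 KB per operation.
measured_iops = iops(60_000, 60)
print(measured_iops)
print(throughput_mb_per_s(measured_iops, 16))
print(durability(0.01, 3))
```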

Object storage

  • Object Storage Utilization
    • Amount of object storage used versus total available object storage
    • Used object storage / Total object storage
  • Object Storage Tier Data Stored
    • Length of time objects are stored in a tier before being transferred/deleted
    • Object transfer or deletion execution duration by tier
  • Object Storage API/request Calls
    • API requests made to object storage service
    • API requests / Time interval
    • PUT, COPY, POST, LIST requests (pricing may be different)
    • GET, SELECT, and all other requests
    • Lifecycle Transition requests
    • Data Retrieval requests
  • Object Storage Latency
    • Time taken for object storage service to process a request
    • Total time for requests / Number of requests
  • Data Transfer per time interval
    • Total amount of data transferred
    • Data Transferred (in bytes) / time interval
  • Object Storage Retention
    • Length of time objects are stored before being deleted
    • Object transfer or deletion execution duration
  • Geographic Put/Get requests
    • Latency of Put/Get requests
    • Latency (in milliseconds) / time interval
  • Availability metrics
    • Number of requests that fail
    • Requests that fail / total number of requests
  • Data Consistency metrics
    • Number of objects that are successfully stored/retrieved
    • Number of objects successfully stored/retrieved / time interval
  • Bandwidth used in/out total, timeframe, region, internet, inside cloud provider
    • Amount of data transferred in/out of object storage
    • Data transferred in/out / time interval
    • There may be different policies per cloud provider.

Disk storage

  • Disk Utilization

    • Amount of disk space used versus total available disk space
    • Used disk space / Total disk space
  • Disk IOPS

    • Number of read and write requests to a disk in a second
    • Number of requests / Time interval
  • Disk Latency

    • Time taken for a disk to process a read/write request
    • Total time for read/write requests / Number of requests
  • Data Replication Latency

    • Time taken to replicate data from one location to another
    • Time for replication completion - Time of data creation
  • Data Replication Bandwidth

    • Amount of data replicated per second
    • Amount of data / Time interval
  • SSD Endurance

    • Amount of data that can be written to an SSD before failure
    • Total bytes written / (Drive size in GB * Drive endurance)
  • Disk Utilization

    • Percentage of disk space used
    • (Amount of space used / Total amount of space) x 100
  • RAID Reliability

    • Probability that the RAID will remain operational
    • (1 - Probability of failure) ^ Number of disks
  • Storage Capacity

    • Total amount of storage space available
    • Amount of space used + amount of space available
  • Block Storage IOPS

    • Number of read and write requests to a block storage device in a second
    • Number of requests / Time interval
  • Block Storage Latency

    • Time taken for a block storage device to process a read/write request
    • Total time for read/write requests / Number of requests
  • Types of RAID
    • RAID 0: Data is striped across multiple disks for increased performance, but offers no redundancy.
    • RAID 1: Data is mirrored across two disks for fault tolerance, but offers no performance improvement.
    • RAID 5: Data is striped across multiple disks with parity information stored on each disk for fault tolerance.
    • RAID 6: Similar to RAID 5, but with two sets of parity information for even greater fault tolerance.
    • RAID 10: A combination of RAID 1 and RAID 0, where data is mirrored and striped for both performance and fault tolerance.
    • RAID 50: A combination of RAID 5 and RAID 0, where data is striped across multiple RAID 5 arrays for increased performance and fault tolerance.
    • RAID 60: A combination of RAID 6 and RAID 0, where data is striped across multiple RAID 6 arrays for even greater performance and fault tolerance.
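
To compare RAID levels during sizing, it helps to see how much usable space each one leaves. The sketch below is an illustration under my own assumptions: identical disks, no hot spares, only the basic levels, and hypothetical function names. The survival function applies the RAID Reliability formula from the list, which matches an array where any single disk failure is fatal (for example RAID 0).

```python
def usable_capacity_tb(level: str, disks: int, disk_tb: float) -> float:
    """Usable capacity for an array of identical disks (no hot spares)."""
    minimum_disks = {"0": 2, "1": 2, "5": 3, "6": 4, "10": 4}
    if level not in minimum_disks:
        raise ValueError(f"unsupported RAID level: {level}")
    if disks < minimum_disks[level]:
        raise ValueError(f"RAID {level} needs at least {minimum_disks[level]} disks")
    if level == "0":
        return disks * disk_tb        # striping only, no redundancy
    if level == "1":
        return disk_tb                # every disk holds a full copy
    if level == "5":
        return (disks - 1) * disk_tb  # one disk's worth of parity
    if level == "6":
        return (disks - 2) * disk_tb  # two disks' worth of parity
    if disks % 2:
        raise ValueError("RAID 10 needs an even number of disks")
    return disks // 2 * disk_tb       # mirrored pairs, then striped


def striped_set_survival(disk_failure_probability: float, disks: int) -> float:
    """(1 - Probability of failure) ^ Number of disks."""
    return (1 - disk_failure_probability) ** disks


for raid_level in ("0", "5", "6", "10"):
    print(raid_level, usable_capacity_tb(raid_level, 4, 2.0))
```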

⭐️ Queues/Events

  • Queue Depth
    • Number of events in a queue waiting to be processed.
    • Total events - Processed events
  • Queue Wait Time
    • Amount of time an event spends waiting in a queue before being processed.
    • Total time events spend in queue / Number of events in queue
  • Event Arrival Rate
    • Rate at which events are arriving at a queue.
    • Number of events arriving / Time interval
  • Event Processing Time
    • Amount of time it takes to process an event.
    • Total time spent processing events / Number of events processed
  • Event Processing Rate
    • Rate at which events are being processed.
    • Number of events processed / Time interval
  • Queue Processing Rate
    • Rate at which events are being processed from a queue.
    • Number of events processed from queue / Time interval
  • Queue Time
    • Total time that events spend in a queue, including both wait time and processing time.
    • Queue Wait Time + Event Processing Time
  • Queue Throughput
    • The rate at which events are moving through a queue, including both incoming and outgoing events.
    • Incoming Event Rate + Outgoing Event Rate
  • Event Drop Rate
    • The rate at which events are being dropped or lost, typically due to queue overflow.
    • Number of dropped events / Total number of events
  • Queue Latency
    • The time it takes for an event to travel through a queue, including both wait time and processing time.
    • Queue Time / Number of events
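
The queue formulas above can be combined to answer a practical question: is the backlog growing, and how long would it take to drain? A minimal Python sketch follows. The drain-time estimate is my own addition (depth divided by the surplus of processing rate over arrival rate), and all names and numbers are assumptions for the example.

```python
import math


def queue_depth(total_events: int, processed_events: int) -> int:
    """Total events - Processed events."""
    return total_events - processed_events


def average_wait_seconds(total_wait_seconds: float, events_in_queue: int) -> float:
    """Total time events spend in queue / Number of events in queue."""
    return total_wait_seconds / events_in_queue


def event_drop_rate(dropped_events: int, total_events: int) -> float:
    """Number of dropped events / Total number of events."""
    return dropped_events / total_events


def seconds_to_drain(depth: int, arrival_rate: float, processing_rate: float) -> float:
    """Assumed estimate: backlog / (processing rate - arrival rate), rates per second."""
    surplus = processing_rate - arrival_rate
    if surplus <= 0:
        return math.inf  # consumers are not keeping up, the backlog never drains
    return depth / surplus


backlog = queue_depth(10_000, 9_400)
print(backlog)
print(seconds_to_drain(backlog, arrival_rate=100, processing_rate=120))
```

When the arrival rate is at or above the processing rate the function returns infinity, which is the signal to add consumers or shed load.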

⭐️ Security

  • Network Security Score: a metric that measures the security posture of a network, including factors such as the number of vulnerabilities, exposure to threats, and compliance with security standards.

  • Incident Response Time

    • The amount of time it takes to respond to a security incident
    • Detection Time + Response Time + Mitigation Time
    • Time Detected - Time Reported
  • Risk Assessment Score

    • A numerical score based on a risk assessment methodology such as the NIST Risk Management Framework
    • (risk rating x probability of risk) + (residual risk x probability of residual risk)
    • Number of Potential Risks / Number of Acceptable Risks
  • Attack Surface

    • The total number of entry points or attack vectors available to attackers
    • Attack surface = sum of (threats x vulnerabilities)
  • Vulnerability Assessment Score

    • A numerical score based on a vulnerability assessment methodology such as CVSS
    • Common Vulnerability Scoring System (CVSS) calculator: https://www.first.org/cvss/calculator/3.1
    • Potential Vulnerabilities/Number of Acceptable Vulnerabilities
    • CVSS v3.1 metric groups: Base, Temporal, and Environmental (the Temporal and Environmental metrics adjust the Base score rather than being added to it)
  • Access Control Effectiveness

    • The ability of the access control system to protect the system from unauthorized access
    • (Authenticated Access Attempts - Unauthorized Access Attempts) / Authenticated Access Attempts
    • Number of Access Control Rules/Number of Access Control Rules Enforced
  • Authentication Effectiveness

    • The ability of the authentication system to accurately identify and authenticate users
    • Authentication Effectiveness = (Authenticated Users - Unauthenticated Users) / Authenticated Users
  • Authorization Effectiveness

    • The ability of the authorization system to accurately authorize users to access resources
    • Number of Authorization Controls/Number of Authorization Controls Enforced
  • Security Audit Log Analysis

    • The ability of the security audit system to detect, monitor, and analyze security events
    • (Audited Events - Unaudited Events) / Audited Events
    • Number of Security Events Detected/Number of Security Events Recorded
  • Security Incident Rate

    • Rate of security incident per time interval
    • Security incidents / time
  • Security Compliance Score

    • Score of how well the system complies with security policies and best practices
    • Compliance score = (number of compliant components / total number of components) x 100
  • Security Training Effectiveness

    • Measurement of how well users understand and adhere to security policies
    • Training effectiveness = (number of users who successfully complete security trainings / total number of users) x 100
  • Vulnerability Scanning Frequency

    • Measurement of how often the system is tested for security vulnerabilities
    • Scanning frequency = (number of scans performed in a given time period / total time period)
  • Identity and Access Management (IAM) roles and permissions audit

    • Measurement of the accuracy and security of IAM roles and permissions
    • IAM audit = (number of correct roles and permissions / total number of roles and permissions) x 100
  • Key Management Service (KMS) usage and audit

    • Measurement of the accuracy and security of KMS usage
    • KMS audit = (number of correct KMS usage / total number of KMS usage) x 100
  • Security Information and Event Management (SIEM) alerts and monitoring

    • Measurement of the accuracy and security of SIEM alerts
    • SIEM monitoring = (number of correctly triggered alerts / total number of alerts) x 100
  • Threat intelligence feeds integration and usage

    • Measurement of how threat intelligence feeds are used to help identify and respond to security threats
    • Number of threat intelligence feeds used / total number of threat intelligence feeds available
  • Encryption key rotation frequency

    • Measurement of how often cryptographic keys are changed
    • Time interval between key changes
  • Compliance posture

    • Measurement of how well the system complies with a security standard
    • Number of security standards met / total number of security standards
  • Network traffic monitoring and analysis

    • Measurement of the ability to monitor and analyze network traffic for suspicious activity
    • Amount of network traffic monitored / total network traffic
  • User behavior analytics (UBA) and anomaly detection

    • Measurement of the ability to detect anomalous user behavior
    • Number of anomalies detected / total user behavior events
  • Data Loss Prevention (DLP)

    • The practice of preventing sensitive data from leaving the organization
    • DLP = implementation of policies and technologies + monitoring of user activity
  • Data Encryption

    • The practice of transforming sensitive data into an unreadable format
    • Data encryption = implementation of encryption algorithms + encryption of data

⭐️ Cost

I am only going to give some brief metrics on Cost, because almost everything above can affect cost and it will vary a lot between providers.

This is not to minimize cost, it's one of the most important factors!!!

Just that cost should be considered on ALL the metrics above.

  • Total Cost of Ownership (TCO) = (cost of acquisition + cost of operation + cost of maintenance) over the useful life of the asset
  • Cost per transaction = total cost / number of transactions
  • Cost per unit of time = total cost / time period
  • Cost per user = total cost / number of users
  • Return on Investment (ROI) = (gain from investment - cost of investment) / cost of investment
  • Cost of Downtime (CoD) = (lost revenue + recovery costs + damage to brand reputation) / total downtime hours
  • Cost of Poor Quality (CoPQ) = (internal failure costs + external failure costs + cost of appraisal + cost of prevention) / total number of units produced
  • Cost of Delay (CoD) = (value of time saved by earlier release - cost of delay) / time saved
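
Several of the cost formulas above are easy to turn into a quick calculator. This is a sketch only; the function names, the yearly split of operating and maintenance costs, and the dollar amounts are assumptions I picked for the example.

```python
def total_cost_of_ownership(acquisition: float, operation_per_year: float,
                            maintenance_per_year: float, useful_life_years: int) -> float:
    """Acquisition + operation + maintenance over the useful life of the asset."""
    return acquisition + (operation_per_year + maintenance_per_year) * useful_life_years


def cost_per_user(total_cost: float, users: int) -> float:
    """Total cost / number of users."""
    return total_cost / users


def return_on_investment(gain: float, cost: float) -> float:
    """(Gain from investment - cost of investment) / cost of investment."""
    return (gain - cost) / cost


def cost_of_downtime_per_hour(lost_revenue: float, recovery_costs: float,
                              reputation_damage: float, downtime_hours: float) -> float:
    """(Lost revenue + recovery costs + brand damage) / total downtime hours."""
    return (lost_revenue + recovery_costs + reputation_damage) / downtime_hours


tco = total_cost_of_ownership(100_000, 20_000, 5_000, 4)
print(tco)
print(cost_per_user(tco, 50_000))
print(return_on_investment(150_000, 100_000))
```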

There are a ton of cloud cost tools but these are some of the popular ones on the biggest platforms (there are many more if you search on their sites):

  • AWS
    • AWS Cost Explorer
    • AWS Budgets
    • AWS Trusted Advisor
  • Azure
    • Azure Cost Management + Billing
    • Azure Advisor
    • Azure Service Health
  • GCP
    • GCP Billing
    • GCP Pricing Calculator
    • GCP Cost Management
  • Third-party
    • CloudCheckr
    • CloudHealth by VMware
    • Apptio Cloudability
    • CloudBolt
    • CloudZero

⭐️ References

Thanks for checking it out... if you have ideas for improvements, feel free to comment or make a PR on my github repo: https://github.com/csjcode/solutions-architect-metrics-cheatsheet

· 23 min read
Chris St. John

This list is a starting point for estimates!

  • My goal is to provide an easy reference for initial calculations.
  • Many metrics have variations. I didn't have time/space to list all of them! Feel free to suggest important additions.
  • Some metrics are more useful than others. All have limitations. Customize to your use case.
  • This is for technical considerations, not marketing metrics (I did not cover them).

6 User Metrics for Growth/Projections

Most of these user calculations relate to right-sizing the app: making sure we have the resources needed to support the number of users we expect to have.

You can get the data for these metrics from your analytics tools, such as log analysis, JavaScript tracing, market analytics/research, and/or simulated tests. Log or usage data from other related applications may also help.

Remember, the more accurate your projections are, the better you will be able to plan resources effectively for performance, reliability and cost.

1. Daily Active Users (DAU)

The number of unique users who have engaged with the app in a day. Also, it can be a sum of all days in a period and expressed as an average daily number for that period of time.

DAU = Unique users / day

DAU avg = Unique users summed during time period / time period

(if daily avg. for multiple days, then divide the sum of DAUs each day by the number of days)

DAU helps estimate resource requirements. However, it's of limited help for specific usage patterns like spikes, or heavier traffic days vs. low traffic. So it's best used only as a starting point for rough estimates.

It can be an average daily value over some period of days. For example looking at 3 days, if an app had 10,000 unique users on Monday, 12,000 unique users on Tuesday, and 7,000 unique users on Wednesday, then the DAU sum for those three days in the week so far would be 29,000, so the DAU average would be 29,000 / 3 = 9,667 DAU.
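
The three-day example above can be reproduced in a couple of lines of Python. The function name is just an illustration.

```python
def average_dau(daily_unique_users: list) -> float:
    """Sum of the DAU values for each day / number of days."""
    return sum(daily_unique_users) / len(daily_unique_users)


# Monday, Tuesday, Wednesday from the example.
dau_by_day = [10_000, 12_000, 7_000]
print(round(average_dau(dau_by_day)))  # 9667
```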

2. Monthly Active Users (MAU)

MAU is the number of unique users who have engaged with the app in a given month.

It differs from DAU in that each user is counted only once per month, no matter how many days they were active, so MAU is not the sum of the daily DAU values.

MAU = Unique monthly users / month

MAU at end of year = MAU at start of year * (1 + monthly growth rate)^number of months

Cumulative MAU Growth projection = (MAU at end of year - MAU at start of year) / MAU at start of year * 100

MAU is useful when you expect each unique user to have a fixed requirement such as storage, or a specific amount of data to be processed.

Make sure to use unique users for MAU. If using marketing analytics data, this number is usually available using cookies and other tracking id means. Also, you can use a javascript tracking tool like Google Analytics or New Relic Browser to get this data.
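
The MAU growth projection in this section is a compound growth calculation. Here is a minimal sketch, assuming the growth rate is a monthly rate expressed as a fraction; the function names and the 5% figure are hypothetical.

```python
def project_mau(start_mau: float, monthly_growth_rate: float, months: int) -> float:
    """MAU at start * (1 + growth rate)^number of months."""
    return start_mau * (1 + monthly_growth_rate) ** months


def cumulative_growth_pct(start_mau: float, end_mau: float) -> float:
    """(MAU at end - MAU at start) / MAU at start * 100."""
    return (end_mau - start_mau) / start_mau * 100


# 50,000 MAU growing 5% per month for 12 months.
end_of_year = project_mau(50_000, 0.05, 12)
print(round(end_of_year))
print(round(cumulative_growth_pct(50_000, end_of_year), 1))
```

With these inputs the projection lands at about 89,793 MAU, close to 80% cumulative growth, which shows how quickly a modest monthly rate compounds.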

3. Concurrent Users

The number of concurrent users is the number of users who are using the app at the same time.

Concurrent users = Active user sessions / time period

While this is a useful metric, it can be difficult to get an accurate number/sample for different scenarios/times. You may need to do some log analysis based on time period examples or use other analytics tools to get this data.

This is a highly useful metric if you are trying to estimate capacity to handle traffic spikes, bottlenecks, throughput, service quotas and scalability of the app.

Traffic patterns can greatly affect your app. Think of a special event, or a holiday, or a new product launch. The app may work well with 2,000 concurrent users on a normal day but spiking to 20,000 in a short period of time can cause major problems!

While you may have had enough capacity by DAU/MAU standards, you may not have enough capacity to handle the concurrent traffic spikes.
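One way to do the log analysis mentioned above is to sweep over session start and end times and track how many sessions are open at once. This is a sketch that assumes you can extract `(start, end)` timestamps per session from your logs; the sample sessions are hypothetical.

```python
def peak_concurrent(sessions):
    """Return the maximum number of sessions open at the same instant.

    `sessions` is an iterable of (start, end) timestamps with start < end.
    """
    changes = []
    for start, end in sessions:
        changes.append((start, 1))   # session opens
        changes.append((end, -1))    # session closes
    # Sort by time; at equal times process closes (-1) before opens (+1)
    # so back-to-back sessions are not counted as overlapping.
    changes.sort(key=lambda change: (change[0], change[1]))

    current = peak = 0
    for _, delta in changes:
        current += delta
        peak = max(peak, current)
    return peak


# Hypothetical sessions as (start_second, end_second) pairs.
sessions = [(0, 60), (30, 90), (45, 120), (60, 100), (200, 260)]
print(peak_concurrent(sessions))
```

Running the same function over a normal day and over a launch or holiday window gives the two numbers to compare when sizing for spikes.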

4. Actions Per User (APU)

Actions Per User (APU) is a metric measuring the average number of actions taken by each user within a specific time period.

APU = Total number of actions / total number of unique users

This calculation is useful for many reasons. For capacity planning it can help determine bandwidth, storage, read/write capacity etc.

You may wish to calculate the average resource cost per action in an effort to determine how to reduce costs. For example, if you have a cost of $0.02 per action, then you can estimate the cost of the app by multiplying the cost per action by the APU and by the number of users. Then look for ways to reduce the per-action cost.
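The APU and cost estimate can be sketched as follows. The $0.02 per action comes from the example above; the action and user counts are hypothetical.

```python
def actions_per_user(total_actions: int, unique_users: int) -> float:
    """Average number of actions per unique user."""
    return total_actions / unique_users


def estimated_cost(cost_per_action: float, apu: float, unique_users: int) -> float:
    """Estimated cost = cost per action * actions per user * number of users."""
    return cost_per_action * apu * unique_users


# Hypothetical month: 120,000 actions from 8,000 unique users.
apu = actions_per_user(total_actions=120_000, unique_users=8_000)
monthly_cost = estimated_cost(cost_per_action=0.02, apu=apu, unique_users=8_000)
print(f"APU: {apu}, estimated cost: ${monthly_cost:,.2f}")
```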

5. Daily User Actions (DUA)

Measures the number of actions taken by users on a website or mobile app within a specific time period, typically per day. It is used to measure engagement and activity on a website or mobile app.

DUA = Total Number of Actions / Total Number of Days

6. Requests Per Second (RPS)

Requests Per Second (RPS) measures the number of requests made to an application or website within a specific time period, typically per second.

It is a good metric to look at for individual service requests, such as a web page request, or a database query.
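A minimal sketch for deriving average and peak RPS from request timestamps, assuming the timestamps are epoch seconds pulled from an access log. The sample values are made up.

```python
from collections import Counter


def rps_stats(request_times):
    """Return (average_rps, peak_rps) for request timestamps in epoch seconds."""
    if not request_times:
        return 0.0, 0
    # Bucket requests by whole second to find the busiest second.
    per_second = Counter(int(t) for t in request_times)
    window = int(max(request_times)) - int(min(request_times)) + 1
    return len(request_times) / window, max(per_second.values())


# Hypothetical timestamps: a burst in the first second, then a quiet tail.
times = [100.1, 100.2, 100.9, 101.5, 103.0, 104.7]
average, peak = rps_stats(times)
```

The gap between average and peak is usually the more useful number for quotas and rate limits, since averages hide bursts.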

10 Network Calculations

note: This is assuming a host provider has been selected and some network infrastructure is already in place. These do NOT include all telco/router infrastructure.

If you are building a network from scratch then you will need many more considerations outside the scope of this list.

1. CIDR/Subnet Calculation

This is the subnet calculation for a given CIDR prefix and number of IPs available.

Number of IPs = 2^(32 - prefix)

(where prefix is the number after the slash in the CIDR notation)

Examples:

  • 10.0.0.0/28 = 2^(32-28) = 2^4 = 16 IPs available
  • 10.0.0.0/26 = 2^(32-26) = 2^6 = 64 IPs available
  • 10.0.0.0/24 = 2^(32-24) = 2^8 = 256
  • 10.0.0.0/20 = 2^(32-20) = 2^12 = 4,096
  • 10.0.0.0/16 = 2^(32-16) = 2^16 = 65,536
  • 10.0.0.0/12 = 2^(32-12) = 2^20 = 1,048,576

Most cloud providers reserve at least the first and last IP in a subnet for internal use. In that case you would have 14, 62, 254, 4,094, 65,534, or 1,048,574 IPs available for the prefixes above. Some providers reserve more; AWS, for example, reserves five addresses in each subnet.

Also, some providers may further restrict your range for smallest or biggest. For example, AWS may restrict you to a range of /28 to /16. (check for current restrictions)
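The subnet sizes above can be checked with Python's standard `ipaddress` module. This is a small sketch; the `subnet_sizes` helper and its default of 2 reserved addresses are assumptions for illustration, so adjust `reserved` to whatever your provider actually holds back.

```python
import ipaddress


def subnet_sizes(cidr: str, reserved: int = 2):
    """Return (total addresses, addresses left after provider reservations)."""
    network = ipaddress.ip_network(cidr)
    total = network.num_addresses  # equals 2 ** (32 - prefix) for IPv4
    return total, max(total - reserved, 0)


for cidr in ["10.0.0.0/28", "10.0.0.0/26", "10.0.0.0/24", "10.0.0.0/20", "10.0.0.0/16"]:
    total, usable = subnet_sizes(cidr)
    print(f"{cidr}: {total:,} total, {usable:,} usable")
```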

2. Bandwidth Consumed/Utilization

Depending on your system design, this can help determine the necessary capacity of network components such as routers, switches, and firewalls.

Bandwidth consumed = (Total data transferred / Total time)

Bandwidth Utilization = (Data transmitted in a certain time period / Total available bandwidth) * 100%

Accurate bandwidth projections help in monitoring network performance and identifying bottlenecks.

This is especially important for high-bandwidth activities like streaming video or audio and gaming.

It can also help in defining Service Level Agreements (SLAs) with service providers.
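Here is a small sketch of the consumed and utilization formulas. It assumes transfer volume is measured in bytes and link capacity in megabits per second, which is where unit mistakes tend to creep in. The 45 GB and 1 Gbps figures are hypothetical.

```python
def bandwidth_consumed_mbps(bytes_transferred: int, seconds: float) -> float:
    """Average bandwidth in megabits per second (1 Mb = 1,000,000 bits)."""
    return bytes_transferred * 8 / seconds / 1_000_000


def bandwidth_utilization_pct(consumed_mbps: float, capacity_mbps: float) -> float:
    """Consumed bandwidth as a percentage of total available bandwidth."""
    return consumed_mbps / capacity_mbps * 100


# Hypothetical: 45 GB moved in one hour over a 1 Gbps (1000 Mbps) link.
consumed = bandwidth_consumed_mbps(45 * 10**9, 3600)
utilization = bandwidth_utilization_pct(consumed, 1000)
print(f"{consumed:.0f} Mbps consumed, {utilization:.0f}% utilization")
```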

3. Available Bandwidth

Available bandwidth = (Total network capacity - Total bandwidth consumed)

This can be used to determine how much bandwidth is available for other activities.

Bandwidth can be affected by many factors such as physical infrastructure limits, Quality of Service (QoS) issues, network congestion, network traffic, software/hardware limitations, encryption overhead, provider constraints, and network latency.

4. Data Transmission Rate

This formula calculates the rate at which data is transmitted over a network link. The formula is:

Data Transmission Rate = Data size / Transmission time

This is also particularly useful for high bandwidth activities like streaming video or audio and gaming.

5. Link Capacity

This formula calculates the capacity of a network link. The formula is:

Link Capacity = Bandwidth * Data transmission rate

6. Network Throughput

There are a couple of variations of this formula. The most common are:

Network Throughput = Total Data Transferred / Total Time Elapsed

Network Throughput = Data size / Transmission time

For example, if a system transferred 1000 bytes of data in 2 seconds, the throughput would be 1000 bytes / 2 seconds = 500 bytes per second.

7. Network Latency

There are several ways to measure this but here are a couple popular ones:

Network latency = Total response time - Server processing time

One-way network latency = Transmission time + Propagation time + Queuing delay

Round Trip Time (RTT) = One-way latency * 2 (assuming a symmetric path)

Jitter = Maximum Latency - Minimum Latency
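The RTT and jitter formulas can be applied to a set of latency samples like this. It is a sketch that assumes one-way latency measurements in milliseconds and a symmetric path; the sample values are hypothetical.

```python
from statistics import mean


def latency_summary(one_way_ms):
    """Summarize one-way latency samples given in milliseconds."""
    avg = mean(one_way_ms)
    return {
        "avg_latency_ms": avg,
        "approx_rtt_ms": avg * 2,  # assumes the return path matches the outbound path
        "jitter_ms": max(one_way_ms) - min(one_way_ms),
    }


samples = [20, 22, 19, 35, 24]  # hypothetical measurements
summary = latency_summary(samples)
print(summary)
```

Max minus min is the simplest jitter definition and is sensitive to a single outlier, so it is worth looking at percentiles as well when the sample set is large.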

8. Network I/O

The number of bytes sent and received by the instance's network interfaces.

Packet counts can also be used: the number of packets sent and received by the instance's network interfaces.

9. Request Error rate

The number of requests that returned an error response. As a rate, divide it by the total number of requests.

Other metrics can also be used, such as status check failures (the number of times the instance's status check has failed).

10. Error response codes

  1. 3xx Redirection Codes: 3xx status codes are not errors; they indicate that the request needs further action from the client in order to be fulfilled. Some common 3xx status codes include:
    • 301 Moved Permanently
    • 302 Found
    • 304 Not Modified
  2. 4xx Error Codes: 4xx status codes indicate that there is an error with the client's request. Some common 4xx status codes include:
    • 400 Bad Request - often due to missing or malformed parameters
    • 401 Unauthorized - often due to a missing or invalid API/user token
    • 403 Forbidden - often a permissions issue
    • 404 Not Found - the requested resource does not exist
    • 429 Too Many Requests - often due to exceeding rate limits
    • 499 Client Closed Request - a non-standard code used by nginx when the client closes the connection before the server responds, often due to a client-side timeout
  3. 5xx Error Codes: 5xx status codes indicate that there is an error with the server. Some common 5xx status codes include:
    • 500 Internal Server Error - often a server-side issue
    • 501 Not Implemented - the server does not support the functionality required to fulfill the request
    • 502 Bad Gateway - the server was acting as a gateway or proxy and received an invalid response from the upstream server
    • 503 Service Unavailable - the server is currently unavailable (overloaded or down)
    • 504 Gateway Timeout - the gateway or proxy timed out waiting for a response from the upstream server
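To tie the status code classes back to the request error rate, here is a minimal sketch that groups response codes by class and computes the share of 4xx and 5xx responses. The list of codes is a hypothetical log sample.

```python
from collections import Counter


def status_class(code: int) -> str:
    """Map an HTTP status code to its class, e.g. 404 -> '4xx'."""
    return f"{code // 100}xx"


def error_rate(codes) -> float:
    """Fraction of responses that are 4xx or 5xx."""
    if not codes:
        return 0.0
    errors = sum(1 for code in codes if code >= 400)
    return errors / len(codes)


codes = [200, 200, 301, 404, 200, 500, 429, 200, 503, 200]  # hypothetical log sample
by_class = Counter(status_class(code) for code in codes)
rate = error_rate(codes)
print(dict(by_class), f"error rate: {rate:.0%}")
```

It is often worth tracking 4xx and 5xx rates separately, since the first usually points at clients and the second at your own services.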

Reliability

Mean Time Between Failures (MTBF)

This metric measures the average time that a system operates without a failure.

MTBF = Total Operating Time / Number of Failures

A high MTBF value indicates a more reliable system.

Mean Time to Repair (MTTR)

The average time that it takes to repair a failed system, calculated as

MTTR = Total Repair Time / Number of Failures.

A low MTTR value indicates a more efficient and effective process for repairing failed systems.

Availability

This metric measures the percentage of time that a system is available for use, calculated as (Total Operating Time / Total Elapsed Time). High availability is important to ensure that your systems are functioning optimally and delivering the required level of service to users.

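Availability (as %) = (Available for use time / Total time) * 100

Availability = MTTF / (MTTF + MTTR)

Here MTTF (Mean Time To Failure) is the average operating time before a failure. When MTBF is computed from operating time only, as defined above, it can be used in place of MTTF.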

Availability is usually expressed in % or the shorthand of 9s.
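The MTBF, MTTR, and availability formulas fit together as in the sketch below. It treats the mean operating time between failures (MTBF as defined above) as the uptime term in the availability formula. The quarterly figures are hypothetical.

```python
def mtbf(total_operating_hours: float, failures: int) -> float:
    """Mean operating time between failures."""
    return total_operating_hours / failures


def mttr(total_repair_hours: float, failures: int) -> float:
    """Mean time to repair a failure."""
    return total_repair_hours / failures


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability as a fraction between 0 and 1."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical quarter: 2,160 operating hours, 4 failures, 8 hours of repair in total.
uptime_between_failures = mtbf(2160, 4)
repair_time = mttr(8, 4)
print(f"Availability: {availability(uptime_between_failures, repair_time):.4%}")
```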

Availability in 9s ("nines")

Availability Time in 9s = (amount of time a system is available to users / total time) * 100%

For example, a system that was available for 8 hours and unavailable for 2 hours:

Availability Time in 9s = (8 hours / (8 hours + 2 hours)) * 100% = 80%

According to Amazon AWS documentation:

| Availability | Maximum Unavailability (per year) | Application Categories |
| --- | --- | --- |
| 99% | 3 days 15 hours | Batch processing, data extraction, transfer, and load jobs |
| 99.9% | 8 hours 45 minutes | Internal tools like knowledge management, project tracking |
| 99.95% | 4 hours 22 minutes | Online commerce, point of sale |
| 99.99% | 52 minutes | Video delivery, broadcast workloads |
| 99.999% | 5 minutes | ATM transactions, telecommunications workloads |
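The downtime figures follow directly from the availability percentage. A quick sketch, assuming a 365-day year:

```python
def max_downtime_minutes_per_year(availability_pct: float, days_per_year: int = 365) -> float:
    """Allowed downtime per year, in minutes, for an availability percentage."""
    minutes_per_year = days_per_year * 24 * 60
    return minutes_per_year * (1 - availability_pct / 100)


for target in (99.0, 99.9, 99.95, 99.99, 99.999):
    minutes = max_downtime_minutes_per_year(target)
    hours, rem = divmod(minutes, 60)
    print(f"{target}%: about {int(hours)} h {int(rem)} min per year")
```

The same function works for a monthly budget if you pass the number of days in the month as `days_per_year`, although the parameter name then becomes a misnomer.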

Availability with Dependencies

(source: AWS docs) Availability = Availability of App * Availability of Dependency 1 * Availability of Dependency 2 * ... * Availability of Dependency n

Calculating availability with redundant components.

(source: AWS docs) Avail_effective = Avail_MAX − ((100% − Avail_dependency) × (100% − Avail_dependency))

Shortcut calculation: If the availabilities of all components in your calculation consist solely of the digit nine, then you can sum the count of nines to get your answer. For example, two redundant, independent components with three nines availability (99.9% each) result in six nines (99.9999%).
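Both dependency formulas can be sketched in a few lines. The version below works with fractions between 0 and 1 instead of percentages, takes the maximum availability as 1.0, and generalizes the redundancy formula to any number of independent copies. Those choices, and the sample availabilities, are assumptions for illustration.

```python
from math import prod


def serial_availability(*availabilities: float) -> float:
    """Availability of an app whose hard dependencies must all be up."""
    return prod(availabilities)


def redundant_availability(component: float, copies: int = 2) -> float:
    """Availability of independent redundant copies where any one suffices."""
    return 1 - (1 - component) ** copies


# Hypothetical app at 99.99% with two hard dependencies at 99.9% each.
print(f"{serial_availability(0.9999, 0.999, 0.999):.4%}")
# Two redundant, independent components at three nines each.
print(f"{redundant_availability(0.999, copies=2):.6%}")
```

Hard dependencies pull availability down, while redundancy pulls it up, which is why the two are usually combined in practice.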

Service restoration time

The time it takes to restore a service after a failure.

This is important because it helps estimate the impact of a failure on the business.

For example, if a critical server is likely to need rebooting once per quarter

Recovery Time Objective (RTO)

RTO is the maximum acceptable delay between the interruption of service and restoration of service. This determines what is considered an acceptable time window when service is unavailable. RTO is related to MTTR (Mean Time to Repair/Recovery), but RTO is a target you set while MTTR is an average you measure.

Recovery Point Objective (RPO)

RPO is the maximum acceptable amount of time since the last data recovery point. This determines what is considered an acceptable loss of data between the last recovery point and the interruption of service.

Error Rate

This metric measures the proportion of requests that result in errors, calculated as (Number of Errors / Total Requests). High error rates can indicate issues with the system's design or implementation, and it's important to monitor this metric to ensure that your systems are functioning correctly.

Single Point of Failure (SPOF)

This metric measures the likelihood that a single component failure will cause a system to fail, calculated as (Number of Critical Components without redundancy / Number of Components in the System). A high SPOF value indicates that a system is highly vulnerable to failure, and it's important to design systems with a low SPOF value to ensure high reliability.

Redundancy

This metric measures the level of redundancy in a system, calculated as (Number of Redundant Components / Total Number of Components). A high redundancy value indicates that a system is well-designed to withstand failures, and it's important to design systems with a high level of redundancy to ensure high reliability.

Capacity Utilization

This metric measures the utilization of system resources, calculated as (Total Utilization / Total Available Capacity). High capacity utilization can indicate that a system is overburdened and may not be able to perform optimally, so it's important to monitor this metric to ensure that your systems are functioning correctly.

Compute

IOPS (Input/Output Operations Per Second)

This refers to the number of read and write operations that can be performed by a storage device in one second. This is an important metric for storage arrays used in high-performance computing or database applications.

  • Divide the total number of I/O operations performed by the time taken to perform those operations. For example, if a storage system performs 1000 I/O operations in 5 seconds, the IOPS would be 200 (1000 operations / 5 seconds = 200 IOPS). IOPS is expressed in operations per second (ops/s).

  • Memory utilization: The percentage of available memory that is in use by the instance.

    • Divide the amount of memory used by the total amount of memory available in the instance. For example, if an EC2 instance has 4 GB of memory and is currently using 2 GB, the memory utilization would be 50% (2 GB / 4 GB = 50%).

  • CPU utilization: The percentage of allocated compute units that are currently in use by the instance.

    • Divide the CPU time used by the total CPU time available to the instance over the same period. For example, if an EC2 instance has 4 vCPUs and used 2 vCPU hours during a 1-hour period, the CPU utilization would be 50% (2 vCPU hours / 4 vCPU hours available = 50%).
  • Concurrent connections: The number of concurrent connections to the compute instance.

    • To calculate concurrent connections, count the number of active connections established between clients and the server at a given time. For example, if there are 10 active connections open to the server at a given time, the number of concurrent connections would be 10.
  • Load average: A metric that indicates the average system load over a specified time period.

    • Load average is the average number of processes running or waiting for CPU time over a specified time period (on Linux, it also counts processes in uninterruptible I/O wait). It is typically reported over 1-minute, 5-minute, and 15-minute periods. For example, a 1-minute load average of 1.0 means that, on average, 1 process was running or waiting for CPU time over the past minute, which fully occupies a single-core machine.
  • CPU credits: The number of CPU credits consumed by the instance.

  • CPU balance: The number of CPU credits remaining for the instance.

  • Lambda invocations: The number of times that the Lambda function has been invoked.

  • Lambda duration: The amount of time that the Lambda function code spends processing an event, measured per invocation.

  • Concurrent executions: The number of executions of your Lambda function that are currently running. This metric can be used to track the load on your function and ensure that it has sufficient capacity to handle incoming requests.

  • Throttles: The number of times that the Lambda service has throttled the execution of your function. This can occur when your function exceeds the maximum number of allowed concurrent executions.

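  • Dead Letter Errors: For asynchronous invocations, the number of times that Lambda tried to send an event to the dead-letter queue but failed. A non-zero value means failed events may be getting lost, for example because of missing permissions or a misconfigured queue.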
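The IOPS, memory utilization, and CPU utilization calculations from this section can be sketched as plain functions. The input figures are the ones used in the examples above; the 1-hour measurement window for CPU is an added assumption so that the units line up.

```python
def iops(io_operations: int, seconds: float) -> float:
    """I/O operations per second."""
    return io_operations / seconds


def utilization_pct(used: float, available: float) -> float:
    """Generic used/available ratio as a percentage."""
    return used / available * 100


def cpu_utilization_pct(vcpu_hours_used: float, vcpus: int, window_hours: float) -> float:
    """CPU utilization over a window: used vCPU-hours / available vCPU-hours."""
    return utilization_pct(vcpu_hours_used, vcpus * window_hours)


print(iops(1000, 5))                 # 1000 operations in 5 seconds
print(utilization_pct(2, 4))         # 2 GB used of 4 GB memory
print(cpu_utilization_pct(2, 4, 1))  # 2 vCPU-hours on 4 vCPUs over an assumed 1-hour window
```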

Queue/Event Processing

  • Queue depth: The number of messages or events that are currently waiting to be processed in the queue. This metric can be used to monitor the load on your system and ensure that it has sufficient capacity to handle incoming requests.

  • Retries: The number of times that a message or event has been retried due to a processing error. This metric can be used to monitor the reliability of your system and ensure that messages or events are being processed successfully.

  • Dead letters: The number of messages or events that have been moved to a dead-letter queue due to a processing error. This metric can be used to monitor the reliability of your system and ensure that messages or events are being processed successfully.

  • Consumer count: The number of consumers that are currently consuming messages or events from the queue. This metric can be used to monitor the load on your system and ensure that it has sufficient capacity to handle incoming requests.

  • Consumer utilization: The percentage of time that consumers are actively processing messages or events. This metric can be used to monitor the utilization of your system's resources and identify any potential performance bottlenecks.

  • Consumer Request Rate: The rate at which consumers are reading messages from Kafka topics, measured in messages per second. This metric can be used to monitor the performance of your system and identify any potential bottlenecks.

  • Replication Lag (Kafka): The amount of time that it takes for messages to be replicated from the leader broker to the followers, measured in milliseconds. This metric can be used to monitor the performance of your system and ensure that messages are being replicated in a timely manner.

  • Offset Lag (Kafka): The difference between the latest offset in a topic and the offset of the latest message consumed by a consumer group. This metric can be used to monitor the progress of consumers and ensure that they are processing messages in a timely manner.

  • Under-Replicated Partitions (Kafka): The number of partitions in a Kafka topic that do not have the required number of replicas. This metric can be used to monitor the health of your system and ensure that messages are being replicated correctly for high availability.

  • Broker Capacity (Kafka): The utilization of disk, CPU, and memory resources on each Kafka broker. This metric can be used to monitor the utilization of your system's resources and identify any potential performance bottlenecks.
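Offset lag, as described in the list above, is a simple subtraction per partition. The sketch below works on plain dictionaries of offsets and does not call any Kafka client; in practice the offsets would come from your broker tooling or monitoring system. The partition numbers and offsets are hypothetical.

```python
def consumer_lag(latest_offsets: dict, committed_offsets: dict) -> dict:
    """Per-partition lag: latest offset minus the group's committed offset.

    A partition with no committed offset is treated as fully unread,
    which is an assumption of this sketch.
    """
    return {
        partition: latest - committed_offsets.get(partition, 0)
        for partition, latest in latest_offsets.items()
    }


# Hypothetical offsets for a three-partition topic.
latest = {0: 1500, 1: 1480, 2: 1510}
committed = {0: 1500, 1: 1400, 2: 1505}

lag = consumer_lag(latest, committed)
total_lag = sum(lag.values())
print(lag, total_lag)
```

A lag that keeps growing on one partition while the others stay flat often points at a stuck or slow consumer rather than a general capacity problem.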

Storage

  • Cost per gigabyte: This refers to the cost of storing one gigabyte of data. It is an important metric for evaluating the cost-effectiveness of different storage solutions.

  • RAID (Redundant Array of Independent Disks): This refers to a method of using multiple disk drives to provide data protection in case of disk failure. Different RAID configurations provide different levels of data protection and performance.

  • Disk read and write operations: The number of read and write operations performed on the instance's attached disks.

  • Read and write latency: The amount of time that it takes to perform a read or write operation on the instance's attached disks.

  • Read and write throughput: The amount of data that can be read or written to the instance's attached disks in a given time period.

  • Read and write IOPS: The number of read and write operations that can be performed by the instance's attached disks in a given time period.

  • Data Transfer In: The amount of data that is transferred to the instance.

  • Data Transfer Out: The amount of data that is transferred from the instance.

  • Disk Space Utilization: The amount of disk space that is being used on your storage system, measured as a percentage. This metric can be used to monitor the utilization of your storage resources and ensure that you have sufficient capacity.

  • Disk Space Available: The amount of disk space that is available on your storage system, measured in bytes or gigabytes. This metric can be used to monitor the utilization of your storage resources and ensure that you have sufficient capacity.

  • Object Count: The number of objects that are stored in your object storage system. This metric can be used to monitor the utilization of your storage resources and ensure that you have sufficient capacity.

  • Object Size: The average size of objects stored in your object storage system, measured in bytes. This metric can be used to monitor the utilization of your storage resources and ensure that you have sufficient capacity.

Caching

  • Cache Hit Ratio: The ratio of requests that are served from your CDN cache, as opposed to being fetched from the origin server. This metric can be used to monitor the efficiency of your CDN cache and identify any issues with cache hit rates.

  • Origin Latency: The amount of time it takes for your CDN to retrieve content from the origin server, measured in milliseconds. High origin latency can indicate a performance issue with your origin server or network connectivity.

  • Edge Latency: The amount of time it takes for your CDN to deliver content to the user, measured in milliseconds. High edge latency can indicate a performance issue with your CDN distribution or network connectivity.

  • Invalidation Count: The number of invalidations that have been requested for your CDN distribution. This metric can be used to monitor the efficiency of your CDN cache and identify any issues with cache hit rates.

  • Connected Clients (Redis): The number of clients that are connected to your Redis instance. This metric can be used to monitor the utilization of your system's resources.

  • Operations Per Second (Redis): The number of operations that Redis is processing per second. This metric can be used to monitor the performance of your Redis instance and identify any issues with resource utilization.

  • Expired Keys (Redis): The number of keys that have expired and been deleted from Redis. This metric can be used to monitor the performance of your Redis instance and ensure that it is functioning optimally.
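Cache hit ratio from the list above is hits divided by total lookups, and it translates directly into load on the origin. A minimal sketch with hypothetical counters:

```python
def cache_hit_ratio(hits: int, misses: int) -> float:
    """Fraction of requests served from cache; 0.0 when there is no traffic."""
    total = hits + misses
    return hits / total if total else 0.0


def origin_requests(total_requests: int, hit_ratio: float) -> float:
    """Requests that still reach the origin at a given hit ratio."""
    return total_requests * (1 - hit_ratio)


ratio = cache_hit_ratio(hits=9_200, misses=800)  # hypothetical counters
to_origin = origin_requests(1_000_000, ratio)
print(f"hit ratio: {ratio:.0%}, origin requests per million: {to_origin:,.0f}")
```

Small changes in hit ratio matter more than they look: going from 92% to 96% halves the traffic that reaches the origin.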

Cost

  • Resource Utilization: This metric measures the utilization of compute, storage, and network resources in your cloud environment. By monitoring resource utilization, you can identify resources that are underutilized or overutilized, and adjust your architecture accordingly to optimize costs.

  • Data Transfer: This metric measures the amount of data transferred in and out of your cloud environment. By monitoring data transfer, you can identify areas where you may be able to optimize your architecture to reduce data transfer costs.

  • Storage Costs: This metric measures the cost of storage resources in your cloud environment, including the cost of storing data and backups. By monitoring storage costs, you can identify areas where you may be able to optimize your architecture to reduce storage costs.

    • Storage Utilization: To calculate storage utilization, divide the total amount of storage used by the total amount of storage available in your environment. For example, if your environment has 1 TB of storage and you used 800 GB in a given period, your storage utilization would be 80% (800 GB / 1 TB = 80%).
  • Compute Costs: This metric measures the cost of compute resources in your cloud environment, including the cost of running virtual machines, containers, and serverless functions. By monitoring compute costs, you can identify areas where you may be able to optimize your architecture to reduce compute costs.

    • Compute Utilization: To calculate compute utilization, divide the total number of CPU or vCPU hours used by the total number of CPU or vCPU hours available in your environment. For example, if your environment has 10 vCPUs and you used 8 vCPU hours in a 1-hour period, your compute utilization would be 80% (8 vCPU hours / 10 vCPU hours available = 80%).
  • Network Costs: This metric measures the cost of network resources in your cloud environment, including the cost of data transfer and the cost of using cloud-based network services such as VPNs and load balancers. By monitoring network costs, you can identify areas where you may be able to optimize your architecture to reduce network costs.

    • Network Utilization: To calculate network utilization, divide the average data rate over the period by the maximum capacity of your network. For example, if your network has a maximum capacity of 1 Gbps and your traffic averaged 800 Mbps in a given period, your network utilization would be 80% (800 Mbps / 1 Gbps = 80%).
  • Usage Trends: This metric measures the usage trends for your cloud environment over time, including the trend for resource utilization, data transfer, storage, compute, and network costs. By monitoring usage trends, you can identify areas where your costs are increasing and take steps to address the underlying issues.

  • Cost Allocation: This metric measures the allocation of costs across different departments, teams, or applications within your organization. By monitoring cost allocation, you can ensure that costs are fairly distributed and that each team or application is paying for the resources it uses.
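For cost allocation, one common approach is to split a shared bill in proportion to each team's usage. The sketch below assumes usage is tracked per team in some comparable unit (vCPU-hours here). The team names and amounts are hypothetical, and proportional splitting is only one of several possible allocation policies.

```python
def allocate_shared_cost(total_cost: float, usage_by_team: dict) -> dict:
    """Split a shared bill across teams in proportion to their usage."""
    total_usage = sum(usage_by_team.values())
    if total_usage == 0:
        return {team: 0.0 for team in usage_by_team}
    return {
        team: total_cost * usage / total_usage
        for team, usage in usage_by_team.items()
    }


# Hypothetical monthly bill of $12,000 and vCPU-hours per team.
usage = {"checkout": 600, "search": 300, "reporting": 100}
shares = allocate_shared_cost(12_000, usage)
print(shares)
```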

Security

https://docs.aws.amazon.com/whitepapers/latest/aws-security-incident-response-guide/establish-metrics-for-success.html

  • Mean time to detect
  • Mean time to acknowledge
  • Mean time to respond
  • Mean time to contain
  • Mean time to recover
  • Attacker dwell time
  • Use indicators of compromise (IOCs)

Metrics summary