ScienceSoft’s Approach to Application Performance Management
Since 1989, ScienceSoft has been delivering high-performing and stable applications that are easy to maintain and scale on demand. Our performance management practices are directly informed by software engineering know-how, real-world testing, and ongoing feedback from both users and stakeholders.
3 Pillars of Our Approach to Application Performance Management
APM is a collaborative effort between development and operations teams, QA engineers, and business stakeholders. In this process, product owners define KPIs aligned with business needs and work with developers to prioritize performance improvements, using insights from both QA and operations teams.
We tailor APM strategies to the software type and scale, domain specifics, project budget, and short- and long-term business objectives. This ensures that performance management aligns with operational priorities without overspending.
To maximize impact and minimize costs in APM, we focus on high-value activities: writing efficient code, detecting errors early, and closely monitoring core functions and high-traffic areas. We also implement canary releases or blue-green deployments and set dynamic alert thresholds.
How We Monitor, Troubleshoot, and Enhance Application Performance
To ensure lasting speed and efficiency of your applications, we follow a proactive, layered strategy that combines:
Continuous real-time system performance monitoring
- By implementing solutions like Datadog, New Relic, and Dynatrace, along with custom tools and plugins, we monitor performance across the entire technology stack — from the front end, back end, and service mesh to the data storage and the underlying infrastructure. Among key metrics are response time, request throughput, error rates, and system resource utilization (CPU, memory, I/O).
- For network-heavy applications, we monitor network latency and throughput to detect bottlenecks and optimize traffic flow.
- For cloud services, we configure tools like Amazon CloudWatch or Azure Monitor to keep an eye on instance health, scaling activities, and service availability.
- We monitor external dependencies (e.g., payment gateways, external APIs, or third-party software) to detect potential performance bottlenecks originating outside our application. For example, we can simulate transactions to ensure that third parties respond as expected. If these third-party services slow down or fail, alerts signal the team to investigate.
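In simplified form, such a third-party probe could look like the Python sketch below. It measures the response time of a hypothetical payment gateway health endpoint and publishes it as a custom CloudWatch metric; the endpoint URL, metric names, and namespace are illustrative assumptions, and the code assumes boto3 with AWS credentials already configured.

```python
import time
import boto3      # assumes AWS credentials are configured in the environment
import requests

cloudwatch = boto3.client("cloudwatch")

# Hypothetical third-party endpoint; replace with the dependency being monitored.
PAYMENT_GATEWAY_HEALTH_URL = "https://api.example-payments.com/health"

def probe_dependency():
    """Measure the dependency's response time and publish it as a custom metric."""
    start = time.monotonic()
    try:
        response = requests.get(PAYMENT_GATEWAY_HEALTH_URL, timeout=5)
        latency_ms = (time.monotonic() - start) * 1000
        ok = response.status_code == 200
    except requests.RequestException:
        latency_ms = 5000.0  # treat timeouts and connection errors as worst-case latency
        ok = False

    cloudwatch.put_metric_data(
        Namespace="ThirdPartyDependencies",
        MetricData=[
            {"MetricName": "PaymentGatewayLatency", "Value": latency_ms, "Unit": "Milliseconds"},
            {"MetricName": "PaymentGatewayAvailable", "Value": 1.0 if ok else 0.0, "Unit": "Count"},
        ],
    )

if __name__ == "__main__":
    probe_dependency()
```

An alarm on a metric like PaymentGatewayAvailable would then notify the team whenever the dependency slows down or fails.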
Weighing the gains in visibility against any negative impact on performance
While logging and monitoring tools provide essential visibility into system health, they also introduce additional system load. This load can consume valuable network bandwidth, increase CPU and memory usage, and slow down responses, which is especially critical in high-performance or real-time applications. To counter this, we strategically place instrumentation only on the most crucial paths, ensuring minimal impact on system performance while still gathering the necessary data.
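One common way to keep this overhead in check is trace sampling. The sketch below uses the OpenTelemetry Python SDK to trace only a fraction of requests and instrument only a business-critical path; the service and span names and the 10% sampling ratio are hypothetical.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample only 10% of traces to keep instrumentation overhead low;
# child spans follow the sampling decision made for their parent.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.1)))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

def process_checkout(order):
    # Instrument only the critical path (e.g., checkout), not every helper function.
    with tracer.start_as_current_span("process_checkout"):
        ...  # critical-path logic goes here
```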
Synthetic and real user monitoring
- We set up synthetic transactions that simulate user interactions, allowing us to test performance from multiple global locations in a controlled manner, identify issues before users are impacted, and refine performance baselines (see the sketch after this list).
- We leverage real user monitoring (RUM) and record user sessions to collect data from user interactions and understand the actual user experience.
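A synthetic transaction can be as simple as a scripted user journey with per-step latency budgets. The sketch below is illustrative only; the base URL, paths, and budget values are hypothetical stand-ins for a real application and its baselines.

```python
import time
import requests

BASE_URL = "https://app.example.com"  # hypothetical application under test
LATENCY_BUDGETS_MS = {"home": 1000, "search": 1500, "product": 1200}  # illustrative budgets

def timed_get(session, path):
    start = time.monotonic()
    response = session.get(BASE_URL + path, timeout=10)
    elapsed_ms = (time.monotonic() - start) * 1000
    return response.status_code, elapsed_ms

def run_synthetic_journey():
    """Walk through a typical user journey and report steps that exceed their budget."""
    violations = []
    with requests.Session() as session:
        steps = [("home", "/"), ("search", "/search?q=shoes"), ("product", "/products/123")]
        for step, path in steps:
            status, elapsed_ms = timed_get(session, path)
            if status != 200 or elapsed_ms > LATENCY_BUDGETS_MS[step]:
                violations.append((step, status, round(elapsed_ms)))
    return violations

if __name__ == "__main__":
    print(run_synthetic_journey())
```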
Automated alerts and anomaly detection
- We establish alerts for critical KPIs like application latency, transaction errors, and database connection failures to prevent performance degradation from impacting user experience.
- We use custom alerts and anomaly detection thresholds to immediately flag potential issues based on historical performance trends. These automated alerts help detect and react to unusual patterns, such as spikes in error rates or sudden increases in resource usage. We design alerting systems to minimize false positives, understanding that if the number of “spam” alerts is overwhelming, teams may start ignoring them, potentially missing critical issues.
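As a simplified illustration of anomaly detection against historical trends, the sketch below flags a metric sample that deviates too far from its recent history. Production setups typically rely on the anomaly detection built into tools like Datadog or Dynatrace; the sample values and three-sigma rule here are illustrative.

```python
from statistics import mean, stdev

def is_anomalous(history, latest, sigmas=3.0):
    """Flag a sample that deviates more than `sigmas` standard deviations
    from its recent history (e.g., the last hour of error-rate readings)."""
    if len(history) < 10:          # not enough data to judge
        return False
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) > sigmas * sigma

# Illustrative use: error rates (%) sampled once per minute.
recent_error_rates = [0.4, 0.5, 0.3, 0.6, 0.4, 0.5, 0.4, 0.3, 0.5, 0.4, 0.6]
print(is_anomalous(recent_error_rates, latest=2.8))   # True: sudden spike
print(is_anomalous(recent_error_rates, latest=0.5))   # False: within the normal range
```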
Why we keep alerting thresholds dynamic
We adapt thresholds to variable system loads to ensure that our teams focus on actual performance issues that impact the UX or system reliability. For example, during off-peak hours, system resource usage might naturally decrease, so we adjust thresholds to a higher tolerance, reducing unnecessary alerts. Conversely, during peak hours or high-traffic events (e.g., Black Friday sales), we tighten thresholds to catch small performance dips that would otherwise go unnoticed.
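In its simplest form, a dynamic threshold can be derived from the current traffic profile. The sketch below is purely illustrative; the baseline value, multipliers, and off-peak window are hypothetical and would be tuned per application.

```python
from datetime import datetime, timezone

# Illustrative baseline: alert when p95 latency exceeds 800 ms under normal load.
BASE_P95_LATENCY_MS = 800

def current_latency_threshold_ms(now=None, high_traffic_event=False):
    """Return the alerting threshold adjusted for the current traffic profile."""
    now = now or datetime.now(timezone.utc)
    if high_traffic_event:             # e.g., Black Friday: catch even small dips
        return BASE_P95_LATENCY_MS * 0.75
    if 1 <= now.hour < 6:              # off-peak hours: tolerate more variance
        return BASE_P95_LATENCY_MS * 1.5
    return BASE_P95_LATENCY_MS

# Example: during a flagged high-traffic event, the threshold tightens to 600 ms.
print(current_latency_threshold_ms(high_traffic_event=True))
```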
Diagnostics and troubleshooting
- We leverage distributed tracing to identify the root causes of slowdowns or bottlenecks in complex, microservices-based architectures. This helps pinpoint which specific services or dependencies need attention (see the tracing sketch after this list).
- We implement log management solutions, such as ELK Stack (Elasticsearch, Logstash, Kibana) or Splunk, which aggregate and analyze logs across the stack, enabling efficient diagnostics and higher visibility.
- We implement automated corrective actions to ensure systems can recover from common failures on their own. For instance, we configure load balancers to reroute traffic away from failing instances and use auto-scaling groups to adjust resource allocation based on demand.
- We set clear response and resolution times for handling performance issues and create comprehensive runbooks with step-by-step guides for diagnosing and resolving typical problems. Additionally, we establish on-call schedules, ensuring that skilled personnel are available to promptly address any performance incidents.
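To show what distributed tracing looks like at the code level, here is a minimal sketch with the OpenTelemetry Python SDK. The service and span names are hypothetical, and the console exporter stands in for a real tracing backend (in practice, an OTLP exporter pointed at the chosen tool would be used).

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("order-service")  # hypothetical service name

def place_order(order_id):
    # Nested spans make it visible which step of the request is slow.
    with tracer.start_as_current_span("place_order") as span:
        span.set_attribute("order.id", order_id)
        with tracer.start_as_current_span("reserve_inventory"):
            ...  # call to the inventory service
        with tracer.start_as_current_span("charge_payment"):
            ...  # call to the payment provider

place_order("A-1001")
```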
Capacity management and optimization
- By regularly reviewing infrastructure utilization data, we ensure that resources aren’t under- or over-allocated (see the sketch after this list). This includes scaling compute instances, optimizing storage, and tuning network settings as necessary.
- We schedule periodic performance reviews to analyze resource consumption trends, enabling informed decisions on infrastructure scaling to ensure efficiency and cost control. This is especially important for the cloud, where misconfigured and inappropriate services can not only lead to suboptimal performance but also quickly escalate cloud costs.
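As a simplified example of such a utilization review, the sketch below pulls a week of CPU statistics for an EC2 instance via boto3 and flags instances that look over- or under-provisioned. The instance ID and sizing thresholds are hypothetical, and the code assumes configured AWS credentials.

```python
from datetime import datetime, timedelta, timezone
import boto3  # assumes AWS credentials are configured in the environment

cloudwatch = boto3.client("cloudwatch")

def weekly_avg_cpu(instance_id):
    """Average CPU utilization of an EC2 instance over the last 7 days."""
    end = datetime.now(timezone.utc)
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=end - timedelta(days=7),
        EndTime=end,
        Period=3600,                 # one datapoint per hour
        Statistics=["Average"],
    )
    datapoints = stats["Datapoints"]
    return sum(dp["Average"] for dp in datapoints) / len(datapoints) if datapoints else None

# Illustrative sizing check: flag instances that may need resizing.
for instance in ["i-0123456789abcdef0"]:          # hypothetical instance ID
    avg = weekly_avg_cpu(instance)
    if avg is not None and (avg < 15 or avg > 80):
        print(f"{instance}: review sizing, weekly average CPU is {avg:.1f}%")
```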
Disaster recovery
- We develop tailored disaster recovery strategies with clear recovery time and recovery point objectives (RTO and RPO).
- We conduct quarterly or semiannual disaster drills, including unannounced tests, to validate migration and switchover procedures.
- We maintain detailed logs of all recovery activities for ongoing audits and process improvements.
Continuous improvement and feedback loops
- We continuously optimize code and configurations based on monitoring insights and evolving usage patterns.
- We keep third-party libraries and dependencies updated to benefit from performance improvements in their new versions.
- We create tailored dashboards and reports for key stakeholder groups to keep them informed about the performance and health of their applications, past incidents, performance considerations, trade-offs, and optimizations.
Sample Performance Benchmarks We Set Based on App Type
Web portals
- Page load time: under 1 sec.
- Time to First Byte (TTFB): under 200 ms.
- Requests per second (RPS): from 100 to 1,000 RPS for standard apps; up to 100,000 RPS for high-traffic applications.
- Error rate: below 1% (0.1% or lower for critical applications).
- Database query performance: under 100 ms; under 1 ms for frequently used queries.
Ecommerce applications
- Page load time: under 2 sec even during high traffic; checkout pages under 1.5 sec.
- Transaction processing time: under 3 sec.
- Search query response time: under 1 sec, even with large product databases.
- Uptime: 99.9% or higher.
Financial applications
- Transaction processing time: under 10 ms for high-frequency trading apps; under 350 ms for real-time payment processing solutions.
- Latency: under 0.5 ms for high-frequency trading apps; under 50 ms for real-time payment processing solutions.
- Transactions per second (TPS): up to 1.5M TPS.
- Data consistency and accuracy: 100% data integrity.
- Uptime: 99.99% or higher.
ERP systems
- Transaction response time: critical transactions (e.g., order processing, invoice generation) respond within 1–2 sec.
- Concurrent user access: handling hundreds to thousands of concurrent users.
- Data processing throughput: millions of records per hour.
- Report generation: standard reports are generated within 2–5 sec; complex reports within 30 sec.
How We Engineer Applications for Optimal Performance
We integrate APM into every stage of our software development lifecycle, from initial planning and design through deployment.
Gathering app performance requirements
We engage all key stakeholder groups to gather accurate and achievable performance requirements:
- We interview business teams about expected response times, user experience goals, throughput targets, and overall business objectives.
- With technical stakeholders, we discuss scalability, concurrency levels, resource utilization, and the technical feasibility of meeting these performance metrics.
- To establish realistic and competitive performance standards, we also analyze industry benchmarks and evaluate competitors.
- We align performance requirements with budget limitations, ensuring optimal speed and availability without overspending. When necessary, we refine performance targets or evaluate alternative approaches to achieve the best results without overstepping financial boundaries.
Finally, we document all performance-related requirements alongside functional ones and establish clear KPIs to support accountability and transparency.
Designing a high-performing app architecture
We architect applications to align with the chosen performance requirements by:
- Segmenting the app into services, components, and layers and ensuring optimal communication methods and separation of concerns between them. While microservices are often praised for enabling high-performance, scalable systems, we recognize that a well-crafted monolithic architecture can sometimes be a more fitting solution, offering minimal latency and the highest data integrity (e.g., for banking apps).
- Selecting high-performing technologies like ASP.NET, Spring Boot, Node.js, and fast databases such as Redis and PostgreSQL. Sometimes, we combine several technologies for more efficiency, e.g., by employing Go or C++ for specific performance-intensive tasks or multiple types of databases to optimize different parts of the app.
- Planning caching mechanisms at various layers — client-side, server-side, and database level — to reduce load times and server processing demands (see the caching sketch after this list).
- Balancing security and performance. Security features — such as encryption, authentication, and access control — often add processing time, use up CPU cycles, and increase memory consumption. We solve this by limiting encryption to critical areas, using lightweight protocols for frequent authentications, and implementing less intrusive monitoring during low-risk periods.
- Planning the infrastructure for peak load handling by leveraging automatic cloud resource scaling and failover strategies.
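As one example of a server-side caching layer, here is a minimal read-through cache sketch using Redis via the redis-py client. The key format, TTL, and connection settings are illustrative assumptions, and `load_from_db` stands in for whatever expensive data access the cache protects.

```python
import json
import redis  # assumes a Redis instance is reachable at the given host and port

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)
CACHE_TTL_SECONDS = 300  # illustrative: keep product data for 5 minutes

def get_product(product_id, load_from_db):
    """Read-through cache: serve from Redis when possible, fall back to the database."""
    key = f"product:{product_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    product = load_from_db(product_id)                 # expensive database call
    cache.setex(key, CACHE_TTL_SECONDS, json.dumps(product))
    return product
```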
At the outset, we create prototypes and proofs of concept (PoCs) to validate performance assumptions, using the insights gained to refine both architecture and design pattern choices.
We then map the application’s architecture and dependencies, visualizing component interactions and analyzing the potential impacts of changes or failures on performance. With this data, our project managers and solution architects identify performance risks and develop mitigation strategies.
Programming for maximum efficiency
The main way to secure stable app performance is by writing efficient code. At ScienceSoft, we achieve this via:
- Following established coding standards and practices to deliver clean and concise code.
- Selecting business logic algorithms that meet core business objectives with maximum efficiency, considering execution speed, memory usage, and scalability.
- Applying precise thread management. For example, we can use synchronization mechanisms to ensure that only one thread accesses a shared resource at a time. To avoid deadlocks, we use techniques like lock hierarchies or timeout locks (see the sketches after this list).
- Utilizing resources efficiently. We optimize database queries, minimize network requests, reduce memory usage, and ensure timely disposal of objects and connection closure to prevent leaks.
- Designing integrations in a way that minimizes performance impacts. For connections with internal and external services, we employ strategies such as request batching, caching, asynchronous processing, load balancing, data compression, and rate limiting and throttling.
- Implementing circuit breakers, retries, and fallbacks to handle unexpected failures.
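To make the thread-management point concrete, here is a minimal Python sketch that combines a fixed lock-acquisition order with timeouts so a stuck thread fails fast instead of deadlocking; the transfer scenario and the two-second timeout are purely illustrative.

```python
import threading

# Always acquire locks in a fixed order (accounts before ledger) to avoid deadlocks,
# and use timeouts so a blocked thread fails fast instead of waiting forever.
accounts_lock = threading.Lock()
ledger_lock = threading.Lock()

def transfer(accounts, ledger, src, dst, amount, timeout=2.0):
    if not accounts_lock.acquire(timeout=timeout):
        raise TimeoutError("could not lock accounts")
    try:
        if not ledger_lock.acquire(timeout=timeout):
            raise TimeoutError("could not lock ledger")
        try:
            accounts[src] -= amount
            accounts[dst] += amount
            ledger.append((src, dst, amount))
        finally:
            ledger_lock.release()
    finally:
        accounts_lock.release()
```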
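And as a simplified illustration of the circuit breaker pattern from the last list item, the sketch below stops calling a failing dependency for a cool-down period and serves a fallback instead. In practice, a mature resilience library or a service-mesh feature would typically be used rather than hand-rolled code; the thresholds here are hypothetical.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after repeated failures, short-circuit calls to a
    dependency for a cool-down period and return a fallback value instead."""

    def __init__(self, max_failures=3, reset_after_s=30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, fallback=None, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                return fallback                      # circuit open: skip the call entirely
            self.opened_at = None                    # cool-down elapsed: try again
            self.failures = 0
        try:
            result = func(*args, **kwargs)
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback

# Illustrative use with a hypothetical recommendations service:
# breaker = CircuitBreaker()
# items = breaker.call(fetch_recommendations, user_id, fallback=[])
```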
Performance testing
To ensure that applications meet performance expectations, we employ a comprehensive suite of load, stress, spike, endurance, and performance regression tests. To get realistic results, we populate test databases with data that mirrors production scenarios and use environments that closely replicate production settings, including hardware, software configurations, and network conditions.
Test automation is our primary approach for performance checks; however, we also utilize manual testing where automation is not feasible or when a more exploratory approach is needed. We integrate automated performance tests into CI/CD pipelines for continuous monitoring and early detection of issues and create reusable performance test scripts to enable ongoing performance monitoring even after the software is rolled out.
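For illustration, an automated load test can be as compact as the Locust sketch below; the endpoints, task weights, and think times are hypothetical and would be tailored to the application and load profile under test.

```python
# Minimal Locust load test (run with: locust -f loadtest.py --host https://staging.example.com)
from locust import HttpUser, task, between

class ShopUser(HttpUser):
    wait_time = between(1, 3)   # simulated think time between user actions

    @task(3)
    def browse_catalog(self):
        self.client.get("/products")          # hypothetical endpoint

    @task(1)
    def view_product(self):
        self.client.get("/products/123")      # hypothetical endpoint
```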
Performance-focused deployment strategies
After the app is launched, we maintain a seamless user experience while safely deploying new code into production thanks to the following strategies:
- Dark launches
New features are deployed to production but kept invisible to end users. Developers observe the feature’s impact on system performance without risking user experience.
- Canary releases
New features are gradually rolled out to small subsets of users to monitor performance and detect issues before broader deployment.
- Feature flags
Flagged features can be turned on or off instantly, without redeployment, which lets us test the performance of specific components under varying loads and quickly disable features that negatively impact performance (see the sketch after this list).
- Blue-green deployments
We maintain two identical environments. At any given time, only one of them (“blue”) is live and serving users, while the other one (“green”) is idle. A new version of the app is deployed to the “green” environment for testing. If testing is successful, traffic is switched from the “blue” environment to the “green” one.
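To illustrate the feature flag mechanism referenced above, here is a minimal Python sketch that gates a new code path behind an environment-based flag. Real projects would more often use a dedicated flag management service with runtime updates; the flag and function names are hypothetical.

```python
import os

def feature_enabled(name, default=False):
    """Read a feature flag from the environment. A production setup would more likely
    use a flag service that supports runtime updates and per-user targeting."""
    return os.getenv(f"FEATURE_{name.upper()}", str(default)).lower() in ("1", "true", "yes")

def render_legacy_recommendations(product_id):
    return f"legacy recommendations for {product_id}"

def render_new_recommendations(product_id):
    return f"new recommendations engine for {product_id}"

def product_page(product_id):
    # The new code path can be switched on for a test and off again instantly,
    # without redeploying the application.
    if feature_enabled("new_recommendations"):
        return render_new_recommendations(product_id)
    return render_legacy_recommendations(product_id)

print(product_page("sku-123"))
```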
Regardless of the deployment strategy, we enable automated rollbacks to quickly revert to previous stable versions if performance issues are detected.