Mikes Notes
This is republished from the ThoughtWorks Insights Blog.
Resources
- https://www.thoughtworks.com/insights/blog/architecture/how-can-on-site-servers-enable-richer-retail-experiences-part-one
- https://www.thoughtworks.com/insights/blog/architecture/how-can-on-site-servers-enable-richer-retail-experiences-part-two
How can on-site servers enable richer retail experiences?
By Chris Ford
Published: May 21, 2024
Part one: Business motivation
When people think about omnichannel retail experiences, they often start by considering in-store and digital as two distinct elements. But for a modern retailer, in-store is just another kind of digital, sometimes referred to as “phygital”.
It’s now a given that customers expect the store experience to be well-integrated into the retailer’s online presence and that they have little tolerance for services or data that are available in one touchpoint but absent in another.
Progressive retailers go further and take advantage of physical presence to address a customer’s needs contextually – someone browsing in a store requires different inspiration than the same person ordering from their couch at home.
As the relationships between different digital touchpoints become more complex, we need to develop new architectures that are resilient enough to keep stores running, powerful enough to support rich digital experiences and responsive enough to change that we can evolve them alongside the business.
Above all, we must not let the increasing digitalization of retail cause business fragility. Recent high-profile cases have shown that when digital service is interrupted, it can cause massive business disruption.
My colleague Alessandro mentioned to me a system he was working on that deploys Docker containers to servers running in the stores of a large food retailer. They power point-of-sale systems that serve customer experience as well as store back office systems for processing payments and orders.
When I first heard this, it sounded like an odd mixture of technologies, as I had previously only encountered containers used in data centers or for local development environments. But when he explained the goals and trade-offs of the system architecture, I understood containerised on-site servers as a pattern that is likely to become more important as businesses bring the physical and the digital ever closer.
Digitally-powered stores
Chris Ford: Why does a store need to be integrated into a wider digital architecture in the first place?
Alessandro Salomone: Retailers have recognized the benefits of digitally-enabled stores for a long time now. One key motivator of a digital architecture is the ability to get to know your customers and to offer them targeted promotions based on their needs and tastes. This is especially true in food retailing, my client’s sector. Customers may consume products in more than one of your shops, so having a system that aggregates customer information will ensure a consistent and personalized experience wherever they shop.
And of course, you want your infrastructure to bridge the online and in-store shopping experiences, with the opportunity for the user to browse products, view electronic receipts and benefit from personalized vouchers for the products that they are likely to love and buy.
In a post-Covid environment, retailers are actively looking to experiment and cultivate customer engagement and are particularly focused on attracting customers back to stores. Point-of-sale systems and in-store digital experiences are important investment areas to achieve this.
And on a practical level, as your business model gets more complex, you want to make sure that business processes in all your stores are supported by an infrastructure that ensures a consistent experience and implementation of such processes.
Lastly, collecting data from your stores can give you insights into your business in the form of fine-grained financial projections. With data coming from all your stores, you can discover the seasonality of your offering and gain visibility into product waste and the appeal of your products in regional markets, to mention just a few. All this can lead to better targeted marketing and promotion design, which can help you strengthen the foundation of your business and let it grow and adapt to the ever-changing landscape of the market.
Running servers in stores
CF: What reason do retailers have for installing local servers in their stores? Wouldn’t it be simpler to just talk directly to a central server running in a data center?
AS: Talking to a central server would indeed be simpler. And that's how you want everything to appear to your customers and attendants when they are shopping or operating the store: everything is connected and it all just works. You want to launch new products and promotions from a single, central point, and ensure all stores receive the same information at the right time and, as we mentioned earlier, you want to be able to identify your customers and offer them tailored services and products.
In practice, there are a few technical constraints to consider. Running a business in a store means having several classes of in-store devices that operate together to provide a seamless experience to the customer and the store personnel: security cameras, handheld devices for inventory and price labeling, scales, payment checkout tills, not to mention back office applications for stock management and solutions for walk-out checkout customer experience.
The overriding business consideration is to ensure availability of critical services that support an uninterrupted shopping experience. The most likely disruptive event is the loss of network connectivity from the store to the central services. Crucial customer flows, like checkout and payment, must be guaranteed in this scenario. The benefits of digitization count for nothing if our technology introduces fragility that threatens business continuity.
Beyond plain checkout, how can you guarantee that the customer who just grabbed a three for two deal on roasted coffee beans can enjoy their anticipated discounts even when the internet connection goes down as the cashier is scanning them at the till? All product, price and promotion information should be readily available. Perhaps a network interruption means you can’t access promotions based on the shopper’s personal history, so how do you gracefully degrade service and still provide product level discounts?
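The graceful degradation described here can be sketched in a few lines. This is only an illustration under stated assumptions, not the system from the interview: `MultiBuy`, `line_total`, `checkout_total` and the `fetch_personal_discount` callable are hypothetical names, and the discount rules are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class MultiBuy:
    """Hypothetical 'buy N, pay for M' product-level promotion cached in store."""
    buy: int
    pay_for: int


def line_total(unit_price_cents: int, quantity: int, promo: Optional[MultiBuy]) -> int:
    """Price one basket line using only locally cached product promotions."""
    if promo is None:
        return unit_price_cents * quantity
    groups, remainder = divmod(quantity, promo.buy)
    return (groups * promo.pay_for + remainder) * unit_price_cents


def checkout_total(
    unit_price_cents: int,
    quantity: int,
    promo: Optional[MultiBuy],
    fetch_personal_discount: Callable[[], int],
) -> int:
    """Always apply the cached promotion; apply the personal discount only if reachable."""
    total = line_total(unit_price_cents, quantity, promo)
    try:
        personal = fetch_personal_discount()  # assumed to need the central service
    except ConnectionError:
        personal = 0  # degrade: keep the product-level discount, skip the personal one
    return max(total - personal, 0)
```

With a three-for-two promotion cached locally, the shopper still gets the product-level discount when the call for personal promotions fails; only the personalized part is dropped.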
What if your customer is entering your walk-out store by tapping their loyalty card at the gates and there’s no, or slow, internet connection? You don’t want to make them wait: they should be swiftly welcomed by an opening gate, while the system will figure out how to identify them and eventually charge their preferred payment method as they walk out.
A store server that caches the critical business information and is able to autonomously implement business flows acts as a buffer between the store and the central services, ensuring that local store operations can go on undisturbed, and that any changes will be synchronized as soon as the connection is restored.
Another motivation for having a local server is the local management of store-specific data. The device that prints the tags to be placed on the shelves must show the same prices and promotions that will be printed on the customer’s bill. It’s standard for retailers to vary pricing based on store considerations like location and nearby competition. But what about dynamic factors, for example whether a store has excess stock of a specific good that should be effectively depleted?
Be it a consumer electronics product about to be made obsolete by an upcoming new model, or a bottle of milk nearing its expiry date, the store could run a dedicated promotion campaign to clear stock that would otherwise go to waste, while making customers happy. This requires the in-store personnel to be able to set up a campaign for their store from the back office application, overriding set prices and promotions, while ensuring that all the devices in store are ready to show the same information.
CF: Doesn’t this contradict the prevailing wisdom that businesses should migrate to the cloud?
AS: This architecture is not about going back to on-premise data centers. In the terminology of cloud computing, this is more like an edge computing architecture – augmenting cloud capabilities with compute situated closer to the customer. Edge computing isn’t in competition with the cloud, but a complement to it. Your favorite cloud provider almost certainly has a story around edge computing, because the advantages around resilience and latency are compelling for some use cases.
That being said, having to manage and deploy to on-premise machines and devices is more complicated than operating in a purely virtual environment. Perhaps we can talk later on about the engineering needed to pull it off.
CF: Could I think of running retail software on on-site servers like installing apps on my mobile phone? A lot of the time, I happily rely on websites that are hosted in remote data centers, but for some things it’s better if the program is installed on the device in my hand.
AS: Correct, websites are fine if you don’t mind being interrupted if you lose mobile phone signal. But when you run an app locally, you get better performance and integration with the rest of your mobile experience. With an app, depending on what operations you want to do, some of them can still be performed even when you have an intermittent internet connection.
An example you may be familiar with is Google Docs. If you install the app on your mobile it lets you continue to work even when you are offline, and transparently syncs with the server when you have connectivity again.
In the same way, having servers in the store means having your business data and logic here, next to where you need them, without being dependent on something external to operate. Having smart devices communicate with a local server that orchestrates your business processes allows your stores to operate independently, while remaining aligned with the overall business implementation and its evolution dictated by the central services.
CF: Can you give me an idea of the scale of the architectures you’ve worked on? How many stores, how many services etc?
AS: In one of our projects we worked with a client supporting 40,000 devices in more than 2000 stores. Each device would be configured to receive daily business data updates and to send near real-time telemetry and analytical data to the central servers. At this scale, device failures and network interruptions happen all the time, so your system had better be prepared to deal with them.
Related use cases from other industries
CF: Do you see on-site servers being used in other industries or contexts?
AS: On-site servers are a concept applicable in several other industries. This architecture brings advantages wherever data and operations need to be available for highly critical business processes, or in businesses where good network connectivity is an inherent issue or cost.
One example that we have discussed with clients is hospitals. A modern medical center is a sophisticated information technology hub. It runs various software that manage sensitive clinical as well as operational data. Applications running on local servers offer a way to support these applications and also to gracefully upgrade them. Tolerance of network failure is essential as the hospital cannot stop treating patients in the case of an outage.
Another is cruise ships. The digital expectations of passengers are rising. Entertainment, schedules, menus, communications and even digital room keys are becoming an essential part of a luxury experience. On-board servers give a way to support these experiences without relying on connectivity that is often unavailable at sea.
Thank you Alessandro for relating your first-hand experience developing software for in-store servers and explaining why retail businesses are motivated to adopt this architecture.
In the next part of this series, I talk with Alessandro about the engineering challenges that come with developing such a system. Deployment, testing and observability require some different approaches relative to systems hosted in data centers.
By Chris Ford
Published: May 21, 2024
Part two: Engineering
In the first part of this series, Alessandro Salomone described why retailers turn to running servers in their stores. Local servers provide a platform for richer digital experiences for customers and more sophisticated back office processes around data, pricing and payment, and above all enable resiliency in the case of network failure.
In this follow-up piece, we dive into the engineering you need to do it well. For example, we discuss challenges like testing and deployment and find out why containerization makes this all easier and more robust.
Data synchronization
Chris Ford: Doesn’t a local server introduce problems of data synchronization with the central server? How do you handle that?
Alessandro Salomone: Yes, and no. A setup in which a local server is present helps reduce the volume of data exchanged with the central servers, avoiding each and every device communicating with the servers. It also makes operations faster and more reliable, because the devices benefit from data cached in the local server, on a local network that is more stable and faster than the internet.
The presence of a local server increases the complexity of the architecture, which has to be designed to ensure eventual synchronization of the information in store and on the central server. Temporary central server unavailability or unreachability, due to heavy loads or connection issues, can cause temporary desynchronization between the two servers.
For this, we designed the servers to communicate using event queues in both directions, allowing the local server to pull events from the central server and vice versa. Events can be anything from the broadcasting of a product update, the availability of a new promotion campaign, a device reporting a failure or a client checking out. Queues allow the servers to catch up on missed updates caused by a temporary disconnection by replaying all the missed events.
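A minimal sketch of the replay idea follows, assuming an append-only log in which each event has an offset and the consumer remembers the next offset it needs. `EventLog`, `StoreServer` and the event fields are hypothetical, and only the central-to-store direction is shown; the interview does not say which queueing technology was used.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class EventLog:
    """Append-only log; each event's offset is its position in the list."""
    events: List[dict] = field(default_factory=list)

    def append(self, event: dict) -> int:
        self.events.append(event)
        return len(self.events) - 1

    def read_from(self, offset: int) -> List[Tuple[int, dict]]:
        """Return every (offset, event) pair at or after the given offset."""
        return list(enumerate(self.events))[offset:]


@dataclass
class StoreServer:
    """Hypothetical store-side consumer that remembers how far it has read."""
    next_offset: int = 0
    prices: Dict[str, int] = field(default_factory=dict)

    def catch_up(self, central_log: EventLog) -> int:
        """Replay all events missed since the last successful pull."""
        applied = 0
        for offset, event in central_log.read_from(self.next_offset):
            if event["type"] == "price_updated":
                self.prices[event["sku"]] = event["price_cents"]
            # Other event types are acknowledged but ignored in this sketch.
            self.next_offset = offset + 1
            applied += 1
        return applied
```

Because the store tracks its own offset, a disconnection only delays `catch_up`; when the link returns, the next call replays everything appended in the meantime, in order.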
Another challenge worth mentioning is sizing on-premise infrastructure. Load is highly unpredictable in stores. The amount of footfall and therefore digital traffic varies a lot between stores and changes at different times of the year. It’s a good idea to make in-store infrastructure as horizontally scalable as possible so that you can add more capacity where you need it. You don’t want to rely on single large machines that will take the whole store down if they are overloaded, while servers in other stores are underutilized.
Testing and deployment
CF: How do you deploy updates to these local servers?
AS: Local servers were running their logic in containerized microservices. This would allow the development teams to publish new versions of the microservices to a central container registry and wait for the local servers to discover and pull the new container images.
The local servers would have their container orchestrator check for new images on a frequent schedule, to ensure new features and bug fixes could be rapidly deployed to all stores.
When you design your container orchestration, be sure to consider the resource constraints of your in-store hardware. Kubernetes and its ecosystem of tools like Argo CD are powerful and might do what you need, but they are also primarily designed for data centers. You have to find a balance between achieving a lightweight solution and avoiding the temptation to roll your own infrastructure orchestration.
CF: How would you test such a setup?
AS: The containerisation makes testing this setup a lot easier than if we were installing applications directly onto local devices. Testing can be done by replicating the containers and their connectivity on a virtual machine and running manual or automated tests, for example using a continuous integration pipeline. In our case we needed an extra step, which was to virtualise the in-store devices that the microservices solution talks to.
We decided to design the microservices architecture to ensure that every in-store device would have a corresponding containerised microservice that would abstract the device’s data and control interfaces. With a test double for each device microservice, it is possible to replace the original microservice in a test setup. The test containers virtualizing the devices would feature a control port to ensure device data and behavior could be simulated, so as to implement all the required test scenarios, and in fact drive the development.
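The test-double approach can be illustrated with a small in-process sketch. The interview describes containers with a network control port; here the control interface is reduced to plain method calls, and `ScaleService`, `FakeScaleService` and `price_by_weight` are hypothetical names chosen for the example.

```python
from typing import Optional, Protocol


class ScaleService(Protocol):
    """Interface the rest of the system uses to talk to a scale microservice."""

    def read_weight_grams(self) -> int: ...


class FakeScaleService:
    """Test double standing in for the real scale microservice."""

    def __init__(self) -> None:
        self._weight_grams = 0
        self._fault: Optional[str] = None

    # Control interface: used only by tests to script device behaviour.
    def control_set_weight(self, grams: int) -> None:
        self._weight_grams = grams

    def control_inject_fault(self, fault: Optional[str]) -> None:
        self._fault = fault

    # Device interface: same shape the real microservice is assumed to expose.
    def read_weight_grams(self) -> int:
        if self._fault is not None:
            raise RuntimeError(f"scale fault: {self._fault}")
        return self._weight_grams


def price_by_weight(scale: ScaleService, cents_per_kg: int) -> int:
    """Code under test: prices loose goods from whatever scale it is given."""
    return scale.read_weight_grams() * cents_per_kg // 1000
```

A test can then script both the normal case (set a weight, check the price) and the failure case (inject a fault, check that the error surfaces) without any physical device.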
CF: When deploying to a cluster of servers in a data center, you might test the release with a small group of servers. Is there an equivalent for store servers?
AS: Definitely. In our experience, we built a thin layer on top of our container orchestrator in order to implement pilot and canary release management. A specific store, or a selected set of stores, could be identified as a deployment group for “beta” releases of new software. This would happen regularly and automatically so that all the new features introduced by the new software version would be tested in a controlled environment, and end-to-end customer or store attendant feedback would be collected before releasing the version globally.
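As a rough sketch of what such a thin layer might decide, the function below picks a version per store from a deployment group. The `Release` structure, the `promoted` flag and the store identifiers are assumptions made for illustration; the interview does not describe the actual data model.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Release:
    """Hypothetical release plan: a stable version plus a beta for pilot stores."""
    stable_version: str
    beta_version: str
    pilot_stores: FrozenSet[str]
    promoted: bool = False  # set to True once pilot feedback has been accepted


def version_for_store(store_id: str, release: Release) -> str:
    """Pick the software version a given store should run."""
    if release.promoted or store_id in release.pilot_stores:
        return release.beta_version
    return release.stable_version
```

Pilot stores receive the beta first; flipping `promoted` rolls the same version out to every store on its next check.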
Unpredictable environments
CF: A data center is a very controlled environment. A store is less so. What does support look like when you have servers running out there in the world?
AS: To start with, a VPN LAN is a must for security. Incoming connections should be blocked and outgoing connections should be allowed based on an allow-list. This helps to protect your in-store systems from external threats.
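The default-deny policy for outgoing traffic can be expressed as a small check. In practice a rule like this would normally be enforced by a firewall or network policy rather than application code; the hostnames below are placeholders, not real endpoints.

```python
from typing import FrozenSet, Tuple

# Hypothetical destinations; a real list would come from network policy.
ALLOWED_OUTBOUND: FrozenSet[Tuple[str, int]] = frozenset({
    ("registry.example.internal", 443),
    ("events.example.internal", 443),
})


def is_outbound_allowed(host: str, port: int) -> bool:
    """Default-deny: permit only exact (host, port) pairs on the allow-list."""
    return (host.lower(), port) in ALLOWED_OUTBOUND
```

Anything not explicitly listed, including a listed host on a different port, is rejected.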
Proper, dedicated support channels should be available to store attendants and managers so they can quickly communicate issues and get them solved. For this reason a first-level line of support would be available to ensure that any store issue was promptly recorded and routed to the correct department and, eventually, development team. This requires the development teams to write support guides that help first-level support resolve the easiest issues or, for more complex or unknown issues, know what information about the user’s experience and actions to collect and pass on to the team to speed up its investigation and fix.
A ticketing system would be used to record the user’s issue and automatically reach the development team most likely able to contribute to the solution, by ringing the phone and sending an email to the team member on the roster: within just a few minutes the developer would be able to get in contact with the store personnel.
This is where observability becomes essential: support is part of the product development process, and it has to be thought through from the very beginning. Building and running a system that can collect health metrics on the microservices running on the local server, and possibly from the connected devices, is an effective way to be able to observe what is happening in store and quickly spot issues. Also business metrics on the user flows are essential to understand where users (be it customers or attendants) are getting stuck or are experiencing problems.
With appropriate observability tools, like logs, audits and dashboards, the development team can relate the actions of the people in store to the timestamped information coming from the in-store telemetry, gaining the visibility they need to spot the reported issue. Here it is important to ensure the data is collected and presented in a way that can easily tell the story of what has happened.
In urgent or complex cases, it would become more practical for the development team to directly call back the store and ask them to re-enact the situation, in order to reproduce the error. Thanks to near-real-time telemetry, the developers would be able to see what was happening in store and in the software running there on the local server, and promptly release an emergency fix that would be distributed to all the stores within the hour.
High-traffic periods
CF: I recently spoke to our colleague Glauco about building retail systems to survive Black Friday and Cyber Week. How does this architecture stand up to demand in those busy periods?
AS: A nice thing about this architecture is that store systems are designed to run independently, so Cyber Week and load from elsewhere isn’t really a problem. It’s still very important that everything stays up though!
It is in these periods of the year when the advantages of this distributed architecture are especially visible. One of our customers stated that more than 50% of their yearly revenue would come from Christmas shopping: a period in which the stores need to be fully operational at their maximum capacity, without disruptions.
And of course, information like price changes and promotion campaigns for these weeks needs to be readily available before the customers flood into the stores. That’s where the idea came from to deliver such data to the stores as much as a week before the new schemes would take effect. All information, accompanied by a proper start and end date, would be readily available and usable by the store server and all the other in-store devices, and active only in the expected time period, even if connection to the central service was a bottleneck.
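Delivering promotions ahead of time and activating them by date can be sketched as a filter over the locally cached data. `ScheduledPromotion` and `active_promotions` are hypothetical names, and treating both the start and end dates as inclusive is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class ScheduledPromotion:
    """A promotion cached in store before it takes effect."""
    promo_id: str
    starts_on: date  # first day the promotion applies (assumed inclusive)
    ends_on: date    # last day the promotion applies (assumed inclusive)


def active_promotions(cached: List[ScheduledPromotion], today: date) -> List[str]:
    """Return the ids of cached promotions that apply on the given day."""
    return [p.promo_id for p in cached if p.starts_on <= today <= p.ends_on]
```

Because the decision depends only on the cached records and the local date, a promotion delivered a week early switches on at the right time without any call to the central service.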
In the end, the central server acts as a business orchestrator, distributing information and stating the expected behavior, as well as a collector for the observability of the business for support purposes and business intelligence insights. All this while the heavy load is handled seamlessly at the local level.
Thank you Alessandro for your insights running business-critical services via in-store deployments.
These kinds of architectures support next generation in-store experiences, but they also give rise to distributed systems challenges not found in traditional data center based systems.
At Thoughtworks we’ve observed local deployments supporting rich in-store digital functionality as part of a movement towards edge computing, not as an alternative to cloud computing, but as a complement to it. We predict that as in-person and online digital experiences become ever more closely integrated, this trend will continue.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.