User: Hi
Bot: Hello, what can I do for you today?
User: I'd like to know the meaning of life
Bot: I'm afraid I did not understand that. Can you, please, rephrase it?
User: Can you tell me the meaning of life?
Bot: I'm afraid I did not understand that. Can you, please, rephrase it?
User: Forget it
Bot: I'm sorry I could not help you. It seems my developers have some more work to do!
Just in case you haven't already noticed, let me state it here: we are living in the age of robots and, more and more often, of less-than-perfect chat robots (a.k.a. chat bots).
In the fall of 2017, I was about to make myself a hot beverage, exact kind yet to be determined, when my CEO walked into the kitchen.
"So, how was the conference?", I asked, waiting for the water to boil.
"Very exciting, everyone seems to be talking about AI. It's about time we started putting some AI into our product offerings," said the CEO while pouring milk into his tea.
"As a matter of fact, I've acquired quite a bit of knowledge in the area of machine learning. Maybe I can help," I offered while stirring sugar into my coffee.
"Good to know!" said the CEO, and walked away, tea in hand, to yet another sales call.
One small disadvantage of working in a small company is that you have to make your own coffee. One large advantage of working in a small company is that, every now and then, you get to do it while chatting with your CEO.
Several weeks after this serendipitous kitchen chat, the same CEO asked me to start building a chat bot that would be able to take over some of our customers' load in supporting their own customers. After a week of intense research, I decided to give Amazon Lex a shot and build a proof-of-concept chat bot using this new technology, which Amazon had released to the general public in April 2017.
At this point, I had to make a critical decision: do I treat it as a throw-away prototype, hack a bunch of code as quickly as I can, and hope it works? Or do I use the best way I know to build a high-quality software product, and hope that the code sticks?
I decided to do the latter, i.e., to build the chat bot using the test-driven development methodology. The first obvious question was: is this even possible? After all, I was building software that talks to real people. It must be pretty hard to write automated tests that simulate a real person and verify that the chat bot responds with speech that a real person can understand.
The short answer: yes, it is possible. The longer answer: a chat bot can be developed in a test-driven fashion, but not by using end-to-end tests that simulate a real person. That approach would be doomed from the start, because the feedback loop would be too slow and too unreliable. Instead, one has to look at the 3rd-party tools one is using and make sure that the code one writes on top of them is developed using the TDD methodology.
In the case of Amazon Lex, AWS has provided speech-to-text and text-to-speech capabilities, along with a rudimentary framework for driving a linear conversation flow via configuration. These capabilities require only simple configuration and no coding, so TDD does not apply to them; manual testing suffices to ensure that the configuration is correct.
However, any non-linear conversation flow in Amazon Lex does require writing Lambda functions, which turns out to be best done in Python, and (I dare say) any Python code can and should be developed using TDD.
Despite the fact that AWS Lambda functions are marketed as serverless magic, where the developer doesn't even know where or how exactly the code runs, writing unit tests for Lambda functions in Python turns out to be fairly easy: Lambda functions are called by the container through a simple, well-defined interface, and the code has no dependencies on the container itself. So one can test the Python code implementing a Lambda function by running it locally, e.g., from the PyCharm IDE using Python's unittest module.
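To make this concrete, here is a minimal sketch of such a locally testable fulfillment function. The OrderStatus intent, its OrderId slot, and the canned reply are all invented for illustration (the event and response shapes follow the Lex V1 Lambda format), not taken from our actual bot:

```python
import unittest


def lambda_handler(event, context):
    """Fulfillment handler for a hypothetical OrderStatus intent.

    The event shape follows the Amazon Lex (V1) Lambda input format:
    the intent's slot values arrive under event["currentIntent"]["slots"].
    """
    order_id = event["currentIntent"]["slots"]["OrderId"]
    return {
        "dialogAction": {
            "type": "Close",
            "fulfillmentState": "Fulfilled",
            "message": {
                "contentType": "PlainText",
                "content": "Order {} is on its way.".format(order_id),
            },
        }
    }


class OrderStatusHandlerTest(unittest.TestCase):
    """Runs entirely locally, e.g., from PyCharm or `python -m unittest`;
    no AWS container is involved."""

    def test_reports_order_status(self):
        event = {"currentIntent": {"slots": {"OrderId": "42"}}}
        response = lambda_handler(event, context=None)
        self.assertEqual("Close", response["dialogAction"]["type"])
        self.assertEqual("Order 42 is on its way.",
                         response["dialogAction"]["message"]["content"])
```

Because the handler is a plain function taking a plain dictionary, the test constructs the event by hand and never touches AWS at all.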
So, TDD of the code driving the flow of conversation on the Amazon Lex platform is pretty straightforward, as long as the code can do everything by itself. However, unless one is developing a purely philosophical chat bot, this code will sooner or later need to connect to some external system, e.g., a payment processing platform.
At this point, TDD gets a bit trickier, because one can never have a precise enough specification of the external system's behavior and its APIs. If one simply assumes the expected behavior and encodes it in the unit tests that then drive the code being developed, one can easily end up in a situation where all unit tests happily succeed, but the integration with the external system doesn't work at all.
The way I approach this problem is by writing a minimal set of automated integration tests, which actually call the external system's APIs and pass real data back and forth. I make it clear from the start that these are integration tests, by using a naming convention I developed for the purpose. In the example of a payment processing API, I might create one integration test for a successful payment using a credit card, one for a bank account, and several for unsuccessful payment attempts. Once I have written just enough production code to make these integration tests succeed, I encode the discovered details of the API's behavior, beyond what the available specification states, in separate unit tests, which will likely have the same inputs and outputs as the integration tests, but will mock the actual calls to the external APIs. Once I have the external APIs mocked, I can add tests that drive the production code to be more robust by covering secondary scenarios and corner cases.
This step of using integration tests to design the unit tests is the key to effective TDD of any code that integrates with 3rd-party APIs. If you skip integration tests, it's going to be very hard to get the integration to work. But if you keep only integration tests (which is very tempting), you will get stuck with slow and unreliable tests, which over the long haul will become a black hole sucking in more and more of your development effort. The small overhead of building true unit tests that encode the essence of the external API's behavior, as you extract it from the integration tests, is a very good investment.
Once I have fast and reliable unit tests, I can run them all within a few seconds from the PyCharm IDE and get instant feedback if any further change to the production code accidentally alters its expected behavior. This allows me to keep the code clean without fear of introducing regressions.
Even though I don't run them often, and they are almost certain to break over time, I usually do keep the integration tests I wrote - they tend to be quite useful once in a while, e.g., when an external API makes a backward-incompatible change, or when I need to make a major refactoring that changes internal APIs. I just don't expect them to work without some manual set-up. This is especially true if an integration test verifies some non-idempotent behavior, e.g., recording that a bill was paid.
An interesting challenge to this TDD-ed code base was presented by our decision, driven by some advanced functionality deemed critical, to switch from Amazon Lex to IBM Watson Assistant some six months into the development of the bot. Because the architectures of the two platforms are quite different, some of the peripheral production code, together with the related unit and integration tests, had to be thrown away. But since the core functionality didn't change, the core code could easily be reused. This is not to say that TDD alone was responsible for the flexible code architecture. Rather, the fact that I could incrementally refactor and reorganize the code, keeping it operational while instantly detecting any regressions, allowed me to arrive at a flexible architecture over time, albeit under constant pressure to demonstrate progress in supported functionality.
The switch to the IBM platform did bring along an interesting challenge: unlike Amazon Lex's simple configuration, IBM Watson Assistant provides a fairly complicated proprietary programming language for managing the conversation flow, without much support for debugging. So the development team (which had in the meantime grown to four developers and a tester) soon found out that TDD-ing only the Python code was no longer enough: regressions were possible, and were indeed happening, in Watson Assistant's conversation flow.
The team responded by building a simple test automation framework in Python, in which automated tests pass text input to the chat bot API over HTTP and verify the text output returned by it. This avoided the complication of dealing with voice, while still allowing the development in IBM's proprietary language to be driven by tests. I sometimes call those end-to-end (E2E) tests, even though that's not quite correct, as the conversion of voice to text and back is not covered by them. And sure enough, over Labor Day weekend 2019, IBM managed to break the Speech-To-Text component in production, which our E2E tests could not detect. But, at the end of the day, that breakage had nothing to do with the code my team had produced, so having our tests detect it would not have helped much.
Regardless of whether you call them E2E or integration tests, they unfortunately have all the negative properties of E2E tests: they are slow and brittle. That's why they are best developed using TDD, which implies white-box testing and (by developer's inertia) minimizes the number of tests. Give a dedicated tester the task of creating a suite of such tests in isolation, after the bot has already been developed, and the tester's inevitable treatment of the chat bot as a black box will ensure that you end up with a pile of tests that are impossible to maintain.
Over time, through extensive retrospection, my team discovered that such E2E tests can also be used to collaborate on the specification of the functionality to be built, which is far more effective than any other specification format we have tried. This has led us toward adopting Specification by Example.
Saturday, September 14, 2019
Test Driven Chat Bot Development