Eight months after our first co-design session, we finally went live with a true AI-first solution: Uncover. Caroline and Ingrid’s vision of AI-powered litigation case management software has a real impact on how lawyers work, and it was also a real technical challenge to build (read more in Why We are Thrilled to Co-build Uncover). The product not only pushes the boundaries of current Natural Language Processing (NLP) technology; it also has to meet the high expectations of lawyers, who need to trust it from day one.
Building Uncover was anything but a linear path. We encountered many challenges and we also learned a ton. We’re super grateful to Ingrid, Caroline, Imre, and Augusto for trusting us to build this product and new company with them! And in the spirit of sharing with those of you also building your AI products, we’d like to share a glimpse of what it takes to build an AI-first product and our lessons learned.
Every product comes with its own challenges. At first glance, building software that lets you search, structure, and create a timeline from a pile of documents might seem simple. But there’s more to it than meets the eye — Uncover is an innately complex product.
In order for the users — very busy litigation lawyers — to want to start using the product, they need to fully trust the software. Not only does the AI have to work at a high level of recall and precision, but the product also has to be super intuitive, secure, and integrate seamlessly into the systems and workflows of the law firm. Only then will the lawyers begin to consider adopting this product.
Specifically, building a product that can be trusted by lawyers required (1) the application of the most recent advancements in text extraction and named entity recognition (NER) and (2) getting the UX right to make the user feel that they’re in control. Throughout our 12 years of experience building products, we’ve never seen an interplay between UX and AI this important. Both of these aspects added to the complexity and challenges of building this software, and we’ll break it down for you in the following section.
The Challenges of an AI-first Product
What makes this product so complex is that we needed to combine several different AI techniques such as optical character recognition (OCR) and named entity recognition (NER). Particularly, what turned out to be challenging were date extraction, document classification, and designing an intuitive UX.
Date Extraction — Tapping into the AI community and the challenge of hosting in the cloud
To create insightful timelines from case file documents, we needed to accurately extract time events, such as dates, from text. Training a new machine learning (ML) model to recognise all the different ways and contexts in which time events can be written would require huge amounts of tagged text, which we didn’t have. Fortunately, the AI community was our friend here: work on NER had by then produced a handful of models that include dates among their entity categories. The Flair NER model (distributed via the Hugging Face hub) turned out to be the only one that performed well enough for our use case, so we decided to go with it.
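Downstream of the NER step, an extracted date mention still has to be normalised into a concrete date before it can be placed on a timeline. Here is a minimal, stdlib-only sketch of that normalisation step; the format list and function name are our own illustration, not Uncover’s actual code:

```python
from datetime import date, datetime
from typing import Optional

# A few formats a DATE entity string might arrive in; a real system
# would cover many more variants (and locales).
_FORMATS = ["%d %B %Y", "%B %d, %Y", "%Y-%m-%d", "%d-%m-%Y", "%d/%m/%Y"]

def normalise_date(mention: str) -> Optional[date]:
    """Try to parse an NER-extracted date mention into a concrete date.

    Returns None when the mention is ambiguous or incomplete, so the
    caller can flag the event for user review instead of guessing.
    """
    for fmt in _FORMATS:
        try:
            return datetime.strptime(mention.strip(), fmt).date()
        except ValueError:
            continue
    return None

print(normalise_date("14 March 2019"))       # 2019-03-14
print(normalise_date("sometime in spring"))  # None
```

Returning `None` rather than a guess matters here: as described later in this post, unresolvable dates are surfaced to the user instead of being silently filled in.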
Unfortunately, hosting it in the cloud brought its own challenges. The model is big, slow, and requires a great deal of computing power to do its job. On early start-up funds, that kind of computing power is not a luxury you usually have, and we didn’t either. There is a lot we could write about this topic, but suffice it to say, AWS’s offering is not where we hoped it would be when it comes to hosting big models in a serverless way (meaning you don’t get billed for all the hours the machine isn’t processing anything). In fact, at the time of writing, we are working hard on a better, more creative solution that uses available resources more smartly and requires fewer trade-offs on pricing and user experience.
Document Classification — Getting the right dataset and training it early on
For Uncover, we understand the importance of providing lawyers with clear insights into the composition of their legal cases. That’s why we developed a feature to classify each uploaded document according to category, such as agreements or court documents. However, training a classifier for this task posed a challenge: we needed a large, labelled dataset, specific to the legal domain, and nothing publicly available fit the bill.
To overcome this challenge, we collected and scraped data from various sources. We were also careful to avoid overfitting, especially given the small size of our initial dataset. For instance, we used techniques like k-fold cross-validation and kept documents from the same case file together to prevent the model from learning case-specific details.
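Keeping documents from the same case file together is group-aware cross-validation (scikit-learn ships it as `GroupKFold`). A dependency-free sketch of the splitting logic, with made-up case ids, could look like this:

```python
from collections import defaultdict
from itertools import cycle

def group_kfold(groups, n_splits):
    """Partition sample indices into folds so that all samples sharing a
    group (here: a case file) land in the same fold."""
    by_group = defaultdict(list)
    for idx, group in enumerate(groups):
        by_group[group].append(idx)
    folds = [[] for _ in range(n_splits)]
    # Round-robin over the groups, largest first, to keep folds balanced.
    members_by_size = sorted(by_group.values(), key=len, reverse=True)
    for fold_id, members in zip(cycle(range(n_splits)), members_by_size):
        folds[fold_id].extend(members)
    return folds

cases = ["case_a"] * 4 + ["case_b"] * 3 + ["case_c"] * 3 + ["case_d"] * 2
folds = group_kfold(cases, n_splits=3)
for test_fold in folds:
    train = [i for fold in folds if fold is not test_fold for i in fold]
    # No case file ever appears on both sides of a split, so the model
    # cannot score well by memorising case-specific details.
    assert {cases[i] for i in test_fold}.isdisjoint({cases[i] for i in train})
```

In practice you would reach for `sklearn.model_selection.GroupKFold` directly; the point of the sketch is only to show why the grouping prevents case-specific leakage between train and test sets.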
An interesting finding while testing the classifier was that experimentation results wouldn’t always match what users found. Standard metrics like F1 score, precision, and recall are useful for evaluating a classifier, but they don’t take the user factor into account. Call it instinct or feeling, but a user cares about what seems logical from their perspective. One inexplicable misclassification might well be worse than ten misclassifications that make sense to the user. A model that is “technically correct” but makes seemingly “stupid mistakes” might not be trusted as much.
To address this issue, we decided to include “user-instinct” adjusted metrics in our evaluation process. These metrics helped us build a model that users not only find accurate, but also intuitive and trustworthy.
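To make the idea concrete, a “user-instinct” adjusted metric could weight each confusion by how plausible it looks to a user. The categories and cost values below are purely illustrative, not Uncover’s actual taxonomy or weights:

```python
# Hypothetical cost of confusing the true category (first) with the
# predicted one (second); unlisted confusions default to full cost 1.0.
CONFUSION_COST = {
    ("email", "letter"): 0.3,  # an understandable mix-up
    ("letter", "email"): 0.3,
}

def instinct_adjusted_error(y_true, y_pred):
    """Mean misclassification cost: 'stupid' confusions count fully,
    understandable ones only partially."""
    costs = [0.0 if t == p else CONFUSION_COST.get((t, p), 1.0)
             for t, p in zip(y_true, y_pred)]
    return sum(costs) / len(costs)

y_true = ["email", "email", "letter", "court_judgment"]
pred_a = ["letter", "email", "letter", "court_judgment"]          # plausible miss
pred_b = ["court_judgment", "email", "letter", "court_judgment"]  # implausible miss

# Same plain accuracy (3 out of 4), very different adjusted error:
print(instinct_adjusted_error(y_true, pred_a))  # 0.075
print(instinct_adjusted_error(y_true, pred_b))  # 0.25
```

Two models with identical accuracy can score very differently on such a metric, which mirrors how users judged our classifier: one “stupid mistake” hurt trust more than several understandable ones.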
Designing an Intuitive UX — The precision problem and giving users a sense of control
As discussed before, AI is only one part of the equation. To build users’ trust, we needed to get the UX right. The UX became even more important when we realised it wasn’t feasible for the AI to be 99.9% precise. Initial training produced encouraging results, with 90 to sometimes 95 percent accuracy in the AI’s predictions. For many products only a few months into development, that would’ve been amazing. For Uncover, it wasn’t. If one out of ten documents is assessed incorrectly, the lawyers we interviewed said they would rather “just do the work myself”.
This is when we decided to shift our focus to the UX and UI — stepping away from the solution and back to the problem. We realised that users wanted more control over the results they were getting. Instead of “leaving it all to the AI”, we wanted users to interact more with the system and play a more central role. By giving control back to the users, we aimed to strengthen the trust lawyers place in the system. Specifically, we designed an intuitive UX/UI that makes it easy for users to verify whether the AI is right and to correct it if necessary.
For instance, in terms of document categories, we’ve designed the UX so that it’s very easy for the user to confirm or deny the predicted document category by either selecting a checkmark or selecting a different category. It’s easy and intuitive for the user, and it allows them to feel like they have control over the results.
Similarly, for the timeline feature, the user has the possibility to easily adjust predicted dates, change the automatically provided descriptions, or even completely add or remove events. By clicking on an event, the user is taken directly to the place in the document where the date and context were taken from, so they can assess whether the results were accurate or not. If the AI suspects there is date information, but cannot distinguish the date exactly (e.g. when information is missing), the UX will present this event in a different colour to indicate it needs help from the user (see the image below). This way, the user can be confident they will not just be presented with a ‘best guess’, but will be consulted when necessary.
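In code, the “consult the user” rule can be as simple as a flag on each extracted event. The event model and threshold below are our own sketch of the idea, not Uncover’s implementation:

```python
from dataclasses import dataclass

@dataclass
class TimelineEvent:
    description: str
    date_text: str     # the raw text span the date was taken from
    resolved: bool     # could the pipeline pin down an exact date?
    confidence: float  # model confidence in [0, 1]

    def needs_review(self, threshold: float = 0.8) -> bool:
        """True when the event should be shown in the 'help me' colour
        instead of being presented as a silent best guess."""
        return (not self.resolved) or self.confidence < threshold

signed = TimelineEvent("Contract signed", "14 March 2019", True, 0.95)
vague = TimelineEvent("Meeting held", "early that spring", False, 0.40)
print(signed.needs_review())  # False
print(vague.needs_review())   # True
```

The UI then only has to render `needs_review` events in the distinct colour and route the user to `date_text` in the source document for verification.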
Instead of trying to achieve unrealistic goals with the AI, we made sure the application helped the end-user out when the AI wasn’t able to. We gave the users an increased sense of control over the results which in turn, stimulates trust in the product. As a bonus, the input from real users helps train the AI in making better decisions, getting us ever closer to that fabled 99.9%.
We’ve definitely learned a thing or two throughout our journey building Uncover. Here are a few of those which we think can be useful for those of you building complex software products:
1. Think about data and scalability from the very beginning.
With AI products, lack of data is the problem more often than not. For Uncover, getting labelled data was particularly challenging. So even before you start, figure out whether the data you need is available and accessible, and how you’ll get it. You also need to think about scalability from the get-go: build a scalable infrastructure, because the product needs to scale easily to multiple customers and users. Read more about how to nail the balance between fast delivery and scalability in this article by Rieks Visser, our Product Delivery Manager and Agile Coach.
2. Don’t get carried away by the AI.
Yes, AI is central to every AI-first product. But when it becomes apparent that the level of AI precision needed for full automation is not realistic, think instead about how an interface can support the end-user in reaching their goal. Learn to love the problem instead of the solution, and don’t waste time chasing unrealistic goals. What matters most is designing a product the customer needs. Beyond the AI itself, transparency and thoughtful UX design go a long way.
3. Put yourself in the shoes of the users.
At the end of the day, it boils down to the users. If the users don’t want to use the product, the product has no value. Consistently ask yourself: what will the users think of this? Even better, ask them directly. It is crucial to collaborate closely with your launching customers. Always test and get feedback from them — this is how you learn what they truly need and how they perceive the product. Just like the “user-instinct” example: without that feedback, we would’ve ended up with a model users wouldn’t trust. In the end, our work on Uncover wouldn’t have been possible without our launching customers and the time they spent giving us valuable feedback.
. . .
This article was co-written by: