The challenge of using open source trial data to drive ML and what Lindus Health is doing differently

September 30, 2022



Machine learning is based on the idea of computer programmes learning from experience and improving themselves so that they can make accurate predictions about new events. Data is, of course, central to this. We cannot expect accurate predictions unless we feed good quality data into a machine learning algorithm.

At Lindus Health, we use machine-learning-powered models to unlock the power of historical data from the clinical trials domain and build innovative tools around all aspects of a trial. These include optimisation tools for trial design, recruitment and monitoring. Our initial approach to developing ML-based tools was to search for open source data in the clinical trials domain. Using open source data was both a challenge and a great opportunity for us to learn what good quality data means in this domain.

What are the public clinical trial data sources for ML/AI? 

Clinical trial registries are responsible for collecting clinical trial data and making it available. The registries exist to provide complete, accurate and timely study data. The largest such platform globally contains over 400,000 study records from 220 countries, with each study registered to the database by its sponsor or principal investigator. This data can be easily downloaded, meaning researchers have access to the largest clinical trial database to fuel machine learning projects.

But is more data always better?

In short, no. The problem is that the amount and quality of the data have changed dramatically over time. It would be naive to assume that there are 23 years’ worth of quality clinical trial records, as most of the early recorded trials include missing values and/or incorrect data.

Why has this happened? Looking at the number of studies registered over this period, there is a dramatic increase starting from 2005, which is when the International Committee of Medical Journal Editors (ICMJE) made trial registration a condition of publication. As Figure 1 shows, before 2005 the number of trials registered in the database was significantly smaller than in the following years, and there was less guidance on how to ensure high quality data upload.

Figure 1 Number of clinical trials recorded in the registry from 1999 to 2021

Next, Figure 2 shows the average missing value rate per year. This analysis takes into account the 24 study design features we used in one of our machine learning projects to predict the probability of early trial termination. In 2006, the World Health Organization (WHO) stated that all clinical studies should be registered with a minimum dataset of 20 features, and the FDA Amendments Act of 2007 (FDAAA) expanded the requirements further. These stricter registration requirements substantially decreased the rate of missing information over time, especially from 1999 to 2008, as displayed in Figure 2.

Figure 2 Average missing value rates per year from 1999 to 2022 for the 24 features used in this study
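
As a rough illustration of the kind of calculation behind Figure 2, the sketch below computes an average missing value rate per year over a set of study records. It is a minimal example and not our production code: the field names (start_year, phase, enrollment, allocation), the three sample features and the four sample records are all made up for illustration, and a real analysis would run over the 24 features and the full registry download.

```python
from collections import defaultdict

# Hypothetical records: field names and values are illustrative only,
# not the registry's actual schema or data.
RECORDS = [
    {"start_year": 1999, "phase": None, "enrollment": None, "allocation": "Randomized"},
    {"start_year": 1999, "phase": "Phase 2", "enrollment": None, "allocation": None},
    {"start_year": 2010, "phase": "Phase 3", "enrollment": 120, "allocation": "Randomized"},
    {"start_year": 2010, "phase": "Phase 1", "enrollment": 30, "allocation": None},
]
FEATURES = ["phase", "enrollment", "allocation"]


def is_missing(value):
    """Treat None and blank strings as missing."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def missing_rate_by_year(records, features, year_key="start_year"):
    """Return {year: share of feature values that are missing in that year}."""
    missing = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        year = record[year_key]
        for feature in features:
            total[year] += 1
            if is_missing(record.get(feature)):
                missing[year] += 1
    return {year: missing[year] / total[year] for year in sorted(total)}


rates = missing_rate_by_year(RECORDS, FEATURES)
for year, rate in rates.items():
    print(f"{year}: {rate:.0%} of feature values missing")
```

Because every record contributes the same set of features, this pooled rate equals the average of the per-feature missing rates for each year.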

Even with these regulatory changes, which aim to increase the amount and quality of the data, there will of course be cases where the data is inaccurate or unusable. For example, a study may have its status recorded as ‘completed’ when in fact it enrolled exactly 0 patients! This is why manual logic checks are still required on the features selected from this dataset before they are used in ML/AI models.
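
A logic check of this kind could be sketched as follows. This is an assumption-laden example rather than our actual rule set: the field names (status, enrollment, start_year, completion_year) are hypothetical, and only the first rule comes from the example above; the second is an extra consistency rule added purely for illustration.

```python
def logic_check(record):
    """Return a list of human-readable problems found in one study record.

    Field names are hypothetical and do not reflect a real registry schema.
    """
    problems = []
    status = (record.get("status") or "").strip().lower()
    enrollment = record.get("enrollment")

    # Rule from the text: a 'completed' study cannot have enrolled nobody.
    if status == "completed" and enrollment == 0:
        problems.append("completed with zero enrolled participants")

    # Illustrative extra rule (an assumption): dates must be in order.
    start = record.get("start_year")
    end = record.get("completion_year")
    if start is not None and end is not None and end < start:
        problems.append("completion before start")

    return problems


studies = [
    {"id": "A", "status": "Completed", "enrollment": 0},
    {"id": "B", "status": "Completed", "enrollment": 85,
     "start_year": 2012, "completion_year": 2010},
    {"id": "C", "status": "Recruiting", "enrollment": 0},
]

# Keep only the studies that fail at least one check, for manual review.
flagged = {s["id"]: logic_check(s) for s in studies if logic_check(s)}
print(flagged)
```

Flagged records would go to a person for review rather than being dropped automatically, since a failed check can point to a data entry error that is worth correcting.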

How Lindus Health's machine learning based tools will be powered in the future

Data collection in many clinical trials is a manual process, and it is hard to automate the prevention of missing values and incorrect entries. At Lindus Health, we decided to collect the best quality data we can, so that our future ML/AI products deliver a better trial experience for patients and trial staff. In practice, this means our product team takes great care to make sure that data entered into our systems by clinical staff is clean, structured and, most importantly, auditable.

As we deliver clinical trials end-to-end, we collect data across the entire lifecycle of a trial, which gives us a unique opportunity to connect data from different stages for further insights.

This allows us to train our machine learning models on complete and accurate data and to see the results of a trial in a much wider context than is possible with publicly available or siloed datasets. Our mission in this context is to build ML/AI-powered tools within our software platform using our own data, collected to the highest standards, and to apply them to problems such as recruitment optimisation and participant retention prediction.
