Generativity / Data


Data, Data, Everywhere.
Nor Any Drop to Drink.

Laborious: A Robust Laboratory/
Symptom Dataset with AI Diagnostic Inference


It’s quite surprising just how much readily available scientific data one can freely access nowadays. A myriad of taxpayer-funded initiatives is available through the NCBI (National Center for Biotechnology Information) for direct interrogation. These include the ubiquitous PubMed, PMC, and MeSH portals, in addition to more discreet gateways such as dbVar, dbSNP, Taxonomy, ClinVar, and Gene. As government-funded resources, these portals are open access and there is no charge for their usage. What makes these resources especially snazzy to the informatically inclined scientist is the ability to access this information programmatically through the NCBI Entrez E-Utilities toolbox. These are a collection of commands (usually called handlers) that allow a coder to integrate aspects of the NCBI universe into their own programs and act upon them in an analytical manner.
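To make that concrete, here is a minimal sketch of calling an E-Utilities handler. The base endpoint and handler names (esearch, esummary, efetch) are real, documented parts of the E-Utilities API, but the helper function and the PubMed ID used here are purely illustrative; actually retrieving the record requires network access (and, for heavy use, an NCBI API key).

```python
from urllib.parse import urlencode

# Base endpoint for the NCBI Entrez E-Utilities; the helper below just
# assembles request URLs for its various handlers.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def eutils_url(handler: str, **params) -> str:
    """Build a request URL for an E-Utilities handler such as
    'esearch', 'esummary', or 'efetch'."""
    return f"{EUTILS}/{handler}.fcgi?{urlencode(params)}"

# Fetch a PubMed abstract by its PMID (the ID here is illustrative):
url = eutils_url("efetch", db="pubmed", id="12345678",
                 rettype="abstract", retmode="text")
print(url)
```

Feeding that URL to any HTTP client returns the plain-text abstract, which is essentially how a "click through to the study" link can be wired up.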

For example, in my program Opus23, a natural product listed as having some effect upon the expression of a specific gene will provide the doubting Thomas the option of clicking through and viewing the actual study (or studies) that made that observation. This is accomplished by pinging the appropriate E-Utilities handler with the PubMed ID number. This also takes place behind the curtain that separates the Opus23 users and editors. For example, when an editor adds a new SNP to the Opus23 human-curated database, they plunk in a few basic details (such as the reference sequence number; the ‘rs’ that is every SNP’s home address), and the system then pings a variety of services (DbVar, ClinVar, Pharmacogenomics, Gene, GWAS, etc.) and proceeds to build up the entry. This is known as ‘scraping’ the data.

In addition to NCBI and other academic and governmental data pipelines, a surprising amount of data now accompanies original research, often as nothing more than the lowly comma-separated value (CSV) file used by most spreadsheets. Most of the time these data are just the results of a specific operation or experiment and have only limited usefulness, but when provided as part of thorough review articles or meta-analyses, they can be quite helpful. Finally, we should not underestimate the willingness of the average scientist to share data. Very often a respectful inquiry will result in an act of great generosity; in a few instances not only was I provided with the requested data, but the author took the time to perform regressions and other types of cleanups, simply out of professional courtesy. The interaction network data behind the ‘Radiance’ microbiome app I wrote about a few columns back was the result of just such beneficence.

Finally, we are left with the least exciting option and outcome: We have a plan for some sort of cool, awesome analysis but no form of relevant, centralized data exists. Then, sadly, it’s time to roll your own.

I ran into this roadblock a few years ago when I became interested in using the Naïve Bayes Classifier AI algorithm to allow laboratory outcomes to probabilistically infer specific diagnostic odds.  What made this an especially attractive idea was the subsequent link that could be drawn between the resulting diagnostic outcomes and the constellation of symptoms that accompany them. Thus, by inputting a selection of labs and their values, the user could calculate the odds of pathologies and then be provided with the relevant symptom cluster that should be anticipated from that outcome.

But back to the original problem. There is no centralized dataset of lab values classified according to their odds of implicating a specific pathology. It’s not that the data is non-existent; it certainly is out there. It’s just not coded in a way that allows for probability-type analysis. Ironically, sometimes the data is too good, too specific, for this purpose. For example, if we were to meet in a hallway and I were to inquire about patient Joe Blow’s diabetes, you might respond by saying ‘His HgbA1c is 10.8.’ Or, you might simply say ‘His HgbA1c is awfully high.’

Since it is at heart a classifier, the Naïve Bayes algorithm prefers the latter depiction. Frankly, as a busy human, so do I.
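Here is a toy sketch of what that looks like in code: a Naive Bayes classifier working over categorical lab outcomes ('low', 'high') rather than raw numbers. The handful of training records below is entirely invented for illustration; the real Laborious database links thousands of such outcome classes to pathologies, and its actual implementation is the author's own.

```python
from math import log, exp

# Invented training records: (pathology, {lab: outcome class}).
records = [
    ("inflammation", {"wbc": "high", "crp": "high", "hematocrit": "low"}),
    ("inflammation", {"wbc": "high", "hematocrit": "low"}),
    ("iron deficiency anemia", {"hematocrit": "low", "hemoglobin": "low"}),
    ("iron deficiency anemia", {"hemoglobin": "low", "ferritin": "low"}),
]

def posteriors(observed):
    """P(pathology | labs), normalized, with simple Laplace smoothing."""
    classes = {c for c, _ in records}
    scores = {}
    for c in classes:
        rows = [labs for cls, labs in records if cls == c]
        logp = log(len(rows) / len(records))           # class prior
        for lab, value in observed.items():
            hits = sum(1 for labs in rows if labs.get(lab) == value)
            logp += log((hits + 1) / (len(rows) + 2))  # smoothed likelihood
        scores[c] = exp(logp)
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

probs = posteriors({"hematocrit": "low", "wbc": "high", "hemoglobin": "low"})
for pathology, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{pathology}: {p:.2f}")
```

Notice that the classifier never sees a number like 10.8; it only ever reasons over ‘high’ and ‘low’, which is exactly why the data had to be coded into outcome classes first.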

So, before we can even begin to design our lab AI inference engine, we’ll need to roll up our sleeves and get into the trenches. I’ll spare you the gory details, but thanks to several of the University of Bridgeport Naturopathic Pathfinder Scholars, and most significantly to my friend and colleague Valentin Prisecaru, over the last three years we built a robust database of over 5000 individual data points linking lab outcome classes to specific pathologies. Then I coded a script that allows for simple filtered queries via a smart table, or, going deeper, queries with specific lab outcomes.

Here’s how you can play with it.

The Laborious App

Figure 1. The Laborious Opening Screen

You can run this script (Laborious) from any laptop, pad, or desk computer (sorry, not great on smart phones) by pointing your browser (extensively tested on Chrome) here: https://www.datapunk.net/tlfd/.

To protect against tiresome robots, you may have to prove you are human by simply moving the slider’s blue dot to the middle of the range. If you’ve been here recently, you’ll already have the access cookie and will just be directed to the main splash page. From there click on the link that reads Laborious (Lab AI Inference) under the category AI/ML. That will take you to the front page of the app, which shows the data in a ‘smart table’ (Figure 1).

Figure 2. The Laborious Smart Table Showing Filter Function

The smart table gives you the ability to look at the data behind Laborious. It is broken down into several columns (Pathology, Lab, Observed Value, and Significance). Each column is sortable by clicking on the column heading. There are a total of 4843 individual records, although only 15 are shown on a page. You can increase the number shown, up to 100 entries at a time. You can filter results by using the Search field and typing in a full or partial search term. For example, if I type ‘anemia’ into the search bar, the table shows only entries with that text fragment in their field (Figure 2).

As you can see, filtering by ‘anemia’ drops the number of records shown to 83. Much more manageable! As a data analysis tool, the Laborious smart table provides enough functionality to stand alone as an app. However, we’re just getting started.

Figure 3. The Laborious Query Form

Click on the link in the upper left of the screen that reads ‘Run Query.’  You’ll be transported to the Laborious AI query form (Figure 3).

Choose up to ten lab biomarkers to include in your analysis. Entries will autocomplete as you begin to type. For each biomarker, select the appropriate result range and, if desired, assign an additional ‘weight’ to the observation. Although not mandatory, assigning a weight value tells the AI that you think this biomarker is especially noteworthy. In our case, we’ll run a query that tells Laborious that we have a cluster of biomarkers that includes a low hematocrit, a high white blood count (WBC), and a low hemoglobin. We’ll leave the weight values at 1 for this test. After inputting our info, we then press the ‘Run Laborious’ button and, through the magic of television, Figure 4 appears.
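One plausible way such a weight could enter a Naive Bayes score is as a multiplier on a biomarker's log-likelihood contribution (equivalently, raising its likelihood to a power). The numbers below are invented, and this is only a guess at the mechanism; the column doesn't spell out Laborious's actual weighting formula.

```python
from math import log

# Illustrative conditional probabilities P(observation | pathology):
likelihoods = {
    "hematocrit:low": 0.75,
    "wbc:high":       0.75,
    "hemoglobin:low": 0.25,
}
# Flag the WBC result as especially noteworthy (default weight is 1):
weights = {"wbc:high": 2.0}

# Weighted log-score: each term's contribution is scaled by its weight.
score = sum(weights.get(obs, 1.0) * log(p) for obs, p in likelihoods.items())
unweighted = sum(log(p) for p in likelihoods.values())
print(score, unweighted)
```

With all weights left at 1, as in our test run, the two scores are identical, which is why weighting is optional.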

Figure 4. The Laborious Run Query Results Upper Panel

The results screen is divided into two parts, an upper and a lower panel. The upper panel has a small table that just reiterates your search criteria and a ‘Back to Query’ link that returns you to the prior form with your original criteria maintained. The next table, ‘Diagnosis Probability by Quantile,’ delivers the real punch line. As you can see, it is a table of five columns containing a variety of diagnostic possibilities and their relative probabilities, displayed as a number between 0 and 4. The columns are quantiles (statistical cut-point intervals) of the results, decreasing in probability from left (quantile 4) to right (quantile 0). Under these conditions, inflammation looks like the winner, although there are elements in the next quantile with significant possibilities as well.
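The quantile bucketing the table describes can be sketched in a few lines: cut the distribution of posterior scores at four points, then label each diagnosis by how many cut points it sits above. The scores and diagnosis names below are invented stand-ins, not output from the app.

```python
from statistics import quantiles

# Invented posterior scores for a handful of diagnostic possibilities:
scores = {
    "inflammation": 0.31, "iron deficiency anemia": 0.22,
    "chronic infection": 0.18, "leukemia": 0.11,
    "thalassemia": 0.08, "dehydration": 0.06, "b12 deficiency": 0.04,
}

# Four cut points split the score distribution into five intervals.
cuts = quantiles(scores.values(), n=5)

def quantile_bin(value):
    """Return 0-4: the number of cut points this value sits above."""
    return sum(value > c for c in cuts)

for dx, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"Q{quantile_bin(s)}  {dx}")
```

The top scorer lands in quantile 4 and the weakest in quantile 0, mirroring the left-to-right layout of the results table.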

I’m sure there are many readers who could have deduced many of these possibilities, but I like to think of an app like this as helping me to define the ‘negative state space’ of all possibilities, i.e., the outcomes that I would not have thought of based upon my own sense-data. In that sense, AI provides a glimpse into the invisible world of our limitations. And anything that decreases the chances of an error of omission is okay in my book.

Figure 5. The Laborious Run Query Results Lower Panel

Scrolling down, we see a heatmap of the symptoms a patient might be expected to present with, given the resulting diagnostic possibilities (Figure 5).

This looks much better on the computer screen than in print because, as a heatmap, the values are colored on a glorious gradient from autumnal yellow to red. Now we can compare the client’s presenting symptoms to see if they correlate more closely with a particular diagnosis.
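For readers curious how such a gradient is built, here is a small sketch that linearly interpolates a score between a yellow and a red endpoint and emits a hex color for the heatmap cell. The exact palette and the symptom scores are my own guesses for illustration, not values taken from the app.

```python
def heat_color(value, lo, hi):
    """Map a value in [lo, hi] onto a yellow-to-red gradient hex color."""
    t = 0.0 if hi == lo else (value - lo) / (hi - lo)
    t = max(0.0, min(1.0, t))                # clamp to the gradient's range
    yellow, red = (255, 215, 0), (255, 0, 0)
    rgb = tuple(round(y + t * (r - y)) for y, r in zip(yellow, red))
    return "#%02X%02X%02X" % rgb

# Invented symptom scores mapped to cell colors:
for symptom, score in [("fatigue", 0.9), ("pallor", 0.5), ("fever", 0.1)]:
    print(symptom, heat_color(score, 0.0, 1.0))
```

A score at the bottom of the range comes out pure yellow and one at the top pure red, with everything in between shading smoothly across the autumnal middle.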

I’ve shared Laborious with a few colleagues, and their feedback has been quite positive. One comment that I thought was particularly heartening came from a former protégé, now in a solo private practice in a far-off state. She said that this app made her feel like she had a colleague in the practice that she could bounce ideas and possibilities off, and perhaps just a bit less lonely and isolated.

Peter D’Adamo is a physician and distinguished professor of clinical medicine at the University of Bridgeport School of Naturopathic Medicine. His New York Times bestselling books have sold over 8 million copies and have been translated into over 75 languages. He is the developer of the acclaimed Opus23 genomic software suite and a variety of other generative apps that can be explored at www.datapunk.net.  In his spare time, he brings old VW Beetles back to life at his garage on www.kdf20.com.