In: Enterprise Biology Software, Version 6.0  © 2006 Robert P. Bolender                            

 

Enterprise Biology Software: VI. Research (2006)

 

Robert P. Bolender

Enterprise Biology Software Project, P. O. Box 303, Medina, WA  98039-0303, USA

http://enterprisebiology.com

 


Summary

 

The Enterprise Biology Software Project explores challenging questions in the life sciences by looking for answers in the biology literature.  Since many of these questions can be translated into mathematical puzzles, they can be solved – for the most part - by generating empirical equations.  In turn, these equations can produce new and more difficult puzzles often extending to the mathematical core of biology.  For example, the progress report this year describes how we can solve one of these harder puzzles by first separating two interacting complexities and then unfolding each complexity in its turn.  In effect, the question: “How do we reverse engineer biology?” becomes a double puzzle: “How do we separate two complexities (methods and biology) and how do we unfold and refold them?”  A solution to the methods puzzle becomes a Universal Biology Database, which in turn can be used to solve the biology puzzle by characterizing biological phenotypes with equations.  Such phenotypes can be represented either as a single equation or as a stack thereof.  When summarized across biology, these equations offer a glimpse of the core by revealing the mathematical organization of the parts as a biological blueprint.  There is, however, one small surprise.  In contrast to the widely held view that biology is largely a nonlinear system, most of these phenotypes turn out to be linear.  This finding is most welcome in that it greatly simplifies the task of reverse engineering biology.  By including a query by example interface, the 2006 software package offers ready access to this new data-driven biology and may become one of our technology tickets to the future.  Imagine – if you will - a time when each research paper becomes a puzzle with a unique mathematical solution, where biology operates according to a rulebook, and we can read from that book.  Interested?  I invite you to run the software package. 

 


 

Introduction

 

Biology plays by the rules.  By discovering these rules and also playing by them, we improve our chances for success – in whatever we attempt.  This defines a central strategy of the Enterprise Biology Software Project.  Indeed, this process of discovery based on solving mathematical puzzles is wonderfully simple and can be applied by everyone.  All we have to do is follow the clues to find the rules that solve the puzzles.  Each solution, of course, generates a new collection of clues and the process continues from one puzzle to the next.    

 

This rule-based approach has yielded two key pieces of information.  We now know that repertoire equations can unfold complexity into a well-defined order, wherein these power equations have coefficients of determination approaching one (r^2 ≈ 1).  Moreover, these equations often display an exponent a (in Y = bX^a) that also tends toward one.  This means that the repertoire equations effectively become linear (Y = bX). 

 

Since the repertoire equations define order in biology as the proportion of two parts X and Y, we can assemble enough information to begin the process of reverse engineering biology.  Moreover, we can design a Universal Biology Database with virtually unlimited scaling properties by simply arranging the ratios of the parts (Y/X) in numerical order and assigning them to decimal steps.  When fitted to regression curves, these data pairs become decimal repertoire equations.  The database becomes universal because it can accept data from all disciplines capable of forming ratios of structures, which of course includes molecules.  By storing data from many different disciplines in the same database table, the data become integrated - mathematically – across disciplines, animals, settings, et cetera.

 

The software is full of surprises.  When running the program from tab 8 of the Universal Biology Database, for example, we can easily compare the proportions of the same two structures in control and experimental settings.  I was surprised by how often controls and experimentals share the same decimal repertoire equation – even when the absolute amounts differ.  This tells us that biology prefers to make more or less of the same thing, rather than making something different.  Changes in the proportions of structures, however, can be readily found in development, disease, and aging.  Perhaps the most reassuring finding of all, however, is the fact that many different labs routinely produce the same results (i.e., the experimental data share the same decimal repertoire equation(s)).  This shows that both the methods and the researchers applying them are indeed capable of generating reproducible results. 

 

The new software package comes with a new learning curve, but one that has a very gentle slope.  Specifically, by adding a “query by example” interface we can write and send complex instructions to the database.  This provides full access to the information with great ease and flexibility.  Simply working through the examples should be all that is required to become comfortable with the new format.

 

Although we’ll consider these topics later in this report, a brief introduction here may be helpful.  Why do we want to reverse engineer biology?  The answer is painfully simple.  It encourages biology to become a data-driven science.  Recall that chemistry is a product of physics and that biology is a product of both physics and chemistry.  Chemistry, for example, is very effective as a data-driven science because it plays by a set of stoichiometry rules.  We know these rules and can use them to explain how small parts (atoms) are connected to form larger molecules and compounds and to write balanced equations for chemical reactions.  In other words, we have access to the mathematical core of chemistry.  Do analogous rules and balanced equations exist for biology?  Yes, of course they do.  The only difference is that the rules of biology are much better hidden than those of chemistry and must be extracted from the research literature with the help of mathematics and technology.  What evidence supports this claim?  Biological stereology can provide the balanced equations and a Universal Biology Database the stoichiometry of the parts.  In effect, we can produce a biological blueprint that shows how the parts of biology larger than molecules are connected by rule.  As such, this blueprint provides the first of many new access routes to the mathematical core of biology – exactly what we would expect from a data-driven science.     

 


 

Methods and Results

 

 

The main purposes of the Progress Report this year are (1) to introduce a new interface for the literature database and (2) to explore an engineering model (forward and reverse) for biological research.  As always, the best introduction to the software package comes from running the programs. 

 

 

Extended Database Model

 

The central feature of the first universal database was a single table that could accommodate both control and experimental data – even when coming from many different disciplines (Bolender, 2005).  Effectively, it defined a simple - yet powerful - way of integrating diverse data within and across research fields.  Unfortunately, most of the details – defined as data entry fields in the original databases - were no longer available.  The new database remedies this shortcoming by connecting all three databases into a single operational unit. 

 

 


By adding a “query by example” interface to the Universal Biology Database 2, even the beginner can quickly learn to write intricate SQL (structured query language) scripts and submit them to the database. 

 

 

 

Enterprise Biology Software for 2006

 

 

This year the software is being distributed on a mini CD, one that fits conveniently in a standard envelope.  The CD includes a relational database, a runtime database engine, and a small collection of programs and documents.   

 

 

The main menu of the program includes the progress report and an index of programs.  Click on an item in the list to view it. 

 

 

1.     Progress Report: A pdf file (this document).

 

2.      Universal Biology Database: The database folder includes eleven tabs - six of which can be used to run programs.  A central feature of the new software consists of a query by example interface.  It allows us to define a database query by simply selecting items from lists or by entering a word (or part thereof), number, or a collection of words or numbers into one or more data entry fields.  As each selection is made, the SQL script – often shown at the bottom of the screen - is modified accordingly.  When completed, the set of instructions (query) is sent to the database by clicking on the Retrieve button.  The results can be viewed, printed, stored as a file, or sent to an Excel worksheet.

 

 

Tab 1 – Welcome: Click on <Read> to view the objectives.

 

 

Tab 2 - Introduction: Click on <Read> for a brief introduction to the database.

 

Tab 3 - Background: Click on <Read> for an explanation of why stereological data can create a foundation for the Universal Biology Database.

 

Tab 4 – UBD data table: Select tab 4 and click on run.  This screen displays the main table of the Universal Biology Database 1.0 updated to 2.0.  Green identifies control data and blue experimental.  In the lower left-hand corner of the screen – just to the left of the horizontal scrolling arrow – note the thick black line.  Using the mouse cursor, drag it to the right to produce a split screen.  The Excel worksheet used to calculate the decimal repertoire equations can be found at C:\Program Files\EBSTicket 2006\Files\2006_regression_equations_01.xls.

 

 

Tab 5 – UBD control data: Select tab 5 and click on run.  The screen displays the new query by example interface.  Notice that the same data fields appear under two different headings – data catalogue (left panel) and query by example (right panel).  Use the catalogue to discover what the database contains and the query panel to assemble a set of instructions for the database.  Numerous examples of query criteria appear in the drop down lists, illustrating the remarkable flexibility of this approach.  Simply work through some of the examples to see how the interface works.

 

 

A few introductory comments, however, may be helpful.  Items selected from either panel – catalogue or query – can be used in a search.  Items selected from the catalogue will retrieve all those items identical to the one chosen.  For example, selecting <bird> from the catalogue panel and clicking on the Retrieve button produces 58 responses.  In contrast, typing <like %bird%> into the organism field of the query panel produces 110 responses.  In the first case, only those rows containing just the word <bird> were retrieved, whereas in the second case all those entries containing the word <bird> alone or in a word string were retrieved.  Furthermore, note that entries coming from the catalogue panel are case sensitive (upper and lower case), whereas those from the query panel are not.
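The difference between the two kinds of search can be sketched with a small, self-contained example.  This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module rather than the database engine shipped on the CD, and the table name (ubd_control), the column name (organism), and the six rows are hypothetical.  In SQLite, an equality test is case sensitive and matches the whole entry, whereas LIKE with % wildcards matches the word anywhere in the entry and ignores case for ASCII letters, which mirrors the behavior described above.

```python
import sqlite3

# Toy stand-in for the control data table. The table name, column name,
# and rows are hypothetical and serve only to illustrate the two searches.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ubd_control (organism TEXT)")
rows = ["bird", "bird", "Bird", "bird (chicken)", "songbird", "rat"]
conn.executemany("INSERT INTO ubd_control VALUES (?)", [(r,) for r in rows])

# Catalogue-style search: exact, case-sensitive match on the whole entry.
exact_count = conn.execute(
    "SELECT COUNT(*) FROM ubd_control WHERE organism = 'bird'"
).fetchone()[0]

# Query-panel search: pattern match anywhere in the entry.
# In SQLite, LIKE ignores case for ASCII letters by default.
like_count = conn.execute(
    "SELECT COUNT(*) FROM ubd_control WHERE organism LIKE '%bird%'"
).fetchone()[0]

print(exact_count, like_count)
```

As in the real database, the pattern search returns more rows than the exact search because it also picks up entries in which the word is embedded in a longer string.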

 

Let’s try a simple query together.  The objective of our search will be to retrieve all the data coming from the CA1 region of the control hippocampus.  As shown below, we have written our request in the X Name field <like %ca1%>, using the query panel.  Notice how the SQL script in the box at the bottom of the screen reflects our choice.

 

 

Clicking on the Retrieve button yields the result below, which includes a collection of 920 screens.

 

 

To view these data as a scrolling screen instead, press the view data button and the following screen appears.

 

 

When we scroll to the end of this table, two graphs appear.  The first shows a log-log plot of the X, Y data and the second a histogram of the decimal repertoire equations.  The log-log plot displays the published data of CA1 as a mathematical puzzle.  Finding a solution consists of using regression analysis to fit the points to power curves that carry coefficients of determination (r^2) close to one.  This process generates a family of power curves – called repertoire equations - that define how CA1 is related to itself and to other structures in the brain.  Alternatively, the process can be simplified.  Since each data ratio in the Universal Biology Database is attached to a decimal repertoire equation, a collection of ratios automatically becomes a stack of equations.  Notice the distinct steps in the histogram in the second graph.  They represent the repertoire equations as decimal steps and illustrate the total range of connections available to the CA1 region of the control hippocampus.  Recall that each connection also defines the proportion of the two parts, X and Y.        

 

 

To illustrate the process of finding equations, we can send this data table to an Excel worksheet and then do some curve fitting.  This is accomplished by clicking first on the to excel button and then on the from excel button.  To keep a copy of this worksheet, change the name from working.xls to something else.  Otherwise, the file will be written over the next time you click on the to excel button.  (To set the Excel path, click on INSTALL READER in the main menu.)    

 

 

Using the graphing tools of Excel, we can readily express the connections between CA1 and all other structures as a family of power curves.  In the example below, the regression curve is calculated for those points belonging to the 0.5 decimal repertoire equation.  This equation tells us that two parts of structure X are connected to one part of structure Y (Y/X = 1/2 = 0.5, or X:Y = 2:1), where the parts can be numbers of cells or volumes of compartments.  Notice that the r^2 is indeed close to one (0.9999).  Generating these repertoire equations is a first step toward working out how structures are related to one another.  Once we know these relationships, the task of writing equations for structures all across the biological hierarchy of size becomes routine.  Recall that a solution to one of the puzzles last year consisted of writing the repertoire equations for the hippocampus and then connecting them to produce a network (Bolender, 2005).  This allowed us to predict the many parts of the hippocampus from a single value – for five different animal species – in health and disease.  In short, the network of repertoire equations – produced by reverse engineering the hippocampus – illustrates how these equations can be used in developing algorithms for diagnosis and prediction.    
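For readers who prefer a script to a worksheet, the same kind of curve fitting can be sketched in a few lines of Python.  This is a minimal sketch, not the procedure used in the package: it assumes the power curve Y = bX^a is fitted by ordinary least squares on log-transformed values, the function name fit_power_curve is hypothetical, and the five data points are synthetic values that lie exactly on Y = 0.5X rather than published data.

```python
import math

def fit_power_curve(xs, ys):
    """Fit Y = b * X**a by least squares on log10(X), log10(Y).

    Returns (b, a, r2), where r2 is the coefficient of determination
    of the straight line in log-log space.
    """
    log_x = [math.log10(x) for x in xs]
    log_y = [math.log10(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((lx - mean_x) ** 2 for lx in log_x)
    syy = sum((ly - mean_y) ** 2 for ly in log_y)
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    a = sxy / sxx                   # exponent (slope in log-log space)
    log_b = mean_y - a * mean_x     # intercept in log-log space
    r2 = (sxy * sxy) / (sxx * syy)  # squared correlation of the logs
    return 10 ** log_b, a, r2

# Synthetic points on the 0.5 step: one part of Y for every two parts of X.
xs = [10.0, 50.0, 200.0, 1000.0, 5000.0]
ys = [0.5 * x for x in xs]

b, a, r2 = fit_power_curve(xs, ys)
print(f"Y = {b:.4f} X^{a:.4f}, r^2 = {r2:.4f}")
```

With real data the points scatter around the line, so b, a, and r^2 will only approximate 0.5, 1, and 1.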

 

 

Tab 6 – UBD experimental data: Select tab 6 and click on run.  Here the interface is the same as the one just described for the control data.  Therefore, a brief example will suffice.  Let’s look for all the data on schizophrenia in males over the age of 60.  The screen below shows how we access these data, using our query by example interface.

 

 

The new puzzle shown below can be solved for the decimal repertoire equations, which in turn can be used to compare the schizophrenia data for this group with those for males of other ages, for females, or for other diseases of the central nervous system.  Alternatively, we can search for potentially related structures simply by viewing the contents of an individual decimal step in the data table.

 

 

Tab 7 – UBD control + experimental data: Select tab 7 and click on run.  Here, the Universal Biology Database table includes all the control and experimental data.  Let’s try another query.  What is the effect of aging on mitochondria?  For the X structure we can type in <like %aging%> and for the Y structure <like %mito%>.   Our result is a collection of control and experimental points that can be evaluated either by inspection or by fitting the points to log-log regression lines (power curves). 

 

Tab 8 – UBD connection repertoire - Blueprint 1.0 - Control Data: The figure below illustrates the biological blueprint as a three-dimensional plot.  It represents a collection of equations defining the proportions of one structure to another.  In effect, it gives us an empirical view of the mathematical core of biology.   

 

 

Select tab 8 and click on run.  The connection repertoire table uses the decimal repertoire equations – expressed as proportions of whole numbers - to show how biological parts are connected by rule.  In effect, it provides a structural blueprint for biological parts larger than molecules in terms of a well-defined stoichiometry.  This mathematical overview of biology shows how structural connections define phenotypes and can provide insights into how, when, and where these phenotypes change. 

 

The connection repertoire table shown below identifies the structures in an X,Y pair and shows how they are connected quantitatively to each other and to related structures.  To be included in the table, the same data pair must appear at least three times in the database.  This represents a rigorous test of both the methods and the investigators in that the three data pairs typically come from three different papers.     

 

 

Several distinct patterns quickly emerge from this table.  A given pair of structures (X,Y) can display several distinct phenotypes, each characterized as a ratio of whole numbers (X:Y).  For example, the proportion of mitochondria to peroxisomes can be 10:1, 20:1, or 33:1.  Notice also that different pairs of structures can share similar proportions.  This overlap can be used to link the equation associated with each data pair into a local or global network of equations.  Such networks provide a substrate for connecting other data types and become a platform for diagnosis and prediction.  Table 1 shows the data set that would be the starting point for assembling such a network for peroxisomes.

 

Table 1  Equations representing the proportions of organelles.

Data Pair / Proportion        Decimal Repertoire Equation
-----------------------------------------------------------
Mitochondrion:Peroxisome
    33:1                      Y = 0.03493 X^0.9999
    20:1                      Y = 0.05459 X^0.9999
    10:1                      Y = 0.12318 X^0.9999
    25:1                      Y = 0.04442 X^0.9999
    25:2                      Y = 0.08504 X^0.9998
    14:1                      Y = 0.07455 X^0.9999
    17:1                      Y = 0.06483 X^1.000
Nucleus:Peroxisome
    5:1                       Y = 0.22397 X^0.9999
    10:1                      Y = 0.12318 X^0.9999
    7:1                       Y = 0.17464 X^0.9998
    14:1                      Y = 0.07455 X^0.9999
    5:2                       Y = 0.4468 X^0.9999
Lysosome:Peroxisome
    3:1                       Y = 0.34610 X^0.9998
    5:3                       Y = 0.64920 X^0.9998
    3:2                       Y = 0.74784 X^0.9999
    1:1                       Y = 1.19840 X^0.9996
    1:2                       Y = 2.22114 X^0.9999
    1:3                       Y = 3.44783 X^0.9998
Golgi:Peroxisome
    5:1                       Y = 0.22397 X^0.9999
    5:2                       Y = 0.44680 X^0.9999
    3:2                       Y = 0.74784 X^0.9999
    5:4                       Y = 0.84873 X^0.9999
    1:1                       Y = 1.1984 X^0.9996
    1:3                       Y = 3.44783 X^0.9998
Lipid Droplet:Peroxisome
    10:1                      Y = 0.12318 X^0.9999
    5:2                       Y = 0.44680 X^0.9999
    5:4                       Y = 0.84873 X^0.9999
    2:3                       Y = 1.72417 X^0.9999
    1:10                      Y = 12.1598 X^0.9997
    1:2                       Y = 2.22114 X^0.9999
    1:3                       Y = 3.44783 X^0.9998
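The idea of linking data pairs through shared proportions can be sketched directly from the entries of Table 1.  The example below is a minimal illustration, not the method used in the software: it simply groups the five organelle:peroxisome pairs by the proportions they have in common, and the variable names (table_1, pairs_by_proportion, shared) are hypothetical.  Any proportion listed for two or more pairs is a candidate link in a local network of equations.

```python
from collections import defaultdict

# Proportions (X:Y) listed in Table 1 for each organelle:peroxisome pair.
table_1 = {
    "Mitochondrion": ["33:1", "20:1", "10:1", "25:1", "25:2", "14:1", "17:1"],
    "Nucleus": ["5:1", "10:1", "7:1", "14:1", "5:2"],
    "Lysosome": ["3:1", "5:3", "3:2", "1:1", "1:2", "1:3"],
    "Golgi": ["5:1", "5:2", "3:2", "5:4", "1:1", "1:3"],
    "Lipid Droplet": ["10:1", "5:2", "5:4", "2:3", "1:10", "1:2", "1:3"],
}

# Invert the table: for each proportion, collect the organelles that show it.
pairs_by_proportion = defaultdict(list)
for organelle, proportions in table_1.items():
    for proportion in proportions:
        pairs_by_proportion[proportion].append(organelle)

# Keep only proportions shared by at least two data pairs.
shared = {p: names for p, names in pairs_by_proportion.items() if len(names) >= 2}

for proportion, names in shared.items():
    print(f"{proportion:>5}  " + ", ".join(names))
```

In this small data set, for instance, the 10:1 proportion connects the mitochondrion, nucleus, and lipid droplet pairs through a single decimal repertoire equation.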

 

Why is access to the connection repertoire blueprint important?  Recall that biology often uses a remarkably similar genome to produce a great variety of different animal species.  Given our current understanding, it appears that we are a product of at least two interacting forces: our genes and the way they and their products produce and assemble our parts. 

 

The decimal repertoire equations suggest that biology has evolved a common parts inventory that it draws from when assembling people, mice, frogs, or fish.  The connection repertoire table allows us to explore phenotypes as a function of their basic building blocks, namely the decimal repertoire equations.  By defining phenotypes mathematically, we can study their life history in a given species and detect departures from what is expected to be normal.  The table also moves us closer to the genome.  When, for example, the proportions of the parts match the proportions of their constituent molecules, we can predict one from the other.  Of course, we might discover that some decimal repertoire equations can be explained simply by determining the number of duplicate genes being read at a given time or that exist in a given species.  Think of it this way.  If genes individually cannot determine a species, then perhaps the number of copies of a given gene can.   

 

If we summarize the connection repertoire table with a histogram, then the full range of phenotypic expression in biology can be seen.  Notice that practically all the connections can be captured with only about 50 equations, with far fewer doing most of the work.  The graph below shows that the connections between the parts tend to define five major peaks, each showing a clear preference for a specific proportion. 

 

 

When we focus on the connections of a single structure, such as the mitochondrion, a slightly different pattern appears.  Although this organelle uses decimal repertoire equations from fewer peaks, the positions of the peaks remain more or less the same as they appear in the total data set.

 

 

Finally, we can use the connection repertoire to make a few preliminary observations about biological preferences.  Of the total entries (4,296), roughly 40% occur in six decimal repertoire equations (Table 2).   

 

Table 2  Decimal Repertoire - Total Data Set – Most Popular Equations and Proportions

Decimal Repertoire Equation    Sum     %       Proportion (X:Y)
----------------------------------------------------------------
0.02                           106     6.5     50 to 1
0.1                            237     14.6    10 to 1
0.3                            296     18.3    3 to 1
1.0                            469     29      1 to 1
1.5                            311     19      2 to 3
10                             200     12      1 to 10
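The arithmetic behind Table 2 can be checked with a short script.  The counts and the total of 4,296 entries come from the text and the table; the variable names are hypothetical.  Under this reading, the six equations together account for about 38% of all entries (the "roughly 40%" quoted above), and the % column appears to be calculated relative to the six-equation subtotal rather than to the full data set.

```python
# Counts per decimal repertoire equation, copied from Table 2.
counts = {"0.02": 106, "0.1": 237, "0.3": 296, "1.0": 469, "1.5": 311, "10": 200}
total_entries = 4296  # total entries in the connection repertoire

subtotal = sum(counts.values())
share_of_total = 100 * subtotal / total_entries

# Percentage of each equation within the six-equation subtotal.
percent_within = {eq: round(100 * n / subtotal, 1) for eq, n in counts.items()}

print(f"subtotal = {subtotal}, share of all entries = {share_of_total:.1f}%")
for eq, pct in percent_within.items():
    print(f"{eq:>5}  {pct:5.1f}%")
```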

 

When we consider just counts of neurons (Table 3), we find that almost 70% of the connections occur in six decimal repertoire equations that define only five proportions: 3 to 1, 2 to 1, 3 to 2, 1 to 1, and 2 to 3.  Notice that the proportions are largely ratios of small whole numbers – curiously reminiscent of biochemical stoichiometry and the law of multiple proportions. 

 

Table 3  Decimal Repertoire – Numbers of Neurons – Most Popular Equations and Proportions

Decimal Repertoire Equation    Sum     %       Proportion (X:Y)
----------------------------------------------------------------
0.3                            39      17      3 to 1
0.5                            18      8       2 to 1
0.7                            25      11      3 to 2
0.9                            24      11      ~ 1 to 1
1.0                            91      40      1 to 1
1.5                            30      13      2 to 3

  

Since both the central and peripheral nervous systems rely on tandem connections between neurons, disrupting cell proportions at any level may generate a variety of predictable consequences – upstream and down.  Unintended consequences also exist.  Last year, for example, the connection matrix for the lateral geniculate nucleus uncovered the disturbing fact that altering the genome of mice – at locations considered unrelated to the nervous system – can actually change the proportions of cells in the brain (Seecharan et al., 2003; Bolender, 2005). 

 

Tab 9 – UBD change: Select tab 9 and click on run.  The change data come from the design codes described previously (Bolender, 2003-2005).  In this screen, X identifies control or experimental data and Y experimental.  A ratio >1 indicates an increase (red), <1 a decrease (blue), and =1 no change (green).  Here the proportions are largely ratios of small whole numbers – once again reminiscent of the law of multiple proportions.  Finally, bear in mind that these decimal repertoire equations belong exclusively to change data.  
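The color coding of the change screen follows a simple rule that can be written out explicitly.  The sketch below assumes that the ratio is formed as Y/X (experimental value divided by its reference value), which the text implies but does not state outright; the function name classify_change and the sample values are hypothetical.

```python
def classify_change(x, y):
    """Return (ratio, direction, color) for a reference value x and an
    experimental value y, assuming the change ratio is y / x."""
    ratio = y / x
    if ratio > 1:
        return ratio, "increase", "red"
    if ratio < 1:
        return ratio, "decrease", "blue"
    return ratio, "no change", "green"

# Three illustrative value pairs (not taken from the database).
for x, y in [(100.0, 150.0), (200.0, 100.0), (80.0, 80.0)]:
    ratio, direction, color = classify_change(x, y)
    print(f"Y/X = {ratio:.2f}: {direction} ({color})")
```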

 

Let’s use this screen to see what exposures can change the hippocampus.  Type <like %hippo%> into the X Structure field.  Click on the Retrieve button and then on the show data button.  The screen below identifies the direction of the change and the conditions responsible.  Such a sort provides insights into the repertoire of change available to the hippocampus.  Notice, for example, how different conditions can produce both similar and different responses – in similar and different species.  For further information, see Puzzle 2: The Hippocampus (Bolender, 2005).

 

 

Scroll to the bottom of the screen.  Notice in the distribution histogram that most parts of the hippocampus change only slightly or not at all – a general pattern that persists thr