In: Enterprise Biology Software,
Version 6.0 © 2006 Robert P. Bolender
Enterprise Biology Software: VI. Research (2006)
Robert P. Bolender
Summary
The Enterprise Biology Software Project explores challenging questions in the life sciences by looking for answers in the biology literature. Since many of these questions can be translated into mathematical puzzles, they can be solved, for the most part, by generating empirical equations. In turn, these equations can produce new and more difficult puzzles, often extending to the mathematical core of biology. For example, the progress report this year describes how we can solve one of these harder puzzles by first separating two interacting complexities and then unfolding each complexity in turn. In effect, the question “How do we reverse engineer biology?” becomes a double puzzle: “How do we separate two complexities (methods and biology), and how do we unfold and refold them?” A solution to the methods puzzle becomes a Universal Biology Database, which in turn can be used to solve the biology puzzle by characterizing biological phenotypes with equations. Such phenotypes can be represented either as a single equation or as a stack of equations. When summarized across biology, these equations offer a glimpse of the core by revealing the mathematical organization of the parts as a biological blueprint. There is, however, one small surprise. In contrast to the widely held view that biology is largely a nonlinear system, most of these phenotypes turn out to be linear. This finding is most welcome in that it greatly simplifies the task of reverse engineering biology. By including a query by example interface, the 2006 software package offers ready access to this new data-driven biology and may become one of our technology tickets to the future. Imagine, if you will, a time when each research paper becomes a puzzle with a unique mathematical solution, where biology operates according to a rulebook, and we can read from that book. Interested? I invite you to run the software package.
Introduction
Biology plays by
the rules. By discovering these rules
and also playing by them, we improve our chances for success – in whatever we
attempt. This defines a central
strategy of the Enterprise Biology Software Project. Indeed, this process of discovery based on
solving mathematical puzzles is wonderfully simple and can be applied by
everyone. All we have to do is follow
the clues to find the rules that solve the puzzles. Each solution, of course, generates a new collection of clues and
the process continues from one puzzle to the next.
This rule-based approach has yielded two key pieces of information. We now know that repertoire equations can unfold complexity into a well-defined order, wherein these power equations have coefficients of determination approaching one (r^2 ≈ 1). Moreover, these equations often display an exponent a (Y = bX^a) that also tends toward one. This means that the repertoire equations effectively become linear (Y = bX).
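To make the idea of a power equation concrete, the sketch below fits Y = bX^a by ordinary least squares on log-transformed data and reports the exponent a, the coefficient b, and r^2. It is only an illustration: the function name fit_power_law and the sample points are hypothetical and do not come from the database.

```python
import math

def fit_power_law(xs, ys):
    """Fit Y = b * X**a by least squares on (log10 X, log10 Y).

    Returns (a, b, r2), where r2 is the coefficient of determination
    of the straight line in log-log space.
    """
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    syy = sum((y - mean_y) ** 2 for y in ly)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    a = sxy / sxx                      # slope in log-log space is the exponent
    b = 10 ** (mean_y - a * mean_x)    # intercept in log-log space is log10(b)
    r2 = (sxy * sxy) / (sxx * syy)
    return a, b, r2

# Hypothetical data pairs lying close to the line Y = 0.5 * X.
xs = [10.0, 100.0, 1000.0, 10000.0]
ys = [5.1, 49.0, 505.0, 4950.0]
a, b, r2 = fit_power_law(xs, ys)
print(f"a = {a:.4f}, b = {b:.4f}, r2 = {r2:.4f}")
```

When the fitted exponent a is close to one, the power equation reduces in practice to the linear form Y = bX, and b is simply the proportion of Y to X.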
Since the
repertoire equations define order in biology as the proportion of two parts X
and Y, we can assemble enough information to begin the process of reverse
engineering biology. Moreover, we can
design a Universal Biology Database with virtually unlimited scaling
properties by simply arranging the ratios of the parts (Y/X) in numerical order
and assigning them to decimal steps.
When fitted to regression curves, these data pairs become decimal
repertoire equations. The database
becomes universal because it can accept data from all disciplines capable of
forming ratios of structures, which of course includes molecules. By storing data from many different
disciplines in the same database table, the data become mathematically
integrated across disciplines, animals, settings, et cetera.
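The arrangement of ratios into decimal steps can be sketched in a few lines. The exact step boundaries used by the database are not given in this report (the tables list a 1.5 step, for instance), so the rule below, which truncates each ratio Y/X to its leading significant digit, is an assumption made purely for illustration. The function name decimal_step and the data pairs are likewise hypothetical.

```python
import math
from collections import defaultdict

def decimal_step(ratio):
    """Assign a positive ratio to a step by keeping its leading digit.

    ASSUMED rule for illustration only: 0.034 -> 0.03, 1.25 -> 1.0, 12.2 -> 10.0.
    """
    exponent = math.floor(math.log10(ratio))
    # The small epsilon guards against values such as 0.3 / 0.1 = 2.9999999999999996.
    leading = math.floor(ratio / 10.0 ** exponent + 1e-9)
    return round(leading * 10.0 ** exponent, max(0, -exponent))

# Hypothetical (X, Y) data pairs from different papers and disciplines.
pairs = [(100.0, 3.4), (250.0, 8.9), (40.0, 41.0), (10.0, 12.5), (500.0, 52.0)]

steps = defaultdict(list)
for x, y in pairs:
    steps[decimal_step(y / x)].append((x, y))

# Listing the steps in numerical order mirrors the layout of the table.
for step in sorted(steps):
    print(step, steps[step])
```

Fitting a regression line to the data pairs that share a step would then give that step's decimal repertoire equation.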
The software is
full of surprises. When running the
program from tab 8 of the Universal Biology Database, for example, we
can easily compare the proportions of the same two structures in control and
experimental settings. I was surprised
by how often controls and experimentals share the same decimal repertoire
equation – even when the absolute amounts differ. This tells us that biology prefers to make more or less of the
same thing, rather than making something different. Changes in the proportions of structures, however, can be readily
found in development, disease, and aging.
Perhaps the most reassuring finding of all, however, is the fact that
many different labs routinely produce the same results (i.e., the experimental
data share the same decimal repertoire equation(s)). This shows that both the methods and the researchers applying
them are indeed capable of generating reproducible results.
The new software
package comes with a new learning curve, but one that has a very gentle
slope. Specifically, by adding a “query
by example” interface we can write and send complex instructions to the
database. This provides full access to
the information with great ease and flexibility. Simply working through the examples should be all that is
required to become comfortable with the new format.
Although we’ll
consider these topics later in this report, a brief introduction here may be
helpful. Why do we want to reverse
engineer biology? The answer is
painfully simple. It encourages biology
to become a data-driven science. Recall
that chemistry is a product of physics and that biology is a product of both
physics and chemistry. Chemistry, for
example, is very effective as a data-driven science because it plays by a set
of stoichiometry rules. We know these
rules and can use them to explain how small parts (atoms) are connected to form
larger molecules and compounds and to write balanced equations for chemical
reactions. In other words, we have
access to the mathematical core of chemistry.
Do analogous rules and balanced equations exist for biology? Yes, of course they do. The only difference is that the rules of
biology are much better hidden than those of chemistry and must be extracted
from the research literature with the help of mathematics and technology. What evidence supports this claim? Biological stereology can provide the
balanced equations and a Universal Biology Database the stoichiometry of
the parts. In effect, we can produce a
biological blueprint that shows how the parts of biology larger than molecules
are connected by rule. As such, this
table provides the first of many new access routes to the mathematical core of
biology – exactly what we would expect from a data-driven science.
Methods and
Results
The main purposes of the Progress Report this year are (1) to introduce a new interface for the literature database and (2) to explore an engineering model (forward and reverse) for biological research. As always, the best introduction to the software package comes from running the programs.
The central feature
of the first universal database was a single table that could accommodate both
control and experimental data – even when coming from many different
disciplines (Bolender, 2005).
Effectively, it defined a simple yet powerful way of integrating diverse data within and across research fields. Unfortunately, most of the details, defined as data entry fields in the original databases, were no longer available. The new database remedies this shortcoming by connecting all three databases into a single operational unit.

By adding a “query by example” interface to the Universal Biology Database 2, even the beginner can quickly learn to write intricate SQL (structured query language) scripts and submit them to the database.

This year the
software is being distributed on a mini CD, one that fits conveniently in a
standard envelope. The CD includes a
relational database, a runtime database engine, and a small collection of programs
and documents.
The main menu of
the program includes the progress report and an index of programs. Click on an item in the list to view
it.

1. Progress Report: A PDF file (this document).
2. Universal Biology Database: The database folder includes eleven tabs - six of which can be used to run programs. A central feature of the new software consists of a query by example interface. It allows us to define a database query by simply selecting items from lists or by entering a word (or part thereof), number, or a collection of words or numbers into one or more data entry fields. As each selection is made, the SQL script – often shown at the bottom of the screen - is modified accordingly. When completed, the set of instructions (query) is sent to the database by clicking on the Retrieve button. The results can be viewed, printed, stored as a file, or sent to an Excel worksheet.


Tab 2 - Introduction: Click on <Read> for a brief introduction to the
database.
Tab 3 - Background:
Click on <Read> for an explanation of why stereological data can create a
foundation for the Universal Biology Database.
Tab 4 – UBD data table: Select tab 4 and click on run. This screen displays the main table of the Universal Biology Database 1.0 updated to 2.0. Green identifies control data and blue experimental. In the lower left-hand corner of the screen, just to the left of the horizontal scrolling arrow, note the thick black line. Using the mouse cursor, drag it to the right to produce a split screen. The Excel worksheet used to calculate the decimal repertoire equations can be found at C:\Program Files\EBSTicket 2006\Files\2006_regression_equations_01.xls.

Tab 5 – UBD control data: Select tab 5 and click on run. The screen displays the new query by example interface. Notice that the same data fields appear
under two different headings – data catalogue (left panel) and query by example
(right panel). Use the catalogue to
discover what the database contains and the query panel to assemble a set of
instructions for the database. Numerous
examples of query criteria appear in the drop down lists, illustrating the
remarkable flexibility of this approach.
Simply work through some of the examples to see how the interface works.

A few introductory
comments, however, may be helpful.
Items selected from either panel – catalogue or query – can be used in a
search. Items selected from the
catalogue will retrieve all those items identical to the one chosen. For example, selecting <bird> from the
catalogue panel and clicking on the Retrieve button produces 58
responses. In contrast, typing <like
%bird%> into the organism field of the query panel produces 110
responses. In the first case, only
those rows containing just the word <bird> were retrieved, whereas in the
second case all those entries containing the word <bird> alone or in a
word string were retrieved.
Furthermore, note that entries coming from the catalogue panel are case
sensitive (upper and lower case), whereas those from the query panel are not.
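The difference between a catalogue selection and a query panel entry corresponds to the difference between an exact match and a pattern match in SQL. The sketch below reproduces that behavior with SQLite, which stands in here for the runtime database engine shipped on the CD. The table name ubd, the column name organism, and the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ubd (organism TEXT)")  # hypothetical table and column
conn.executemany(
    "INSERT INTO ubd (organism) VALUES (?)",
    [("bird",), ("Bird",), ("bird, zebra finch",), ("songbird",), ("rat",)],
)

# Catalogue-style selection: the whole entry must match, and case matters.
exact = conn.execute(
    "SELECT organism FROM ubd WHERE organism = ?", ("bird",)
).fetchall()

# Query-panel style: the word may appear anywhere in the entry.
# SQLite's LIKE ignores ASCII case by default.
pattern = conn.execute(
    "SELECT organism FROM ubd WHERE organism LIKE ?", ("%bird%",)
).fetchall()

print(len(exact), len(pattern))
conn.close()
```

As in the bird example, the pattern match returns more rows than the exact match because it also picks up entries in which the word is embedded in a longer string or written with different capitalization.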
Let’s try a simple query together. The objective of our search will be to retrieve all the data coming from the CA1 region of the control hippocampus. As shown below, we have written our request in the X Name field <like %ca1%>, using the query panel. Notice how the SQL script in the box at the bottom of the screen reflects our choice.

Clicking on the Retrieve
button yields the result below, which includes a collection of 920 screens.

To view these data
as a scrolling screen instead, press the view data button and the
following screen appears.

When we scroll to
the end of this table, two graphs appear.
The first shows a log-log plot of the X, Y data and the second a
histogram of the decimal repertoire equations.
The log-log plot displays the published data of CA1 as a mathematical puzzle. Finding a solution consists of using regression analysis to fit the points to power curves that carry coefficients of determination (r^2) close to one. This process generates a family of power curves, called repertoire equations, that define how CA1 is related to itself and to other structures in the brain. Alternatively, the process can be simplified. Since each data ratio in the Universal Biology Database is attached to a decimal repertoire equation, a collection of ratios automatically becomes a stack of equations. Notice the distinct steps in the histogram of the second graph. They represent the repertoire equations as decimal steps and illustrate the total range of connections available to the CA1 region of the control hippocampus. Recall that each connection also defines the proportion of the two parts, X and Y.

To illustrate the
process of finding equations, we can send this data table to an Excel worksheet
and then do some curve fitting. This is
accomplished by clicking first on the to excel button and then on the from
excel button. To keep a copy of this worksheet, change the name from working.xls to something else. Otherwise, the file will be overwritten the next time you click on the to excel button. (To set the Excel path, click on INSTALL READER in the main menu.)

Tab 6 – UBD experimental data: Select tab 6 and click on run. Here the interface is the same as the one
just described for the control data.
Therefore, a brief example will suffice. Let’s look for all the data on schizophrenia in males over the
age of 60. The screen below shows how
we access these data, using our query by example interface.

The new puzzle shown below can be solved for the decimal repertoire equations, which in turn can be used to compare schizophrenia in this group with schizophrenia in males of other ages or in females, or with other diseases of the central nervous system. Alternatively, we can search for potentially related structures simply by viewing the contents of an individual decimal step in the data table.

Tab 7 – UBD control + experimental data: Select tab 7 and click on run. Here, the Universal Biology Database
table includes all the control and experimental data. Let’s try another query.
What is the effect of aging on mitochondria? For the X structure we can type in <like %aging%> and for
the Y structure <like %mito%>.
Our result is a collection of control and experimental points that can
be evaluated either by inspection or by fitting the points to log-log
regression lines (power curves).
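A query by example interface of this kind can be thought of as a small translator from filled-in fields to an SQL statement. The sketch below is a guess at the general mechanism, not the program's actual code. The function build_query, the table name ubd, and the field names x_structure and y_structure are hypothetical.

```python
def build_query(table, criteria):
    """Turn {field: criterion} into a parameterized SELECT statement.

    A criterion that starts with 'like ' becomes a LIKE pattern; anything
    else is treated as an exact match. Table and field names are assumed
    to come from a fixed, trusted list, as they would in a form.
    """
    clauses, params = [], []
    for field, criterion in criteria.items():
        text = criterion.strip()
        if text.lower().startswith("like "):
            clauses.append(f"{field} LIKE ?")
            params.append(text[5:].strip())
        else:
            clauses.append(f"{field} = ?")
            params.append(text)
    sql = f"SELECT * FROM {table}"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params

# The aging and mitochondria query, entered as two field criteria.
sql, params = build_query(
    "ubd", {"x_structure": "like %aging%", "y_structure": "like %mito%"}
)
print(sql)
print(params)
```

Each additional field that is filled in adds one more condition joined by AND, which is why the script shown on the screen grows as selections are made.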
Tab 8 – UBD connection repertoire - Blueprint 1.0 - Control Data: The figure below illustrates the biological blueprint as a three-dimensional plot. It represents a collection of equations defining the proportions of one structure to another. In effect, it gives us an empirical view of the mathematical core of biology.

Select tab
8 and click on run. The connection
repertoire table uses the decimal repertoire equations – expressed as
proportions of whole numbers - to show how biological parts are connected by
rule. In effect, it provides a
structural blueprint for biological parts larger than molecules in terms of a
well-defined stoichiometry. This
mathematical overview of biology shows how structural connections define
phenotypes and can provide insights into how, when, and where these phenotypes
change.
The connection repertoire table shown below identifies the structures in an X,Y pair and shows how they are connected quantitatively to each other and to related structures. To be included in the table, the same data pair must appear at least three times in the database. This represents a rigorous test of both the methods and the investigators in that the three data pairs typically come from three different papers.
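The inclusion rule can be expressed as a simple count. The sketch below assumes that "the same data pair" means the same X structure, Y structure, and decimal step. That reading, along with the rows themselves, is an assumption made for illustration.

```python
from collections import Counter

# Hypothetical rows: (X structure, Y structure, decimal step, source paper).
rows = [
    ("mitochondrion", "peroxisome", 0.1, "paper A"),
    ("mitochondrion", "peroxisome", 0.1, "paper B"),
    ("mitochondrion", "peroxisome", 0.1, "paper C"),
    ("nucleus", "peroxisome", 0.2, "paper A"),
    ("nucleus", "peroxisome", 0.2, "paper D"),
    ("golgi", "peroxisome", 1.0, "paper E"),
]

# Count how often each (X, Y, step) combination occurs, ignoring the source.
counts = Counter((x, y, step) for x, y, step, _ in rows)

# Keep only combinations that appear at least three times.
repertoire = {key: n for key, n in counts.items() if n >= 3}
print(repertoire)
```

Only the mitochondrion and peroxisome pair survives the filter here, because it is the only combination reported three times.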

Several
distinct patterns quickly emerge from this table. A given pair of structures (X,Y) can display several distinct
phenotypes, characterized as a multiple of whole numbers (X:Y). For example, the proportion of mitochondria
to peroxisomes can be 10:1, 20:1, and 33:1.
Notice also that different pairs of structures can share similar
proportions. This overlap can be used
to link the equation associated with each data pair into a local or global
network of equations. Such networks
provide a substrate for connecting other data types and become a platform for
diagnosis and prediction. Table 1 shows
the data set that would be the starting point for assembling such a network for
peroxisomes.
Table 1. Equations representing the proportions of organelles.
| Data Pair Proportion | Decimal Repertoire Equation |
| Mitochondrion:Peroxisome | |
| · 33:1 | Y = 0.03493X^{0.9999} |
| · 20:1 | Y = 0.05459X^{0.9999} |
| · 10:1 | Y = 0.12318X^{0.9999} |
| · 25:1 | Y = 0.04442X^{0.9999} |
| · 25:2 | Y = 0.08504X^{0.9998} |
| · 14:1 | Y = 0.07455X^{0.9999} |
| · 17:1 | Y = 0.06483X^{1.000} |
| Nucleus:Peroxisome | |
| · 5:1 | Y = 0.22397X^{0.9999} |
| · 10:1 | Y = 0.12318X^{0.9999} |
| · 7:1 | Y = 0.17464X^{0.9998} |
| · 14:1 | Y = 0.07455X^{0.9999} |
| · 5:2 | Y = 0.4468X^{0.9999} |
| Lysosome:Peroxisome | |
| · 3:1 | Y = 0.34610X^{0.9998} |
| · 5:3 | Y = 0.64920X^{0.9998} |
| · 3:2 | Y = 0.74784X^{0.9999} |
| · 1:1 | Y = 1.19840X^{0.9996} |
| · 1:2 | Y = 2.22114X^{0.9999} |
| · 1:3 | Y = 3.44783X^{0.9998} |
| Golgi:Peroxisome | |
| · 5:1 | Y = 0.22397X^{0.9999} |
| · 5:2 | Y = 0.44680X^{0.9999} |
| · 3:2 | Y = 0.74784X^{0.9999} |
| · 5:4 | Y = 0.84873X^{0.9999} |
| · 1:1 | Y = 1.1984X^{0.9996} |
| · 1:3 | Y = 3.44783X^{0.9998} |
| Lipid Droplet:Peroxisome | |
| · 10:1 | Y = 0.12318X^{0.9999} |
| · 5:2 | Y = 0.44680X^{0.9999} |
| · 5:4 | Y = 0.84873X^{0.9999} |
| · 2:3 | Y = 1.72417X^{0.9999} |
| · 1:10 | Y = 12.1598X^{0.9997} |
| · 1:2 | Y = 2.22114X^{0.9999} |
| · 1:3 | Y = 3.44783X^{0.9998} |
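One way to begin assembling a network from Table 1 is to link any two data pairs that share a decimal repertoire equation. The sketch below does this for a subset of the table, identifying each equation by its coefficient b. The linking rule is an illustrative reading of the overlap described in the text, not a procedure taken from the software.

```python
from collections import defaultdict
from itertools import combinations

# A subset of Table 1: (data pair, proportion, coefficient b of Y = b * X**a).
rows = [
    ("Nucleus:Peroxisome", "5:1", 0.22397),
    ("Nucleus:Peroxisome", "10:1", 0.12318),
    ("Lysosome:Peroxisome", "3:2", 0.74784),
    ("Lysosome:Peroxisome", "1:3", 3.44783),
    ("Golgi:Peroxisome", "5:1", 0.22397),
    ("Golgi:Peroxisome", "3:2", 0.74784),
    ("Golgi:Peroxisome", "1:3", 3.44783),
    ("Lipid Droplet:Peroxisome", "10:1", 0.12318),
    ("Lipid Droplet:Peroxisome", "1:3", 3.44783),
]

# Group the data pairs by the equation they use.
by_equation = defaultdict(set)
for pair, proportion, b in rows:
    by_equation[b].add(pair)

# Two data pairs are linked once for every equation they have in common.
edges = defaultdict(set)
for b, pairs in by_equation.items():
    for left, right in combinations(sorted(pairs), 2):
        edges[(left, right)].add(b)

for (left, right), shared in sorted(edges.items()):
    print(f"{left} <-> {right}: {sorted(shared)}")
```

In this subset the Golgi and lysosome pairs share two equations (3:2 and 1:3), which makes theirs the strongest link in the local network around the peroxisome.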
Why
is access to the connection repertoire blueprint important? Recall that biology often uses a remarkably
similar genome to produce a great variety of different animal species. Given our current understanding, it appears
that we are a product of at least two interacting forces: our genes and the way
they and their products produce and assemble our parts.
The
decimal repertoire equations suggest that biology has evolved a common parts
inventory that it draws from when assembling people, mice, frogs, or fish. The connection repertoire table allows us to
explore phenotypes as a function of their basic building blocks, namely the
decimal repertoire equations. By
defining phenotypes mathematically, we can study their life history in a given
species and detect departures from what is expected to be normal. The table also moves us closer to the
genome. When, for example, the
proportions of the parts match the proportions of their constituent molecules,
we can predict one from the other. Of
course, we might discover that some decimal repertoire equations can be
explained simply by determining the number of duplicate genes being read at a
given time or that exist in a given species.
Think of it this way. If genes
individually cannot determine a species, then perhaps the number of copies of a
given gene can.
If we summarize the connection repertoire table with a histogram, then the full range of phenotypic expression in biology can be seen. Notice that practically all the connections can be captured with only about 50 equations, with far fewer doing most of the work. The graph below shows that the connections between the parts tend to define five major peaks, each showing a clear preference for a specific proportion.

When
we focus on the connections of a single structure, such as the mitochondrion, a
slightly different pattern appears.
Although this organelle uses decimal repertoire equations from fewer
peaks, the positions of the peaks remain more or less the same as they appear
in the total data set.

Finally,
we can use the connection repertoire to make a few preliminary observations as
to the biological preferences. Of the
total entries (4,296), roughly 40% occur in six decimal repertoire equations
(Table 2).
Table 2. Decimal Repertoire, Total Data Set: Most Popular Equations and Proportions
| Decimal Repertoire Equation | Sum | % | Proportion (X:Y) |
| | 106 | 6.5 | 50 to 1 |
| 0.1 | 237 | 14.6 | 10 to 1 |
| 0.3 | 296 | 18.3 | 3 to 1 |
| 1.0 | 469 | 29 | 1 to 1 |
| 1.5 | 311 | 19 | 2 to 3 |
| 10 | 200 | 12 | 1 to 10 |
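The arithmetic behind Table 2 can be checked directly. The six sums add up to 1,619 entries, or about 37.7% of the 4,296 total, which is the "roughly 40%" quoted in the text. The % column appears to be computed relative to that six-row subtotal rather than to the full total. This is an inference from the numbers, not something the report states.

```python
TOTAL_ENTRIES = 4296                      # total entries quoted in the text
sums = [106, 237, 296, 469, 311, 200]     # "Sum" column of Table 2

subtotal = sum(sums)
share_of_total = 100 * subtotal / TOTAL_ENTRIES

# Assumed reading: each % value is the row's share of the six-row subtotal.
shares_of_subtotal = [round(100 * s / subtotal, 1) for s in sums]

print(subtotal)
print(round(share_of_total, 1))
print(shares_of_subtotal)
```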
When
we consider just counts of neurons (Table 3), we find that almost 70% of the
connections occur in six decimal repertoire equations that define only five
proportions: 3 to 1, 2 to 1, 3 to 2, 1 to 1, and 2 to 3. Notice that the proportions are largely
ratios of small whole numbers – curiously reminiscent of biochemical
stoichiometry and the law of multiple proportions.
Table 3. Decimal Repertoire, Numbers of Neurons: Most Popular Equations and Proportions
| Decimal Repertoire Equation | Sum | % | Proportion (X:Y) |
| | 39 | 17 | 3 to 1 |
| 0.5 | 18 | 8 | 2 to 1 |
| 0.7 | 25 | 11 | 3 to 2 |
| 0.9 | 24 | 11 | ~ 1 to 1 |
| 1.0 | 91 | 40 | 1 to 1 |
| 1.5 | 30 | 13 | 2 to 3 |
Since both the central and peripheral nervous systems rely on tandem connections between neurons, disrupting cell proportions at any level may generate a variety of predictable consequences, both upstream and downstream.
Unintended consequences also exist.
Last year, for example, the connection matrix for the lateral geniculate
nucleus uncovered the disturbing fact that altering the genome of mice – at
locations considered unrelated to the nervous system – can actually change the
proportions of cells in the brain (Seecharan et al., 2003; Bolender,
2005).
Tab 9 – UBD change: Select tab 9 and click on run. The change data come from the design codes described previously
(Bolender, 2003-2005). In this screen,
X identifies control or experimental data and Y experimental. A ratio >1 indicates an increase (red),
<1 a decrease (blue), and =1 no change (green). Here the proportions are largely ratios of small whole numbers –
once again reminiscent of the law of multiple proportions. Finally, bear in mind that these decimal
repertoire equations belong exclusively to change data.
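The color coding of the change screen amounts to a three-way comparison of the ratio with one. A minimal sketch follows. The function name classify_change and the tolerance used to decide when a ratio counts as exactly one are assumptions.

```python
import math

def classify_change(ratio, tolerance=1e-9):
    """Map a change ratio (Y/X) to its direction and screen color."""
    if math.isclose(ratio, 1.0, abs_tol=tolerance):
        return "no change", "green"
    if ratio > 1.0:
        return "increase", "red"
    return "decrease", "blue"

# Hypothetical ratios of experimental to control values.
for ratio in (1.5, 1.0, 0.67):
    print(ratio, classify_change(ratio))
```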
Let’s use this screen to see what exposures can change the hippocampus. Type <like %hippo%> into the X Structure field. Click on the Retrieve button and then on the show data button. The screen below identifies the direction of the change and the conditions responsible. Such a sort provides insights into the repertoire of change available to the hippocampus. Notice, for example, how different conditions can produce both similar and different responses in similar and different species. For further information, see Puzzle 2: The Hippocampus (Bolender, 2005).

Scroll to the bottom of the screen. Notice in the distribution histogram that most parts of the hippocampus change only slightly or not at all – a general pattern that persists thr