In: Enterprise Biology Software,
Version 4.0 © 2004 Robert P. Bolender
Enterprise Biology Software: V. Research (2004)
Robert P. Bolender
Summary
The Enterprise
Biology Software Project explores new approaches to discovery in the
life sciences by applying mathematics and technology to the biology
literature. This process includes (1)
standardizing the literature by moving research data from the pages of journals
into the tables of a relational database, (2) generating derived data libraries
from the database, (3) searching these new libraries for mathematical patterns,
and (4) distributing the databases, libraries, software, and observations
freely to contributing authors - on a CD.
The current release includes an updated stereology literature database,
new libraries (repertoire, analogy, drill-down, and ladder), a progress
report, and several new findings. In
short, the libraries continue to uncover widespread patterns of mathematical
order in biology. These patterns appear
not as properties of individual structures, but rather as connections between
structures. Connections, for example,
define repertoires of equations that form networks. In the report, we consider how these equations might help us
decipher the genetic regulatory networks.
Introduction
Background
The results of the Enterprise Biology Software Project
suggest that the mathematical core of biology is well hidden by several
factors, including the bias of our experimental methods, the fluctuation of
biological systems between equilibrium and nonequilibrium states, and the
confounding nature of change. Since the
success of the project depends on accessing this mathematical core, all these
issues had to be dealt with locally and globally - in both control and
experimental settings. This was
accomplished by storing published research data in a relational database and
then generating libraries of standardized data from it (Bolender,
2001-2004).
As the database
grew, it became apparent that these new libraries (data pairs; design codes)
could generate additional libraries consisting of equations. The equation libraries created a new
opportunity for exploring biology as a mathematical puzzle, one that could be
solved one step at a time by following the clues accompanying each new
library. In effect, we now know how to
translate the biology literature into collections of standardized
equations. The challenge for the reader
and for the writer as well is to find the clues and then use them to solve
additional pieces of the biology puzzle.
How do we find the
clues? Last year, you may recall that
the ladder equation library showed us that biological data expressed first as
data pairs and then as power equations could be summarized by a single
exponential equation. This suggested that
all the parts of biology might be ordered by a single rule. The major clue to come from that exercise
was that order could be found as power equations with $r^2 \geq 0.999$ by sorting the data
pairs as ratios (y/x) three or more rows at a time. In other words, we learned how to convert data pairs into
equations - quickly. The next clue came
from the observation that the rung equations grouped the data pairs at distinct
levels, resembling quantum steps. A
similar clue came from the design code library when it showed us that change
behaved quite unexpectedly - like a constant. Finally, the example of unfolding complexity by shifting our
perspective from a zero to a one-dimensional platform provided the catalyzing
clue.
Armed with so many
clues, the next piece of the puzzle was in easy reach. Let us begin with the fourth clue, the one
that recommended a shift in our perspective.
The promise here was that changing the platform for viewing data can
change what we see. Consider this. What would happen if, instead of looking at
biology through our eyes, we looked at the same biology through the eyes of
one of its parts? Although such a
question seems a bit odd at first, it turns out to be a quite good one because
it takes us to an interesting result.
If we assume, for convenience, that any given part sees all those
parts to which it is connected mathematically, then we can share the
perspective of that part by simply writing the appropriate equations. The result becomes interesting to the larger
community if the parts found to be connected mathematically turn out to have
been produced by a similar genetic regulatory network. If true, then we have found a relatively
simple way of reverse engineering these otherwise elusive networks. In short, the perspective clue, supported
by the first three, produced a new library (repertoire), one that may be
offering us our first glimpse of what organelles and cells may be
seeing. Along with our glimpse,
however, comes a view of biology that, if confirmed, suggests a level of
complexity for genetic regulation that may help to explain why genomes can do
so much more than just code for proteins.
The other new
libraries are more or less straightforward.
One uses equations to suggest a strategy for connecting larger
structures to molecules and genes (drill-down), whereas the other (analogy)
employs equations to hunt for similarities in the biology literature -
mathematically. Finally, a ladder
equation library was added for experimental data.
Progress
Currently,
the major activities of the Enterprise Biology Software Project
consist of entering data and generating new libraries. Research data submitted by authors were
added to the database and sent back as part of a technology package, as shown
below.
Start
↓
Research Data (reprints contributed by authors)
↓
Standardized Data (stored in a relational database)
↓
Data Pair and Design Code Libraries (connected data sets; minimized bias)
↓
Equation Libraries (mathematical patterns - local and global)
↓
Enterprise Biology Software Package
↓
Sent to Contributing Authors
First,
data were moved from research papers into a relational database and
standardized. For each paper, all the
data at a given hierarchical level of size (1 to 16) were then used to form
pairs of data, which juxtaposed control vs. control (data pairs) or control vs.
experimental (design codes). These two
basic data libraries were stored as database tables and used to generate the
equations that populate the derived data libraries. Writing equations consisted of filtering and sorting the data,
sending the results to an Excel worksheet, and plotting the x and y values as
power equations ($y = bx^a$).
Finding these equations was simplified by sorting the ratio y/x from low
to high, and looking for an $r^2$ of 0.999 or better with three or more
rows of data. The resulting graphs were
stored in Excel files and included with the software upgrade. The libraries, old and new, are listed in
Table 1.
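The fitting step described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's Excel workflow: it assumes that the power equation is fitted by ordinary least squares on log-transformed values and that $r^2$ is computed on those transformed values. The function name `fit_power` and the sample data pairs are hypothetical.

```python
import math

def fit_power(xs, ys):
    """Fit y = b * x**a by least squares on (log10 x, log10 y).

    Returns (b, a, r2), where r2 is computed on the log-transformed values.
    """
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    syy = sum((v - mean_y) ** 2 for v in ly)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    a = sxy / sxx                        # slope of the log-log line
    b = 10 ** (mean_y - a * mean_x)      # intercept, back-transformed
    r2 = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return b, a, r2

# Hypothetical (x, y) data pairs; the numbers are invented for illustration.
pairs = [(2.0, 1.1), (8.0, 4.0), (4.0, 2.05), (16.0, 30.0)]
pairs.sort(key=lambda p: p[1] / p[0])    # sort by the ratio y/x, low to high
first_three = pairs[:3]
b, a, r2 = fit_power([x for x, _ in first_three], [y for _, y in first_three])
print(f"y = {b:.4f} * x^{a:.4f}, r2 = {r2:.5f}, keep = {r2 >= 0.999}")
```

Sorting by y/x before fitting places rows with similar proportions next to one another, which is why three adjacent rows so often fall on one line.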
Libraries: Libraries serve as
discovery platforms (Table 1). They
include one or more user interface screens, data, help files, and worked
examples (e.g., Excel worksheets; case studies). In effect, each new library helps to solve another piece of the
biology puzzle.
Table 1. Enterprise Biology Software Libraries.
| Library | Data | Entries | Applications |
| --- | --- | --- | --- |
| Standardized Stereology Literature | | | |
| · Citation search | original | 12,853 | Find references |
| · Citation by paper contl | original | 1,024 | Print paper contl data |
| · Citation by paper contl + exptl | original | 6,438 | Print paper contl + exptl data |
| · Methods search SQL script | original | 1,951 | Find papers by methods |
| · Control Data | original | 15,521 | |
| · Experimental Data | original | 9,677 | |
| · Contl data by data point | original | 12,164 | Find data by data point; level |
| · Contl+Exptl data by data point | original | 7,284 | Find data by data point; level |
| · Percentage change data | derived | 7,018 | Find data by change; level |
| · Phenotype data | original | 7,018 | Find data across 14 levels |
| Connection Map | | | |
| · Type 1 (2str/2+points/1level/1paper) | derived | 182 | Find connections/minimize bias |
| · Type 2 (2+str/2+points/1level/1paper) | derived | 81 | Find connections/minimize bias |
| · Type 3 (2+str/2+points/1+levels/1paper) | derived | 323 | Find connections/minimize bias |
| · Type 4 (data pairs) | derived | 22,445 | Find connections/minimize bias |
| Data Replicator | | | |
| · One from one (data from 1 paper) | derived | 702 | Predict data |
| · Many from one (data from 1+ papers) | derived | 27 | Predict data |
| Biological Algorithm | | | |
| · Connections upstream and down | derived | 458 | Predict organs and organisms |
| Data Pair | | | |
| · Global (data from 1+ papers) | derived | 112 | Find connections/minimize bias |
| Design Code | | | |
| · Local (data from 1 paper) | derived | 2,398 | Identify and predict change |
| · Global (data from 1+ papers) | derived | 58 | Identify and predict change |
| Ladder Equation (Data Pairs) | | | |
| · Total data pairs | derived | 25 | Generalize structure in biology |
| · Organ | derived | 19 | Generalize structure by organ |
| · Cell | derived | 19 | Generalize structure by cell |
| · Organelle | derived | 22 | Generalize structure by organelle |
| New for 2004 | | | |
| Repertoire (Data Pairs) | | | |
| · Organelles, inclusions, and cells | derived | 771 | Find connections; Make predictions |
| Ladder Equation (Design Codes) | | | |
| · Total design codes | derived | 25 | Generalize structure in biology |
| Analogy (Design Codes) | | | |
| · Selected design codes | derived | 140 | Look for similar changes |
| Drill-Down (Design Codes) | | | |
| · Selected design codes | derived | 183 | Simplify complexity |
Figure 1 indicates that the stereology literature database currently includes 60,000 data entries, of which more than half represent derived data. This community resource offers abundant opportunities for finding connections between and among the many parts that define biology.

Figure 1. Research data stored in the stereology literature database.
Results: The principal findings
of the project are listed below.
2001 to 2003
· Biological data can be transferred from
research papers to a relational database and standardized.
· The production database demonstrates the
feasibility of creating an electronic literature for the life sciences.
· A connection model for research biology
yields widespread mathematical patterns, whereas the traditional change model
does not.
· When stored in a database, published
research data serve as a key resource for producing derived data.
· Biological data are subject to an uncertainty
principle and therefore carry an unknown bias.
· Libraries can be designed that minimize bias
(data pairs, design codes).
· Structures in biology are connected by rule
(connection model).
· Algorithms can be written that generate
organs and organisms from a single seed value.
· Complex research data can be unfolded by
viewing data from a higher dimension.
· Relationships of structure to function can
be expressed mathematically.
· Change in biology can be generalized and predicted.
· More than twenty thousand connections
between structures in biology can be summarized by a single exponential
equation.
New for 2004
· The organization of biological parts can be
defined explicitly as repertoires of equations.
· The repertoire library views connections
from different structural perspectives by forming networks of equations.
· Experimental data can be summarized by a
single exponential equation, as shown earlier for control data.
· Mathematical analogies can encourage
serendipity.
· Drilling down into a data set can reveal the
presence of nested equations.
· Optimization may be a first principle of
biology.
Repertoire Library
(Data Pairs)
The repertoire
library offers views of the same published data from many different
perspectives by transforming tabular data into networks of equations. These networks, which show how structures
are connected mathematically, are displayed as collections of power
equations.
Organelles: If we imagine that each of the many parts
of biology sees its world from a unique perspective, then what might we learn
by sharing these different perspectives?
For example, what parts of a cell would be of the greatest interest to a
mitochondrion? One way of answering
such a question is to list all those parts to which a mitochondrion has a
mathematical connection. In other
words, we can define a mitochondrial perspective as a family of power equations
($y = bx^a$) wherein mitochondria (expressed as a volume or surface)
would be the x variable:
$y(\text{organelle } i) = b\,x(\text{mitochondria})^{a}$.
Generating these
equations is a simple task, using the literature database. Start with the data pairs table (included
with the current EBS upgrade), type <mito> into the x name field, press
Enter, click on the sort y radio button, and save the results as an Excel
file. In Excel, notice that all the
terms in the x name column begin with mito,
whereas the organelles in the y
name column carry different names sorted alphabetically. Start with Golgi and sort the x/y column
numerically (low to high). Next,
highlight the first three data pairs of Golgi and select a scatter graph. Change the x and y axes to logs and fit the
points with a power regression line. If
the $r^2$ is greater than 0.999, add extra points (row by row) to the
graph until the $r^2$ comes close to 0.999. If not, move the calculation box down one row at a time until the
$r^2$ becomes greater than or equal to 0.999. The power equation that appears on the graph describes a
mathematical connection between mitochondria and Golgi. Repeat the process for all the organelles
and inclusions connected to mitochondria and the resulting set of equations
becomes the mitochondrial perspective.
(Note that for each graph, as shown in Figure 2, there is the
corresponding Excel worksheet.)
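The row-by-row procedure just described (start with three rows, extend the set while the fit holds, otherwise move the calculation box down) can be sketched as a small search routine. This is one reading of the procedure, offered as an assumption rather than the author's exact method: a window is extended for as long as $r^2$ stays at or above 0.999. The names `r2_loglog` and `find_connection`, and the sample rows, are hypothetical.

```python
import math

R2_MIN = 0.999  # threshold quoted in the text

def r2_loglog(rows):
    """r2 of a straight-line fit to (log x, log y) for a list of (x, y) rows."""
    lx = [math.log(x) for x, _ in rows]
    ly = [math.log(y) for _, y in rows]
    n = len(rows)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    syy = sum((v - my) ** 2 for v in ly)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    if sxx == 0 or syy == 0:
        return 0.0  # degenerate window: no usable line
    return (sxy * sxy) / (sxx * syy)

def find_connection(rows, size=3):
    """Return (start, stop) row indices of the first window meeting R2_MIN, or None.

    The window starts with `size` rows, slides down one row at a time until
    the fit is good enough, then grows while the fit stays good enough.
    """
    start = 0
    while start + size <= len(rows):
        if r2_loglog(rows[start:start + size]) >= R2_MIN:
            stop = start + size
            while stop < len(rows) and r2_loglog(rows[start:stop + 1]) >= R2_MIN:
                stop += 1
            return start, stop
        start += 1
    return None

# Hypothetical (x, y) rows, sorted by the ratio y/x as in the worksheet.
rows = [(3.0, 1.0), (2.0, 1.0), (4.0, 2.0), (8.0, 4.0), (3.0, 3.0)]
rows.sort(key=lambda r: r[1] / r[0])
print(find_connection(rows))
```

Calling the routine again on the rows after `stop` would yield the next equation for the same organelle, which is how a single organelle can end up with several connections.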

1) BIOLOGYtabs 4.4. 2) Excel worksheet.
Figure 2. The repertoire library for mitochondria consists of a set of power equations. Calculations can be viewed by opening an Excel worksheet.
In turn, these
equations, shown above as lines, can be programmed as computed fields on a
work screen wherein a given value for a mitochondrion will generate the
expected values for all the connected structures.

Figure 3. The repertoire library can also display the equations illustrated as regression lines in Figure 2 as an interactive network of equations. Apply a change to mitochondria and see how the connected organelles respond.
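The idea of computed fields can be illustrated with a short sketch: each connected organelle is represented by one power equation, and a single mitochondrial value generates all the expected values. The coefficients below are invented for illustration and are not taken from the library. The sketch also assumes one equation per organelle, whereas the library can hold several.

```python
# Hypothetical repertoire: organelle name -> (b, a) of y = b * x**a,
# where x is the mitochondrial value. Coefficients are illustrative only.
REPERTOIRE = {
    "Golgi": (0.12, 1.00),
    "lysosomes": (0.05, 0.98),
    "peroxisomes": (0.03, 1.02),
}

def respond(x_mito, repertoire=REPERTOIRE):
    """Return the expected value of each connected structure for a given x."""
    return {name: b * x_mito ** a for name, (b, a) in repertoire.items()}

# Apply a change to mitochondria and see how the connected organelles respond.
for x in (0.20, 0.30):
    print(x, respond(x))
```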
The first thing we
can learn from this perspective is that mitochondria share connections with
several cytoplasmic organelles. Notice,
however, that mitochondria display multiple connections with the same organelle,
each one being identified by a unique power equation. Similar patterns of order appear when
perspectives are generated for the remaining organelles and inclusions. In other words, the proportion of organelles
one to another typically appears as a collection of fixed
relationships. In effect, the library
includes a repertoire (catalogue) of mathematical connections for organelles
that can occur across biology. It tells
us that organelles are connected by rule and these rules can be captured by
equations.
The repertoire
library for organelles and inclusions contains 513 power equations typically
with $r^2$ values equal to or better than 0.999. If organelles maintain these stoichiometric-like proportions when
an equilibrium state is reached after a change, then we can expect that a
change in mitochondria will be accompanied by predictable changes in its
connected neighbors. However, each of
its neighbors, be it an organelle or inclusion, has its own set of connections. This means that an intricate web of
connections could translate the change we typically observe in a
single organelle into a widespread effect.
In such a setting, change would enjoy an unexpected wealth of
complexity. Recall that complexity may
be nature's way of producing new emergent properties (Bolender, 2001).
Notice that Figure
3 presents us with a challenging problem along with a new set of clues. If we imagine that the connections among the
downstream products of a genetic regulatory network can serve as quantitative
markers of that network, then how would we write a set of equations describing
the network for a specific cell in a specific setting? The solution to such a problem consists of
selecting equations from each of the organelle columns that can describe appropriately
a specific cell type in a specific setting.
In turn, this connected set of equations would be expected to predict
the relative proportions of organelles in the specific cell. Our reward for solving this problem might be
a compound mirror of equations with which to view earlier genetic activity,
not unlike the way an astronomer looks at stars. The question, of course, is how do we select the appropriate
equation(s) from each of the organelle columns?
The answer may be
as simple as moving from one dimension to another. Consider the following experiment. Estimate the densities of the eight organelles in the specific
cell type and calculate the ratio y/x for each data pair to get the proportions
of the organelles ($y(\text{organelle}) = b\,x(\text{mitochondrion})^{a}$). Finally, compare the new proportions to
those in Figure 3 and select the closest match. In effect, we can use Figure 3 as a lookup table. More importantly, perhaps, we now know how
to turn a zero-dimensional data point (y/x) without connections into a one-dimensional
power equation with connections.
In other words, by using the repertoire equations as a lookup table, we
can transform an information-poor data point into an information-rich line with
surprisingly little effort. Indeed,
such dimensional shifts may offer a host of new products and clues.
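The lookup step can be sketched as a nearest-match search. This is a minimal illustration under two assumptions: the "closest match" is the equation whose predicted ratio y/x at the measured x differs least from the measured ratio, and the difference is taken as an absolute difference. The candidate coefficients and the function name `closest_equation` are hypothetical.

```python
def closest_equation(x, y, candidates):
    """Pick the (b, a) pair whose predicted y/x at x is nearest the measured y/x.

    For y = b * x**a the predicted ratio is y/x = b * x**(a - 1).
    """
    measured = y / x
    return min(candidates, key=lambda eq: abs(eq[0] * x ** (eq[1] - 1.0) - measured))

# Hypothetical stack of equations for one organelle column (b, a).
golgi_column = [(0.10, 1.0), (0.20, 1.0), (0.40, 1.0)]

# A single measured data point: mitochondria x = 10, Golgi y = 2.3.
b, a = closest_equation(10.0, 2.3, golgi_column)
print(f"selected equation: y = {b} * x^{a}")
```

Because the intercepts of the published equations are spaced in steps, comparing ratios on a logarithmic scale may be a better choice in practice; the absolute difference is used here only for simplicity.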
Cells: The repertoire library for cells identifies mathematical
connections between cells and, as just described for organelles, the power
equations display the step-like pattern.
However, the cell-to-cell relationships are complex. The proportion of cells can be identified
within and across species as (1) one cell to one cell or (2) one cell to many
cells. For example, the proportion of
pulmonary endothelial cells to type II cells in the mouse, goat and rat is the
same (one to one), but this same proportion is also shared between endothelial
cells and fat-storing, interstitial, Kupffer, macrophage, mesenchymal, and
glial cells (one to many). The
proportion is expressed as $y = 0.2693x^{0.9973}$, where x = endothelial
cell and y = cell i. Such a result,
which persists for many combinations of cells, suggests that proportions of
cells are ordered by rule and that these rules are being conserved across species. Although the genetic mechanism responsible
for controlling cell proportions is unknown, at least we now know that such
proportions exist and that they can be quantified.
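A quick evaluation of the equation quoted above shows why it behaves like a fixed proportion. With an exponent this close to one, the ratio y/x barely moves over several orders of magnitude. The x values below are arbitrary, since the text does not state the units, and the function name is hypothetical.

```python
def expected_cell(x_endothelial, b=0.2693, a=0.9973):
    """Evaluate y = b * x**a with the coefficients quoted in the text."""
    return b * x_endothelial ** a

# Arbitrary x values spanning three orders of magnitude.
for x in (1.0, 10.0, 100.0, 1000.0):
    y = expected_cell(x)
    print(f"x = {x:7.1f}  y = {y:9.4f}  y/x = {y / x:.4f}")
```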
Global View: The
repertoire library was also used to look for general patterns of organization. To generate global views for organelles and
cells, the power equations were fitted to exponentials as described earlier for
ladder equations (Bolender, 2003). When
plotted as a group (without regression lines) they offer a striking view of
organelle connections (Figure 4). Each
blue point represents the y intercept of a power equation and each stack of
points a different organelle view. The
figure suggests that the cellular mechanism responsible for defining the
organellar composition of cells operates by discrete steps (quanta) and that
the underlying principles may be related, as suggested by the similar ranges
and slopes of the exponential stacks.
For those readers interested in exploring the organellar organization of
cells, two questions quickly fall into focus.
When and why do organelles locate at specific locations in the
stack?

Figure 4. When the y intercepts of power equations are fitted to exponential equations, a global pattern of order can be seen. Connections between organelles appear to be ordered by rule.
Perspective: The repertoire library offers a global view
of structural order by connecting local equations. It shows that different animals can produce remarkably similar
parts, but that the proportion of these parts one to another can be either
similar or different. The fact that
different animals can share similar connections and similar proportions of
parts, suggests that they might also be sharing a similar genetic blueprint or
simply following a design strategy that leads to a similar phenotype. This means that each set of equations
attached to a specific part reflects the general rules guiding the
establishment of such relationships. In
other words, the equations would seem to offer a broad overview of a
fundamental organizing principle of biology.
If this proves to be the case, then it would not seem unreasonable to assume that we can assemble sets of equations for specific animals. Such an accomplishment would be accompanied by a substantial improvement in our ability to predict a large number of parts from one or a few seed values. Recall that the current form of the repertoire library makes no attempt to connect data across hierarchical levels; organelles and cells are treated separately. However, the only factor limiting such connections seems to be the amount of data available. In the future, large farms of equations spanning many hierarchical levels are likely to become commonplace.
Ladder Equation Library
(Design Codes)
Last year, ladder
equations were reported for the data pair library (control vs. control). By increasing the number of entries in the
design code library (control vs. experimental) to 2,400, a similar albeit
provisional estimate was also made for the experimental data.
If we start with
the 2,400 design codes in the literature database, form ratios (structure
y/structure x), sort the ratios (ascending), and collect sets of ratios that
give power curves with an $r^2 = 0.9999$, we can generate a set of 23
equations describing the design codes.
Since the slopes (a) of these power curves also tended to be close to
one, the y intercept (b) of each equation served to identify a unit of order. In turn, when the y intercepts were plotted
as if they were rungs on a ladder, a single exponential equation of the
form $y = e^{xa}$, the ladder equation, appeared. This means that we can summarize the
experimental data set of more than 2,400 entries with a single exponential
equation having an $r^2 = 0.9991$:
$y = 0.1194e^{0.1354x}$,
where y is the y
intercept of the power equation and x the rung number.
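The ladder equation above can be evaluated and refitted with a short sketch. The intercepts used here are synthetic: they are generated from the quoted equation rather than taken from the published rungs, and numbering the rungs from 1 is an assumption. The fit uses least squares on ln(y), which is a common way to fit an exponential and may differ from the procedure used in the worksheets.

```python
import math

def ladder(rung, b=0.1194, k=0.1354):
    """Ladder equation quoted in the text: y = 0.1194 * exp(0.1354 * rung)."""
    return b * math.exp(k * rung)

def fit_exponential(xs, ys):
    """Fit y = b * exp(k * x) by least squares on ln(y). Returns (b, k)."""
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ly) / n
    k = sum((x - mx) * (v - my) for x, v in zip(xs, ly)) / sum((x - mx) ** 2 for x in xs)
    return math.exp(my - k * mx), k

rungs = list(range(1, 24))                 # 23 rungs, numbered from 1 (an assumption)
intercepts = [ladder(r) for r in rungs]    # synthetic intercepts, not the published data
b, k = fit_exponential(rungs, intercepts)
print(f"recovered: y = {b:.4f} * exp({k:.4f} * x)")
```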
Analogy Library (Design Codes)
In biology,
interpreting the results of an experiment often includes reasoning by
analogy. To wit, resemblances imply
similarities. The analogy library takes
this convention one step further by employing power equations as mathematical
analogies. For example, if we detect a
specific amount of change in a structure (e.g., mitochondria) and want to know
where a similar change has occurred elsewhere, the library can provide such
information.
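A simplified stand-in for such a search is sketched below. It does not reproduce the library's use of power equations. Instead, it assumes that "a similar change" can be approximated by a similar experimental-to-control ratio within a relative tolerance. The record layout, the sample values, and the function name `similar_changes` are all hypothetical.

```python
def similar_changes(codes, target_ratio, tol=0.05):
    """Return the design codes whose exptl/control ratio is within tol of the target."""
    return [
        c for c in codes
        if abs(c["exptl"] / c["control"] - target_ratio) <= tol * target_ratio
    ]

# Hypothetical design codes (control vs. experimental); values are invented.
codes = [
    {"name": "mitochondria, liver", "control": 2.0, "exptl": 3.0},
    {"name": "SER, liver", "control": 1.0, "exptl": 1.52},
    {"name": "lysosomes, kidney", "control": 1.0, "exptl": 2.0},
]

# Where else has a 1.5-fold change occurred?
for hit in similar_changes(codes, target_ratio=1.5):
    print(hit["name"])
```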
Drill-Down Library (Design Codes)
Typically,
a given design code is part of a larger code and, at the same time, consists of
many smaller codes (Bolender, 2003).
Recall that the design codes extend across the hierarchy of size as a
set of nested equations. As such, they
willingly serve as mathematical pathways to and from the genome.
The
drill-down library illustrates that complex design code equations can be
simplified by expressing them as two or more simpler equations (illustrated as
before and after graphs). This
coherency of equations is a fortunate relationship because it allows us to
define an experimental process mathematically as the passage through a
connected set of equations. In the
drill-down library, the direction of information flow is from the organism to
the gene across the hierarchical levels defined by the relational database
model (Bolender, 2001b). Theoretically,
a drill-down library can be used to find the genetic origin(s) of a biological
part.
Enterprise Biology
Software (2004)
The Enterprise
Biology Software package for 2004 updates the stereology literature database
through 2003, adds the repertoire, analogy, drill-down, and ladder
equation (design code) libraries, upgrades applications, and includes a
progress report. Details of the upgrade
can be found in the installation instructions (BIOLOGYtabs 2004).
Stereology
Literature Database
Database Update: This year, data taken from submitted
reprints were added to the literature database and design codes were harvested
from an additional 165 papers.
Libraries
Previous libraries
were updated to reflect the recently entered data and new libraries were
generated from the data pair and design code files.
Searching Libraries: The data pair library includes two columns
of control data, whereas the design code library includes one column each of
control and experimental data. The
major purpose of these libraries is to generate equation libraries. Recall that:
Data Pairs (control
vs. control)
· Detect a connection between two structures,
two functions, or a structure and a function.
· Detect connections among several structures,
functions, and structures and functions.
· Compare control data coming from one or
several papers.
· Generate equations for predicting structure
and function.
· Identify patterns.
· Define repertoires for organelles and cells
with equations.
Design Codes
(control vs. experimental)
· Detect change quantitatively and qualitatively
as connected sets.
· Identify patterns of change.
· Generate equations for predicting changes in
structure and function.
· Use equations to search for analogies.
· Offer a drill-down approach for identifying nested
equations.
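The two basic libraries can be pictured with a small sketch of how rows might be formed from stored data. This is an assumption about layout, not the project's database schema: data pairs are taken as all two-way combinations of control values within one paper at one hierarchical level, and a design code is taken as a row holding the control and experimental value of one structure. The record fields, sample values, and function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records: (paper, level, structure, control, experimental or None).
records = [
    ("paper1", 8, "mitochondria", 0.20, 0.26),
    ("paper1", 8, "Golgi", 0.02, 0.03),
    ("paper1", 8, "lysosomes", 0.01, None),
    ("paper2", 8, "mitochondria", 0.18, None),
]

def data_pairs(rows):
    """Pair control values (control vs. control) within each paper and level."""
    groups = defaultdict(list)
    for paper, level, name, control, _ in rows:
        groups[(paper, level)].append((name, control))
    pairs = []
    for (paper, level), items in groups.items():
        for (x_name, x), (y_name, y) in combinations(items, 2):
            pairs.append({"paper": paper, "level": level,
                          "x_name": x_name, "x": x, "y_name": y_name, "y": y})
    return pairs

def design_codes(rows):
    """Keep control vs. experimental rows for structures that have both values."""
    return [{"paper": p, "level": lv, "name": n, "control": c, "exptl": e}
            for p, lv, n, c, e in rows if e is not None]

print(len(data_pairs(records)), "data pairs;", len(design_codes(records)), "design codes")
```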
To simplify their
use, both data pair and design code libraries share similar interfaces and
methods for generating equations (Figure 5).