In: Enterprise Biology Software, Version 4.0  © 2004 Robert P. Bolender                    

             

Enterprise Biology Software: V. Research (2004)

 

Robert P. Bolender

 

Enterprise Biology Software Project, P. O. Box 303, Medina, WA  98039-0303, USA

http://enterprisebiology.com

 


Summary

The Enterprise Biology Software Project explores new approaches to discovery in the life sciences by applying mathematics and technology to the biology literature.  This process includes (1) standardizing the literature by moving research data from the pages of journals into the tables of a relational database, (2) generating derived data libraries from the database, (3) searching these new libraries for mathematical patterns, and (4) distributing the databases, libraries, software, and observations freely to contributing authors - on a CD.  The current release includes an updated stereology literature database, new libraries (repertoire, analogy, drill-down, and ladder), a progress report, and several new findings.  In short, the libraries continue to uncover widespread patterns of mathematical order in biology.  These patterns appear not as properties of individual structures, but rather as connections between structures.  Connections, for example, define repertoires of equations that form networks.  In the report, we consider how these equations might help us decipher the genetic regulatory networks.

 

 


Introduction

 

Background

 

The results of the Enterprise Biology Software Project suggest that the mathematical core of biology is well hidden by several factors, including the bias of our experimental methods, the fluctuation of biological systems between equilibrium and nonequilibrium states, and the confounding nature of change.  Since the success of the project depends on accessing this mathematical core, all these issues had to be dealt with, locally and globally, in both control and experimental settings.  This was accomplished by storing published research data in a relational database and then generating libraries of standardized data from it (Bolender, 2001-2004).

 

As the database grew, it became apparent that these new libraries (data pairs; design codes) could generate additional libraries consisting of equations.  The equation libraries created a new opportunity for exploring biology as a mathematical puzzle, one that could be solved – one step at a time – by following the clues accompanying each new library.  In effect, we now know how to translate the biology literature into collections of standardized equations.  The challenge for the reader – and for the writer as well – is to find the clues and then use them to solve additional pieces of the biology puzzle.  

 

How do we find the clues?  You may recall that last year the ladder equation library showed us that biological data – expressed first as data pairs and then as power equations – could be summarized by a single exponential equation.  This suggested that all the parts of biology might be ordered by a single rule.  The major clue to come from that exercise was that order could be found as power equations with $r^2 \geq 0.999$ by sorting the data pairs as ratios (y/x), three or more rows at a time.  In other words, we learned how to convert data pairs into equations quickly.  The next clue came from the observation that the rung equations grouped the data pairs at distinct levels, resembling quantum steps.  A similar clue came from the design code library when it showed us that change behaved, quite unexpectedly, like a constant.  Finally, the example of unfolding complexity by shifting our perspective from a zero-dimensional to a one-dimensional platform provided the catalyzing clue.

 

Armed with so many clues, the next piece of the puzzle was in easy reach.  Let us begin with the fourth clue, the one that recommended a shift in our perspective.  The promise here was that changing the platform for viewing data can change what we see.  Consider this.  What would happen if instead of looking at biology through our eyes we looked at the same biology through the “eyes” of one of its parts?  Although such a question seems a bit odd at first, it turns out to be quite a good one because it takes us to an interesting result.  If we assume, for convenience, that any given part sees all those parts to which it is connected mathematically, then we can share the perspective of that part by simply writing the appropriate equations.  The result becomes interesting to the larger community if the parts found to be connected mathematically turn out to have been produced by a similar genetic regulatory network.  If true, then we have found a relatively simple way of reverse engineering these otherwise elusive networks.  In short, the perspective clue, supported by the first three, produced a new library (repertoire), one that may be offering us our first glimpse of what organelles and cells may be “seeing.”  Along with our glimpse, however, comes a view of biology that – if confirmed – suggests a level of complexity for genetic regulation that may help to explain why genomes can do so much more than just code for proteins.

 

The other new libraries are more or less straightforward.  One uses equations to suggest a strategy for connecting larger structures to molecules and genes (drill-down), whereas the other (analogy) employs equations to hunt for similarities in the biology literature - mathematically.  Finally, a ladder equation library was added for experimental data.

 

 

Progress

 

Currently, the major activities of the Enterprise Biology Software Project consist of entering data and generating new libraries.  Research data submitted by authors were added to the database and sent back as part of a technology package, as shown below.

 

Start
↓
Research Data (reprints contributed by authors)
↓
Standardized Data (stored in a relational database)
↓
Data Pair and Design Code Libraries (connected data sets; minimized bias)
↓
Equation Libraries (mathematical patterns - local and global)
↓
Enterprise Biology Software Package
↓
Sent to Contributing Authors

 

First, data were moved from research papers into a relational database and standardized.  For each paper, all the data at a given hierarchical level of size (1 to 16) were then used to form pairs of data, which juxtaposed control vs. control (data pairs) or control vs. experimental (design codes).  These two basic data libraries were stored as database tables and used to generate the equations that populate the derived data libraries.  Writing equations consisted of filtering and sorting the data, sending the results to an Excel worksheet, and plotting the x and y values as power equations ($y = b x^{a}$).  Finding these equations was simplified by sorting the ratio y/x from low to high and looking for an $r^2$ of 0.999 or better with three or more rows of data.  The resulting graphs were stored in Excel files and included with the software upgrade.  The libraries – old and new – are listed in Table 1.
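The pairing and ratio-sorting step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the structure names and values are invented, the function name `make_data_pairs` is hypothetical, and pairing each combination of structures once (rather than in both directions) is a guess about how the data pair table is built.

```python
from itertools import combinations

# Hypothetical control values for one paper at one hierarchical level.
# Names and numbers are invented for illustration only.
control_data = {
    "mitochondria": 0.20,
    "golgi": 0.02,
    "rough ER": 0.10,
}

def make_data_pairs(values):
    """Pair every structure with every other structure and record y/x."""
    pairs = []
    for (x_name, x), (y_name, y) in combinations(values.items(), 2):
        pairs.append({"x_name": x_name, "y_name": y_name,
                      "x": x, "y": y, "ratio": y / x})
    # Sorting by y/x (low to high) is the step the text uses to expose
    # runs of rows that may fall on one power equation.
    return sorted(pairs, key=lambda p: p["ratio"])

pairs = make_data_pairs(control_data)
for p in pairs:
    print(f'{p["y_name"]} / {p["x_name"]} = {p["ratio"]:.3f}')
```

Three structures yield three pairs here; the sorted list is the kind of table that would then be sent to a worksheet for curve fitting.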

 

Libraries:  Libraries serve as discovery platforms (Table 1).  They include one or more user interface screens, data, help files, and worked examples (e.g., Excel worksheets; case studies).  In effect, each new library helps to solve another piece of the biology puzzle. 

 

Table 1.  Enterprise Biology Software Libraries.

| Library | Data | Entries | Applications |
|---|---|---|---|
| Standardized Stereology Literature | | | |
| · Citation – search | original | 12,853 | Find references |
| · Citation – by paper – contl | original | 1,024 | Print paper – contl data |
| · Citation – by paper – contl + exptl | original | 6,438 | Print paper – contl + exptl data |
| · Methods – search SQL script | original | 1,951 | Find papers by methods |
| · Control Data | original | 15,521 | |
| · Experimental Data | original | 9,677 | |
| · Contl data – by data point | original | 12,164 | Find data by data point; level |
| · Contl+Exptl data – by data point | original | 7,284 | Find data by data point; level |
| · Percentage change data | derived | 7,018 | Find data by change; level |
| · Phenotype data | original | 7,018 | Find data across 14 levels |
| Connection Map | | | |
| · Type 1 (2str/2+points/1level/1paper) | derived | 182 | Find connections/minimize bias |
| · Type 2 (2+str/2+points/1level/1paper) | derived | 81 | Find connections/minimize bias |
| · Type 3 (2+str/2+points/1+levels/1paper) | derived | 323 | Find connections/minimize bias |
| · Type 4 (data pairs) | derived | 22,445 | Find connections/minimize bias |
| Data Replicator | | | |
| · One from one (data from 1 paper) | derived | 702 | Predict data |
| · Many from one (data from 1+ papers) | derived | 27 | Predict data |
| Biological Algorithm | | | |
| · Connections upstream and down | derived | 458 | Predict organs and organisms |
| Data Pair | | | |
| · Global (data from 1+ papers) | derived | 112 | Find connections/minimize bias |
| Design Code | | | |
| · Local (data from 1 paper) | derived | 2,398 | Identify and predict change |
| · Global (data from 1+ papers) | derived | 58 | Identify and predict change |
| Ladder Equation (Data Pairs) | | | |
| · Total data pairs | derived | 25 | Generalize structure in biology |
| · Organ | derived | 19 | Generalize structure by organ |
| · Cell | derived | 19 | Generalize structure by cell |
| · Organelle | derived | 22 | Generalize structure by organelle |
| New for 2004 | | | |
| Repertoire (Data Pairs) | | | |
| · Organelles, inclusions, and cells | derived | 771 | Find connections; Make predictions |
| Ladder Equation (Design Codes) | | | |
| · Total design codes | derived | 25 | Generalize structure in biology |
| Analogy (Design Codes) | | | |
| · Selected design codes | derived | 140 | Look for similar changes |
| Drill-Down (Design Codes) | | | |
| · Selected design codes | derived | 183 | Simplify complexity |

 

Figure 1 indicates that the stereology literature database currently includes 60,000 data entries, of which more than half represent derived data.  This community resource offers abundant opportunities for finding connections between and among the many parts that define biology.  

 

Figure 1.  Research data stored in the stereology literature database.

 

 

 

 

Results:  The principal findings of the project are listed below.

 

2001 to 2003

·    Biological data can be transferred from research papers to a relational database and standardized.

·    The production database demonstrates the feasibility of creating an electronic literature for the life sciences.

·    A connection model for research biology yields widespread mathematical patterns, whereas the traditional change model does not.

·    When stored in a database, published research data serve as a key resource for producing derived data.

·    Biological data are subject to an uncertainty principle and therefore carry an unknown bias.

·    Libraries can be designed that minimize bias (data pairs, design codes).

·    Structures in biology are connected by rule (connection model).

·    Algorithms can be written that generate organs and organisms from a single seed value.

·    Complex research data can be unfolded by viewing data from a higher dimension.

·    Relationships of structure to function can be expressed mathematically.

·    Change in biology can be generalized and predicted.

·    More than twenty thousand connections between structures in biology can be summarized by a single exponential equation. 

 

New for 2004

·    The organization of biological parts can be defined explicitly as repertoires of equations.

·    The repertoire library views connections from different structural perspectives by forming networks of equations. 

·    Experimental data can be summarized by a single exponential equation, as shown earlier for control data.

·    Mathematical analogies can encourage serendipity.

·    Drilling down into a data set can reveal the presence of nested equations.

·    Optimization may be a first principle of biology.

 

 

Repertoire Library (Data Pairs)

 

The repertoire library offers views of the same published data from many different perspectives by transforming tabular data into networks of equations.  These networks, which show how structures are connected mathematically, are displayed as collections of power equations.      

 

Organelles: If we imagine that each of the many parts of biology “sees” its world from a unique perspective, then what might we learn by sharing these different perspectives?  For example, what parts of a cell would be of the greatest interest to a mitochondrion?  One way of answering such a question is to list all those parts to which a mitochondrion has a mathematical connection.  In other words, we can define a mitochondrial perspective as a family of power equations ($y = b x^{a}$) wherein mitochondria (expressed as a volume or surface) would be the x variable:

 

$$y(\text{organelle } i) = b \, x(\text{mitochondria})^{a}.$$

 

Generating these equations with the literature database is a simple task.  Start with the data pairs table (included with the current EBS upgrade), type <mito> into the x name field, press Enter, click on the sort y radio button, and save the results as an Excel file.  In Excel, notice that all the terms in the x name column begin with mito…, whereas the organelles in the y name column carry different names – sorted alphabetically.  Start with Golgi and sort the x/y column numerically (low to high).  Next, highlight the first three data pairs of Golgi and select a scatter graph.  Change the x and y axes to logs and fit the points with a power regression line.  If the $r^2$ is greater than 0.999, add extra points (row by row) to the graph until the $r^2$ comes close to 0.999.  If not, move the calculation box down one row at a time until the $r^2$ becomes greater than or equal to 0.999.  The power equation that appears on the graph describes a mathematical connection between mitochondria and Golgi.  Repeat the process for all the organelles and inclusions connected to mitochondria and the resulting set of equations becomes the mitochondrial perspective.  (Note that for each graph, as shown in Figure 2, there is a corresponding Excel worksheet.)
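The manual worksheet procedure above (start with three rows, grow the selection while the fit holds, otherwise slide down one row) can be approximated in code. The following is a minimal sketch, not the project's own software: the function names `fit_power` and `find_equations` are hypothetical, the sample rows are synthetic, the rows are sorted by y/x as in the rest of the report, and the fit is an ordinary least-squares line on log-log axes, which is assumed to correspond to a spreadsheet power trendline.

```python
import math

def fit_power(xs, ys):
    """Least-squares fit of y = b * x**a on log-log axes; returns (b, a, r2)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    syy = sum((v - my) ** 2 for v in ly)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    if sxx == 0 or syy == 0:
        # Degenerate window (all x or all y identical): treat as no fit.
        return float("nan"), float("nan"), 0.0
    a = sxy / sxx
    b = 10 ** (my - a * mx)
    return b, a, sxy * sxy / (sxx * syy)

def find_equations(rows, threshold=0.999, min_rows=3):
    """Scan (x, y) rows sorted by y/x and collect windows with r2 >= threshold."""
    rows = sorted(rows, key=lambda r: r[1] / r[0])
    found, start = [], 0
    while start + min_rows <= len(rows):
        end = start + min_rows
        b, a, r2 = fit_power(*zip(*rows[start:end]))
        if r2 < threshold:
            start += 1          # slide the window down one row
            continue
        while end < len(rows):  # grow the window while the fit holds
            nb, na, nr2 = fit_power(*zip(*rows[start:end + 1]))
            if nr2 < threshold:
                break
            b, a, r2, end = nb, na, nr2, end + 1
        found.append({"b": b, "a": a, "r2": r2, "n": end - start})
        start = end
    return found

# Synthetic rows: four points on y = 0.1x and three points on y = 0.5x.
rows = [(1, 0.1), (2, 0.2), (4, 0.4), (8, 0.8), (1, 0.5), (3, 1.5), (9, 4.5)]
equations = find_equations(rows)
for eq in equations:
    print(f'y = {eq["b"]:.4f} x^{eq["a"]:.4f}  (rows: {eq["n"]})')
```

With the synthetic rows, the scan separates the two families into two equations. Real data would rarely be this clean, and the choice of threshold and minimum window size strongly affects which equations are reported.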

 

             

Figure 2.  The repertoire library for mitochondria consists of a set of power equations (panel 1: BIOLOGYtabs 4.4; panel 2: Excel worksheet).  Calculations can be viewed by opening an Excel worksheet.

 

In turn, these equations – shown above as lines – can be programmed as computed fields on a work screen wherein a given value for a mitochondrion will generate the expected values for all the connected structures. 

 

Figure 3.  The repertoire library can also display the equations – illustrated as regression lines in Figure 2 – as an interactive network of equations.  Apply a change to mitochondria and see how the connected organelles respond.
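The idea of computed fields driven by a single mitochondrial value can be sketched as follows. The coefficients are invented placeholders rather than values from the library, the function name `predict_connected` is hypothetical, and one equation per organelle is assumed for simplicity even though the library can hold several per organelle.

```python
# Hypothetical mitochondrial perspective: organelle -> (b, a) of y = b * x**a.
# These coefficients are placeholders, not values taken from the library.
repertoire = {
    "Golgi": (0.10, 1.00),
    "rough ER": (0.50, 0.98),
    "lysosomes": (0.05, 1.02),
}

def predict_connected(x_mito, equations):
    """Evaluate every connected equation for one mitochondrial value."""
    return {name: b * x_mito ** a for name, (b, a) in equations.items()}

baseline = predict_connected(1.0, repertoire)
changed = predict_connected(1.2, repertoire)  # apply a 20% increase
for name in repertoire:
    print(f"{name}: {baseline[name]:.4f} -> {changed[name]:.4f}")
```

Changing the single input value updates every connected structure at once, which is the behavior the interactive work screen is described as providing.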

 

The first thing we can learn from this perspective is that mitochondria share connections with several cytoplasmic organelles.  Notice, however, that mitochondria display multiple connections with the same organelle – each one being identified by a unique power equation.  Similar patterns of order appear when perspectives are generated for the remaining organelles and inclusions.  In other words, the proportion of organelles – one to another – typically appears as a collection of fixed relationships.  In effect, the library includes a repertoire (catalogue) of mathematical connections for organelles that can occur across biology.  It tells us that organelles are connected by rule and these rules can be captured by equations.     

 

The repertoire library for organelles and inclusions contains 513 power equations, typically with $r^2$ values equal to or better than 0.999.  If organelles maintain these stoichiometric-like proportions when an equilibrium state is reached after a change, then we can expect that a change in mitochondria will be accompanied by predictable changes in their connected neighbors.  However, each of these neighbors – be it an organelle or inclusion – has its own set of connections.  This means that an intricate web of connections could translate the change we typically observe in a single organelle into a widespread effect.  In such a setting, change would enjoy an unexpected wealth of complexity.  Recall that complexity may be nature’s way of producing new emergent properties (Bolender, 2001).

 

Notice that Figure 3 presents us with a challenging problem along with a new set of clues.  If we imagine that the connections among the downstream products of a genetic regulatory network can serve as quantitative markers of that network, then how would we write a set of equations describing the network for a specific cell in a specific setting?  The solution to such a problem consists of selecting equations from each of the organelle columns that can describe – appropriately – a specific cell type in a specific setting.  In turn, this connected set of equations would be expected to predict the relative proportions of organelles – in the specific cell.  Our reward for solving this problem might be a compound mirror of equations with which to view earlier genetic activity – not unlike the way an astronomer looks at stars.  The question, of course, is how do we select the appropriate equation(s) from each of the organelle columns?

 

The answer may be as simple as moving from one dimension to another.  Consider the following experiment.  Estimate the densities of the eight organelles in the specific cell type and calculate the ratio y/x for each data pair to get the proportions of the organelles ($y(\text{organelle}) = b \, x(\text{mitochondrion})^{a}$).  Finally, compare the new proportions to those in Figure 3 and select the closest match.  In effect, we can use Figure 3 as a lookup table.  More importantly, perhaps, we now know how to turn a zero-dimensional data point (y/x) – without connections – into a one-dimensional power equation – with connections.  In other words, by using the repertoire equations as a lookup table, we can transform an information-poor data point into an information-rich line – with surprisingly little effort.  Indeed, such dimensional shifts may offer a host of new products and clues.
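The lookup-table idea can be illustrated with a short sketch. The candidate equations and the measured values are invented, the function name `closest_equation` is hypothetical, and comparing ratios on a logarithmic scale is a design assumption made here, chosen because the equations appear to be spaced in multiplicative steps.

```python
import math

# Hypothetical candidate equations (b, a) for Golgi vs. mitochondria.
candidates = [(0.05, 1.0), (0.10, 1.0), (0.20, 1.0)]

def closest_equation(x, y, equations):
    """Return the (b, a) whose implied y/x at this x best matches the data."""
    measured = y / x

    def distance(eq):
        b, a = eq
        predicted = b * x ** (a - 1)  # y/x implied by y = b * x**a
        return abs(math.log(predicted / measured))

    return min(equations, key=distance)

# One measured data point for the cell of interest (invented numbers).
match = closest_equation(x=0.20, y=0.022, equations=candidates)
print("closest equation: y = %.2f x^%.2f" % match)
```

The measured ratio of 0.11 sits nearest the candidate with b = 0.10, so that equation is selected; the single data point is thereby attached to a whole line.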

 

Cells: The repertoire library for cells identifies mathematical connections between cells and – as just described for organelles – the power equations display a step-like pattern.  However, the cell-to-cell relationships are complex.  The proportion of cells can be identified within and across species as (1) one cell to one cell or (2) one cell to many cells.  For example, the proportion of pulmonary endothelial cells to type II cells in the mouse, goat and rat is the same (one to one), but this same proportion is also shared between endothelial cells and fat-storing, interstitial, Kupffer, macrophage, mesenchymal, and glial cells (one to many).  The proportion is expressed as $y = 0.2693 \, x^{0.9973}$, where x = endothelial cell and y = cell i.  Such a result, which persists for many combinations of cells, suggests that proportions of cells are ordered by rule and that these rules are being conserved across species.  Although the genetic mechanism responsible for controlling cell proportions is unknown, at least we now know that such proportions exist and that they can be quantified.

 

Global View:  The repertoire library was also used to look for general patterns of organization.  To generate global views for organelles and cells, the power equations were fitted to exponentials as described earlier for ladder equations (Bolender, 2003).  When plotted as a group (without regression lines) they offer a striking view of organelle connections (Figure 4).  Each blue point represents the y intercept of a power equation and each stack of points a different organelle view.  The figure suggests that the cellular mechanism responsible for defining the organellar composition of cells operates by discrete steps (quanta) and that the underlying principles may be related, as suggested by the similar ranges and slopes of the exponential stacks.  For those readers interested in exploring the organellar organization of cells, two questions quickly fall into focus.  When and why do organelles locate at specific locations in the stack?          

 

Figure 4.  When the y intercepts of power equations are fitted to exponential equations, a global pattern of order can be seen.  Connections between organelles appear to be ordered by rule.  

 

Perspective: The repertoire library offers a global view of structural order by connecting local equations.  It shows that different animals can produce remarkably similar parts, but that the proportion of these parts – one to another – can be either similar or different.  The fact that different animals can share similar connections and similar proportions of parts suggests that they might also be sharing a similar genetic blueprint – or simply following a design strategy that leads to a similar phenotype.  This means that each set of equations attached to a specific part reflects the general rules guiding the establishment of such relationships.  In other words, the equations would seem to offer a broad overview of a fundamental organizing principle of biology.

 

If this proves to be the case, then it would not seem unreasonable to assume that we can assemble sets of equations for specific animals.  Such an accomplishment would be accompanied by a substantial improvement in our ability to predict a large number of parts from one or a few seed values.  Recall that the current form of the repertoire library makes no attempt to connect data across hierarchical levels – organelles and cells are treated separately.   However, the only factor limiting such connections seems to be the amount of data available.  In the future, large farms of equations spanning many hierarchical levels are likely to become commonplace. 

 

 

Ladder Equation Library (Design Codes)

 

Last year, ladder equations were reported for the data pair library (control vs. control).  By increasing the number of entries in the design code library (control vs. experimental) to 2,400, a similar – albeit provisional – estimate was also made for the experimental data.   

 

If we start with the 2,400 design codes in the literature database, form ratios (structure y/structure x), sort the ratios (ascending), and collect sets of ratios that give power curves with an $r^2$ = 0.9999, we can generate a set of 23 equations describing the design codes.  Since the slopes (a) of these power curves also tended to be close to one, the y intercept (b) of each equation served to identify a unit of order.  In turn, when the y intercepts were plotted – as if they were rungs on a ladder – a single exponential equation of the form $y = c \, e^{kx}$ (the ladder equation, with fitted constants $c$ and $k$) appeared.  This means that we can summarize the experimental data set of more than 2,400 entries with a single exponential equation having an $r^2$ of 0.9991:

$$y = 0.1194 \, e^{0.1354x},$$

where y is the y intercept of the power equation and x the rung number.
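To make the ladder equation concrete, the sketch below evaluates it for a run of rungs. The two constants come from the equation above; numbering the rungs from 1 to 23 is an assumption based on the 23 power equations mentioned in the text, and the names `B`, `K`, and `ladder` are introduced only for this illustration.

```python
import math

B, K = 0.1194, 0.1354  # constants of the ladder equation reported in the text

def ladder(rung):
    """Expected y intercept of the power equation at a given rung number."""
    return B * math.exp(K * rung)

# Assumed rung numbering 1..23 (one rung per power equation).
rungs = [ladder(n) for n in range(1, 24)]

# Ratio between neighboring rungs: constant for any exponential.
step = ladder(2) / ladder(1)
print(f"step between rungs: {step:.4f}")
for n, value in enumerate(rungs, start=1):
    print(n, round(value, 4))
```

Because the equation is exponential, each rung sits a fixed factor above the one below it (about 1.145, that is, roughly 14.5% higher), which is one way to read the step-like spacing described in the report.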

 

 

Analogy Library (Design Codes)

 

In biology, interpreting the results of an experiment often includes reasoning by analogy.  To wit, resemblances imply similarities.  The analogy library takes this convention one step further by employing power equations as mathematical analogies.  For example, if we detect a specific amount of change in a structure (e.g., mitochondria) and want to know where a similar change has occurred elsewhere, the library can provide such information. 
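One way to picture such a search is a simple ratio match over design codes. This is a simplified stand-in, not the library's actual method (which is described as using power equations): the records, field names, the function `similar_changes`, and the 5% tolerance are all assumptions made for illustration.

```python
# Invented design-code records: one control and one experimental value each.
design_codes = [
    {"structure": "mitochondria", "source": "paper A",
     "control": 0.20, "experimental": 0.26},
    {"structure": "rough ER", "source": "paper B",
     "control": 0.10, "experimental": 0.131},
    {"structure": "lysosomes", "source": "paper C",
     "control": 0.02, "experimental": 0.05},
]

def similar_changes(ratio, records, tolerance=0.05):
    """List records whose experimental/control ratio is within tolerance."""
    hits = []
    for rec in records:
        rec_ratio = rec["experimental"] / rec["control"]
        if abs(rec_ratio / ratio - 1.0) <= tolerance:
            hits.append((rec["structure"], rec["source"], rec_ratio))
    return hits

# Where else has a change of about +30% been observed?
hits = similar_changes(1.30, design_codes)
for structure, source, rec_ratio in hits:
    print(f"{structure} ({source}): ratio {rec_ratio:.3f}")
```

Here a 30% increase in mitochondria is matched by a similar change in rough ER from a different paper, while the much larger lysosomal change is excluded.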

 

 

Drill-Down Library (Design Codes)

 

Typically, a given design code is part of a larger code and, at the same time, consists of many smaller codes (Bolender, 2003).  Recall that the design codes extend across the hierarchy of size as a set of nested equations.  As such, they can serve as mathematical pathways to and from the genome.

 

The drill-down library illustrates that complex design code equations can be simplified by expressing them as two or more simpler equations (illustrated as before and after graphs).  This coherency of equations is a fortunate relationship because it allows us to define an experimental process mathematically as the passage through a connected set of equations.  In the drill-down library, the direction of information flow is from the organism to the gene – across the hierarchical levels defined by the relational database model (Bolender, 2001b).  Theoretically, a drill-down library can be used to find the genetic origin(s) of a biological part.    

 

 

 


 

Methods and Results

 

 

Enterprise Biology Software (2004)

 

The Enterprise Biology Software package for 2004 updates the stereology literature database through 2003, adds the repertoire, analogy, drill-down, and ladder equation (design code) libraries, upgrades applications, and includes a progress report.  Details of the upgrade can be found in the installation instructions (BIOLOGYtabs 2004).

 

 

Stereology Literature Database

 

Database Update: This year, data taken from submitted reprints were added to the literature database and design codes were harvested from an additional 165 papers.    

 

 

Libraries

 

Previous libraries were updated to reflect the recently entered data and new libraries were generated from the data pair and design code files.

 

Searching Libraries: The data pair library includes two columns of control data, whereas the design code library includes one column each of control and experimental data.  The major purpose of these libraries is to generate equation libraries.  Recall that:

 

Data Pairs (control vs. control)

 

·    Detect a connection between two structures, two functions, or a structure and a function.

·    Detect connections among several structures, functions, and structures and functions.

·    Compare control data coming from one or several papers.

·    Generate equations for predicting structure and function.

·    Identify patterns.

·    Define repertoires for organelles and cells with equations.

 

Design Codes (control vs. experimental)

 

·    Detect change quantitatively and qualitatively as connected sets.

·    Identify patterns of change.

·    Generate equations for predicting changes in structure and function.

·    Use equations to search for analogies.

·    Offer a drill-down approach for identifying nested equations. 

 

To simplify their use, the data pair and design code libraries share similar interfaces and methods for generating equations (Figure 5).