In: Enterprise Biology Software, Version 6.0  © 2006 Robert P. Bolender                            

 

Enterprise Biology Software: VI. Research (2006)

 

Robert P. Bolender

Enterprise Biology Software Project, P. O. Box 303, Medina, WA  98039-0303, USA

http://enterprisebiology.com

 


Summary

 

The Enterprise Biology Software Project explores challenging questions in the life sciences by looking for answers in the biology literature.  Since many of these questions can be translated into mathematical puzzles, they can be solved, for the most part, by generating empirical equations.  In turn, these equations can produce new and more difficult puzzles, often extending to the mathematical core of biology.  For example, the progress report this year describes how we can solve one of these harder puzzles by first separating two interacting complexities and then unfolding each complexity in its turn.  In effect, the question: “How do we reverse engineer biology?” becomes a double puzzle: “How do we separate two complexities (methods and biology) and how do we unfold and refold them?”  A solution to the methods puzzle becomes a Universal Biology Database, which in turn can be used to solve the biology puzzle by characterizing biological phenotypes with equations.  Such phenotypes can be represented either as a single equation or as a stack of equations.  When summarized across biology, these equations offer a glimpse of the core by revealing the mathematical organization of the parts as a biological blueprint.  There is, however, one small surprise.  In contrast to the widely held view that biology is largely a nonlinear system, most of these phenotypes turn out to be linear.  This finding is most welcome in that it greatly simplifies the task of reverse engineering biology.  By including a query by example interface, the 2006 software package offers ready access to this new data-driven biology and may become one of our technology tickets to the future.  Imagine, if you will, a time when each research paper becomes a puzzle with a unique mathematical solution, where biology operates according to a rulebook, and we can read from that book.  Interested?  I invite you to run the software package.

 


 

Introduction

 

Biology plays by the rules.  By discovering these rules and also playing by them, we improve our chances for success – in whatever we attempt.  This defines a central strategy of the Enterprise Biology Software Project.  Indeed, this process of discovery based on solving mathematical puzzles is wonderfully simple and can be applied by everyone.  All we have to do is follow the clues to find the rules that solve the puzzles.  Each solution, of course, generates a new collection of clues and the process continues from one puzzle to the next.    

 

This rule-based approach has yielded two key pieces of information.  We now know that repertoire equations can unfold complexity into a well-defined order, wherein these power equations have coefficients of determination approaching one (r^2 ≈ 1).  Moreover, these equations often display an exponent a (Y = bX^a) that also tends toward one.  This means that the repertoire equations effectively become linear (Y = bX).

 

Since the repertoire equations define order in biology as the proportion of two parts X and Y, we can assemble enough information to begin the process of reverse engineering biology.  Moreover, we can design a Universal Biology Database with virtually unlimited scaling properties by simply arranging the ratios of the parts (Y/X) in numerical order and assigning them to decimal steps.  When fitted to regression curves, these data pairs become decimal repertoire equations.  The database becomes universal because it can accept data from all disciplines capable of forming ratios of structures, which of course includes molecules.  By storing data from many different disciplines in the same database table, the data become mathematically integrated across disciplines, animals, settings, et cetera.
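The idea of sorting ratios and assigning them to decimal steps can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the report does not spell out the exact assignment rule, so the sketch assumes that each ratio goes to the nearest step; the list of steps is borrowed from Table 3 and is not the full set used by the database; and the data pairs are invented.

```python
# Hypothetical list of decimal steps (values borrowed from Table 3 for illustration only).
DECIMAL_STEPS = [0.3, 0.5, 0.7, 0.9, 1.0, 1.5]

def nearest_step(ratio, steps=DECIMAL_STEPS):
    """Assumed rule: pick the step with the smallest absolute difference from the ratio."""
    return min(steps, key=lambda s: abs(s - ratio))

def assign_steps(pairs):
    """pairs: iterable of (x, y) values. Returns (ratio, step, x, y) rows sorted by ratio Y/X."""
    rows = []
    for x, y in pairs:
        ratio = y / x
        rows.append((ratio, nearest_step(ratio), x, y))
    return sorted(rows)

# Invented data pairs (X, Y)
rows = assign_steps([(200.0, 96.0), (10.0, 14.5), (50.0, 51.0), (8.0, 4.16)])
for ratio, step, x, y in rows:
    print(f"X={x:g}  Y={y:g}  Y/X={ratio:.2f}  ->  decimal step {step}")
```

Rows that land on the same step form the point set that would then be fitted to a regression curve.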

 

The software is full of surprises.  When running the program from tab 8 of the Universal Biology Database, for example, we can easily compare the proportions of the same two structures in control and experimental settings.  I was surprised by how often controls and experimentals share the same decimal repertoire equation – even when the absolute amounts differ.  This tells us that biology prefers to make more or less of the same thing, rather than making something different.  Changes in the proportions of structures, however, can be readily found in development, disease, and aging.  Perhaps the most reassuring finding of all, however, is the fact that many different labs routinely produce the same results (i.e., the experimental data share the same decimal repertoire equation(s)).  This shows that both the methods and the researchers applying them are indeed capable of generating reproducible results. 

 

The new software package comes with a new learning curve, but one that has a very gentle slope.  Specifically, by adding a “query by example” interface we can write and send complex instructions to the database.  This provides full access to the information with great ease and flexibility.  Simply working through the examples should be all that is required to become comfortable with the new format.

 

Although we’ll consider these topics later in this report, a brief introduction here may be helpful.  Why do we want to reverse engineer biology?  The answer is painfully simple.  It encourages biology to become a data-driven science.  Recall that chemistry is a product of physics and that biology is a product of both physics and chemistry.  Chemistry, for example, is very effective as a data-driven science because it plays by a set of stoichiometry rules.  We know these rules and can use them to explain how small parts (atoms) are connected to form larger molecules and compounds and to write balanced equations for chemical reactions.  In other words, we have access to the mathematical core of chemistry.  Do analogous rules and balanced equations exist for biology?  Yes, of course they do.  The only difference is that the rules of biology are much better hidden than those of chemistry and must be extracted from the research literature with the help of mathematics and technology.  What evidence supports this claim?  Biological stereology can provide the balanced equations and a Universal Biology Database the stoichiometry of the parts.  In effect, we can produce a biological blueprint that shows how the parts of biology larger than molecules are connected by rule.  As such, this blueprint provides the first of many new access routes to the mathematical core of biology – exactly what we would expect from a data-driven science.

 


 

Methods and Results

 

 

The main purposes of the Progress Report this year are (1) to introduce a new interface for the literature database and (2) to explore an engineering model (forward and reverse) for biological research.  As always, the best introduction to the software package comes from running the programs.

 

 

Extended Database Model

 

The central feature of the first universal database was a single table that could accommodate both control and experimental data – even when coming from many different disciplines (Bolender, 2005).  Effectively, it defined a simple - yet powerful - way of integrating diverse data within and across research fields.  Unfortunately, most of the details – defined as data entry fields in the original databases - were no longer available.  The new database remedies this shortcoming by connecting all three databases into a single operational unit. 

 

 


By adding a “query by example” interface to the Universal Biology Database 2, even the beginner can quickly learn to write intricate SQL (structured query language) scripts and submit them to the database. 

 

 

 

Enterprise Biology Software for 2006

 

 

This year the software is being distributed on a mini CD, one that fits conveniently in a standard envelope.  The CD includes a relational database, a runtime database engine, and a small collection of programs and documents.   

 

 

The main menu of the program includes the progress report and an index of programs.  Click on an item in the list to view it. 

 

 

1.     Progress Report: A pdf file (this document).

 

2.      Universal Biology Database: The database folder includes eleven tabs - six of which can be used to run programs.  A central feature of the new software consists of a query by example interface.  It allows us to define a database query by simply selecting items from lists or by entering a word (or part thereof), number, or a collection of words or numbers into one or more data entry fields.  As each selection is made, the SQL script – often shown at the bottom of the screen - is modified accordingly.  When completed, the set of instructions (query) is sent to the database by clicking on the Retrieve button.  The results can be viewed, printed, stored as a file, or sent to an Excel worksheet.

 

 

Tab 1 – Welcome: Click on <Read> to view the objectives.

 

 

Tab 2 - Introduction: Click on <Read> for a brief introduction to the database.

 

Tab 3 - Background: Click on <Read> for an explanation of why stereological data can create a foundation for the Universal Biology Database.

 

Tab 4 – UBD data table: Select tab 4 and click on run.  This screen displays the main table of the Universal Biology Database 1.0 updated to 2.0.  Green identifies control data and blue experimental data.  In the lower left-hand corner of the screen, just to the left of the horizontal scrolling arrow, note the thick black line.  Using the mouse cursor, drag it to the right to produce a split screen.  The Excel worksheet used to calculate the decimal repertoire equations can be found at C:\Program Files\EBSTicket 2006\Files\2006_regression_equations_01.xls.

 

 

Tab 5 – UBD control data: Select tab 5 and click on run.  The screen displays the new query by example interface.  Notice that the same data fields appear under two different headings – data catalogue (left panel) and query by example (right panel).  Use the catalogue to discover what the database contains and the query panel to assemble a set of instructions for the database.  Numerous examples of query criteria appear in the drop down lists, illustrating the remarkable flexibility of this approach.  Simply work through some of the examples to see how the interface works.

 

 

A few introductory comments, however, may be helpful.  Items selected from either panel – catalogue or query – can be used in a search.  Items selected from the catalogue will retrieve all those items identical to the one chosen.  For example, selecting <bird> from the catalogue panel and clicking on the Retrieve button produces 58 responses.  In contrast, typing <like %bird%> into the organism field of the query panel produces 110 responses.  In the first case, only those rows containing just the word <bird> were retrieved, whereas in the second case all those entries containing the word <bird> alone or in a word string were retrieved.  Furthermore, note that entries coming from the catalogue panel are case sensitive (upper and lower case), whereas those from the query panel are not.
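The difference between a catalogue selection and a <like %...%> entry can be reproduced with any SQL engine.  The sketch below uses Python's built-in sqlite3 module as a stand-in; the report does not name the database engine shipped on the CD, and the table name, column name, and rows are hypothetical.  SQLite happens to behave like the description above: the = test matches the whole entry and is case sensitive, whereas LIKE matches substrings and ignores ASCII case by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and column; the real schema is not given in the report.
conn.execute("CREATE TABLE ubd_control (organism TEXT)")
conn.executemany(
    "INSERT INTO ubd_control VALUES (?)",
    [("bird",), ("Bird",), ("bird, chicken",), ("songbird",), ("rat",)],
)

# Catalogue-style search: the whole entry must equal 'bird' (case sensitive).
exact = conn.execute(
    "SELECT organism FROM ubd_control WHERE organism = ?", ("bird",)
).fetchall()

# Query-panel search: any entry containing 'bird' (case insensitive in SQLite).
partial = conn.execute(
    "SELECT organism FROM ubd_control WHERE organism LIKE ?", ("%bird%",)
).fetchall()

print(len(exact), "exact match(es);", len(partial), "substring match(es)")
```

As in the bird example above, the substring search returns more rows than the exact match because it also picks up entries where the word appears inside a longer string.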

 

Let’s try a simple query together.  The objective of our search will be to retrieve all the data coming from the CA1 region of the control hippocampus.  As shown below, we have written our request in the X Name field <like %ca1%>, using the query panel.  Notice how the SQL script in the box at the bottom of the screen reflects our choice.

 

 

Clicking on the Retrieve button yields the result below, which includes a collection of 920 screens.

 

 

To view these data as a scrolling screen instead, press the view data button and the following screen appears.

 

 

When we scroll to the end of this table, two graphs appear.  The first shows a log-log plot of the X, Y data and the second a histogram of the decimal repertoire equations.  The log-log plot displays the published data of CA1 as a mathematical puzzle.  Finding a solution consists of using regression analysis to fit the points to power curves that carry coefficients of determination (r^2) close to one.  This process generates a family of power curves, called repertoire equations, that define how CA1 is related to itself and to other structures in the brain.  Alternatively, the process can be simplified.  Since each data ratio in the Universal Biology Database is attached to a decimal repertoire equation, a collection of ratios automatically becomes a stack of equations.  Notice the distinct steps in the histogram of the second graph.  They represent the repertoire equations as decimal steps and illustrate the total range of connections available to the CA1 region of the control hippocampus.  Recall that each connection also defines the proportion of the two parts, X and Y.

 

 

To illustrate the process of finding equations, we can send this data table to an Excel worksheet and then do some curve fitting.  This is accomplished by clicking first on the to excel button and then on the from excel button.  To keep a copy of this worksheet, change the name from working.xls to something else.  Otherwise, the file will be overwritten the next time you click on the to excel button.  To set the Excel path, click on INSTALL READER in the main menu.

 

 

Using the graphing tools of Excel, we can readily express the connections between CA1 and all other structures as a family of power curves.  In the example below, the regression curve is calculated for those points belonging to the 0.5 decimal repertoire equation.  This equation tells us that two parts of structure X are connected to one part of structure Y (Y/X = 1/2 = 0.5, or X:Y = 2:1), where the parts can be numbers of cells or volumes of compartments.  Notice that the r^2 is indeed close to one (0.9999).  Generating these repertoire equations is a first step toward working out how structures are related to one another.  Once we know these relationships, the task of writing equations for structures all across the biological hierarchy of size becomes routine.  Recall that a solution to one of the puzzles last year consisted of writing the repertoire equations for the hippocampus and then connecting them to produce a network (Bolender, 2005).  This allowed us to predict the many parts of the hippocampus from a single value, for five different animal species, in health and disease.  In short, the network of repertoire equations, produced by reverse engineering the hippocampus, illustrates how these equations can be used in developing algorithms for diagnosis and prediction.
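For readers without Excel, the same kind of curve fitting can be sketched in Python.  The function below fits Y = bX^a by ordinary least squares on the log-transformed data and reports b, a, and r^2.  The data points are made up to scatter around Y = 0.5X; they are not values from the database, and the function name is our own.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log10(Y) = log10(b) + a*log10(X). Returns (b, a, r2)."""
    u = [math.log10(x) for x in xs]
    v = [math.log10(y) for y in ys]
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    suu = sum((ui - mean_u) ** 2 for ui in u)
    svv = sum((vi - mean_v) ** 2 for vi in v)
    suv = sum((ui - mean_u) * (vi - mean_v) for ui, vi in zip(u, v))
    a = suv / suu                      # exponent (slope on the log-log plot)
    b = 10 ** (mean_v - a * mean_u)    # coefficient (intercept on the log-log plot)
    r2 = (suv * suv) / (suu * svv)     # coefficient of determination in log space
    return b, a, r2

# Made-up points scattered around Y = 0.5 * X
xs = [10.0, 40.0, 150.0, 600.0, 2500.0]
ys = [5.1, 19.6, 76.0, 297.0, 1260.0]
b, a, r2 = fit_power_law(xs, ys)
print(f"Y = {b:.4f} * X^{a:.4f}   r^2 = {r2:.4f}")
```

When the exponent a comes out close to one, the fitted coefficient b can be read directly as the proportion Y/X.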

 

 

Tab 6 – UBD experimental data: Select tab 6 and click on run.  Here the interface is the same as the one just described for the control data.  Therefore, a brief example will suffice.  Let’s look for all the data on schizophrenia in males over the age of 60.  The screen below shows how we access these data, using our query by example interface.

 

 

The new puzzle shown below can be solved for the decimal repertoire equations, which in turn can be used to compare schizophrenia to males of other ages, to females, or to other diseases of the central nervous system.  Alternatively, we can search for potentially related structures simply by viewing the contents of an individual decimal step in the data table.

 

 

Tab 7 – UBD control + experimental data: Select tab 7 and click on run.  Here, the Universal Biology Database table includes all the control and experimental data.  Let’s try another query.  What is the effect of aging on mitochondria?  For the X structure we can type in <like %aging%> and for the Y structure <like %mito%>.   Our result is a collection of control and experimental points that can be evaluated either by inspection or by fitting the points to log-log regression lines (power curves). 
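To make the link between the data entry fields and the SQL script more concrete, here is a minimal sketch of how a query by example form could be turned into a query.  It is not the code used by the software package; the function, the table name, and the field names are all hypothetical.

```python
def build_query(table, criteria):
    """Turn {field: criterion} into a parameterized SELECT statement.

    A criterion that starts with 'like ' becomes a LIKE test; anything else
    becomes an equality test. Table and field names are assumed to come from
    a fixed, trusted list (such as a drop-down), never from free text.
    """
    clauses, params = [], []
    for field, criterion in criteria.items():
        text = criterion.strip()
        if text.lower().startswith("like "):
            clauses.append(f"{field} LIKE ?")
            params.append(text[5:].strip())
        else:
            clauses.append(f"{field} = ?")
            params.append(text)
    sql = f"SELECT * FROM {table}"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params

sql, params = build_query(
    "ubd_all",                                            # hypothetical table name
    {"x_name": "like %aging%", "y_name": "like %mito%"},  # hypothetical field names
)
print(sql)
print(params)
```

Each filled-in field adds one condition, and the conditions are joined with AND, which is why entering criteria in two fields narrows the result set.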

 

Tab 8 – UBD connection repertoire - Blueprint 1.0 - Control Data: The figure below illustrates the biological blueprint as a three-dimensional plot.  It represents a collection of equations defining the proportions of one structure to another.  In effect, it gives us an empirical view of the mathematical core of biology.   

 

 

Select tab 8 and click on run.  The connection repertoire table uses the decimal repertoire equations – expressed as proportions of whole numbers - to show how biological parts are connected by rule.  In effect, it provides a structural blueprint for biological parts larger than molecules in terms of a well-defined stoichiometry.  This mathematical overview of biology shows how structural connections define phenotypes and can provide insights into how, when, and where these phenotypes change. 

 

The connection repertoire table shown below identifies the structures in an X,Y pair and shows how they are connected quantitatively to each other and to related structures.  To be included in the table, the same data pair must appear at least three times in the database.  This represents a rigorous test of both the methods and the investigators in that the three data pairs typically come from three different papers.     
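The inclusion rule can be expressed as a simple counting filter.  In the sketch below we assume that "the same data pair" means the same X structure, Y structure, and decimal step; the report does not define this precisely, and the rows are invented.

```python
from collections import Counter

def repeated_pairs(rows, min_count=3):
    """rows: (x_name, y_name, decimal_step) tuples.
    Keep only combinations that occur at least min_count times."""
    counts = Counter(rows)
    return {key: n for key, n in counts.items() if n >= min_count}

# Invented rows, as if each came from a different paper
rows = [
    ("mitochondrion", "peroxisome", 0.1),
    ("mitochondrion", "peroxisome", 0.1),
    ("mitochondrion", "peroxisome", 0.1),
    ("nucleus", "peroxisome", 0.2),
    ("nucleus", "peroxisome", 0.2),
]
kept = repeated_pairs(rows)
print(kept)
```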

 

 

Several distinct patterns quickly emerge from this table.  A given pair of structures (X,Y) can display several distinct phenotypes, characterized as a multiple of whole numbers (X:Y).  For example, the proportion of mitochondria to peroxisomes can be 10:1, 20:1, and 33:1.  Notice also that different pairs of structures can share similar proportions.  This overlap can be used to link the equation associated with each data pair into a local or global network of equations.  Such networks provide a substrate for connecting other data types and become a platform for diagnosis and prediction.  Table 1 shows the data set that would be the starting point for assembling such a network for peroxisomes.

 

Table 1 Equations representing the proportions of organelles.

Data Pair Proportion          Decimal Repertoire Equation

Mitochondrion:Peroxisome
    33:1                      Y = 0.03493 X^0.9999
    20:1                      Y = 0.05459 X^0.9999
    10:1                      Y = 0.12318 X^0.9999
    25:1                      Y = 0.04442 X^0.9999
    25:2                      Y = 0.08504 X^0.9998
    14:1                      Y = 0.07455 X^0.9999
    17:1                      Y = 0.06483 X^1.000

Nucleus:Peroxisome
    5:1                       Y = 0.22397 X^0.9999
    10:1                      Y = 0.12318 X^0.9999
    7:1                       Y = 0.17464 X^0.9998
    14:1                      Y = 0.07455 X^0.9999
    5:2                       Y = 0.4468 X^0.9999

Lysosome:Peroxisome
    3:1                       Y = 0.34610 X^0.9998
    5:3                       Y = 0.64920 X^0.9998
    3:2                       Y = 0.74784 X^0.9999
    1:1                       Y = 1.19840 X^0.9996
    1:2                       Y = 2.22114 X^0.9999
    1:3                       Y = 3.44783 X^0.9998

Golgi:Peroxisome
    5:1                       Y = 0.22397 X^0.9999
    5:2                       Y = 0.44680 X^0.9999
    3:2                       Y = 0.74784 X^0.9999
    5:4                       Y = 0.84873 X^0.9999
    1:1                       Y = 1.1984 X^0.9996
    1:3                       Y = 3.44783 X^0.9998

Lipid Droplet:Peroxisome
    10:1                      Y = 0.12318 X^0.9999
    5:2                       Y = 0.44680 X^0.9999
    5:4                       Y = 0.84873 X^0.9999
    2:3                       Y = 1.72417 X^0.9999
    1:10                      Y = 12.1598 X^0.9997
    1:2                       Y = 2.22114 X^0.9999
    1:3                       Y = 3.44783 X^0.9998
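The overlap in Table 1 can be found mechanically.  The sketch below copies the proportions from Table 1, groups the structures by shared proportion, and lists the resulting links.  It is only one simple way to look for candidate links; it is not the procedure used to build the networks in the software, and the function names are our own.

```python
from collections import defaultdict
from itertools import combinations

# Proportions copied from Table 1; each structure is paired with the peroxisome.
TABLE_1 = {
    "Mitochondrion": ["33:1", "20:1", "10:1", "25:1", "25:2", "14:1", "17:1"],
    "Nucleus": ["5:1", "10:1", "7:1", "14:1", "5:2"],
    "Lysosome": ["3:1", "5:3", "3:2", "1:1", "1:2", "1:3"],
    "Golgi": ["5:1", "5:2", "3:2", "5:4", "1:1", "1:3"],
    "Lipid Droplet": ["10:1", "5:2", "5:4", "2:3", "1:10", "1:2", "1:3"],
}

def shared_proportions(table):
    """Map each proportion to the structures that display it; keep those with 2+ structures."""
    by_prop = defaultdict(list)
    for structure, props in table.items():
        for prop in props:
            by_prop[prop].append(structure)
    return {prop: names for prop, names in by_prop.items() if len(names) > 1}

def links(shared):
    """One link per pair of structures that share a proportion."""
    edges = set()
    for prop, names in shared.items():
        for first, second in combinations(sorted(names), 2):
            edges.add((first, second, prop))
    return sorted(edges)

shared = shared_proportions(TABLE_1)
edges = links(shared)
for prop, names in shared.items():
    print(prop, "shared by", ", ".join(names))
```

Each link marks a place where two data pairs share a decimal repertoire equation and could therefore be joined in a network.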

 

Why is access to the connection repertoire blueprint important?  Recall that biology often uses a remarkably similar genome to produce a great variety of different animal species.  Given our current understanding, it appears that we are a product of at least two interacting forces: our genes and the way they and their products produce and assemble our parts. 

 

The decimal repertoire equations suggest that biology has evolved a common parts inventory that it draws from when assembling people, mice, frogs, or fish.  The connection repertoire table allows us to explore phenotypes as a function of their basic building blocks, namely the decimal repertoire equations.  By defining phenotypes mathematically, we can study their life history in a given species and detect departures from what is expected to be normal.  The table also moves us closer to the genome.  When, for example, the proportions of the parts match the proportions of their constituent molecules, we can predict one from the other.  Of course, we might discover that some decimal repertoire equations can be explained simply by determining the number of duplicate genes being read at a given time or that exist in a given species.  Think of it this way.  If genes individually cannot determine a species, then perhaps the number of copies of a given gene can.   

 

If we summarize the connection repertoire table with a histogram, then the full range of phenotypic expression in biology can be seen.  Notice that practically all the connections can be captured with only about 50 equations, with far fewer doing most of the work.  The graph below shows that the connections between the parts tend to define five major peaks, each showing a clear preference for a specific proportion. 

 

 

When we focus on the connections of a single structure, such as the mitochondrion, a slightly different pattern appears.  Although this organelle uses decimal repertoire equations from fewer peaks, the positions of the peaks remain more or less the same as they appear in the total data set.

 

 

Finally, we can use the connection repertoire to make a few preliminary observations as to the biological preferences.  Of the total entries (4,296), roughly 40% occur in six decimal repertoire equations (Table 2).   

 

Table 2 Decimal Repertoire - Total Data Set - Most Popular Equations and Proportions

Decimal Repertoire Equation     Sum     %       Proportion (X:Y)

0.02                            106     6.5     50 to 1
0.1                             237     14.6    10 to 1
0.3                             296     18.3    3 to 1
1.0                             469     29      1 to 1
1.5                             311     19      2 to 3
10                              200     12      1 to 10

 

When we consider just counts of neurons (Table 3), we find that almost 70% of the connections occur in six decimal repertoire equations that define only five proportions: 3 to 1, 2 to 1, 3 to 2, 1 to 1, and 2 to 3.  Notice that the proportions are largely ratios of small whole numbers – curiously reminiscent of biochemical stoichiometry and the law of multiple proportions. 

 

Table 3 Decimal Repertoire - Numbers of Neurons - Most Popular Equations and Proportions

Decimal Repertoire Equation     Sum     %       Proportion (X:Y)

0.3                             39      17      3 to 1
0.5                             18      8       2 to 1
0.7                             25      11      3 to 2
0.9                             24      11      ~ 1 to 1
1.0                             91      40      1 to 1
1.5                             30      13      2 to 3

  

Since both the central and peripheral nervous systems rely on tandem connections between neurons, disrupting cell proportions at any level may generate a variety of predictable consequences – upstream and down.  Unintended consequences also exist.  Last year, for example, the connection matrix for the lateral geniculate nucleus uncovered the disturbing fact that altering the genome of mice – at locations considered unrelated to the nervous system – can actually change the proportions of cells in the brain (Seecharan et al., 2003; Bolender, 2005). 

 

Tab 9 – UBD change: Select tab 9 and click on run.  The change data come from the design codes described previously (Bolender, 2003-2005).  In this screen, X identifies control or experimental data and Y experimental.  A ratio >1 indicates an increase (red), <1 a decrease (blue), and =1 no change (green).  Here the proportions are largely ratios of small whole numbers – once again reminiscent of the law of multiple proportions.  Finally, bear in mind that these decimal repertoire equations belong exclusively to change data.  
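The color coding of the change data follows a simple rule, sketched below in Python.  The function name and the tolerance used to decide when a ratio counts as exactly one are our own assumptions.

```python
import math

def classify_change(x, y, rel_tol=1e-9):
    """Classify the change ratio Y/X using the color scheme of the change screen."""
    ratio = y / x
    if math.isclose(ratio, 1.0, rel_tol=rel_tol):
        return ratio, "no change", "green"
    if ratio > 1.0:
        return ratio, "increase", "red"
    return ratio, "decrease", "blue"

# Invented values: X = reference measurement, Y = experimental measurement
for x, y in [(100.0, 150.0), (100.0, 80.0), (50.0, 50.0)]:
    ratio, direction, color = classify_change(x, y)
    print(f"Y/X = {ratio:.2f}: {direction} ({color})")
```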

 

Let’s use this screen to see what exposures can change the hippocampus.  Type <like %hippo%> into the X Structure field.  Click on the Retrieve button and then on the show data button.  The screen below identifies the direction of the change and the conditions responsible.  Such a sort provides insights into the repertoire of change available to the hippocampus.  Notice, for example, how different conditions can produce both similar and different responses – in similar and different species.  For further information, see Puzzle 2: The Hippocampus (Bolender, 2005).

 

 

Scroll to the bottom of the screen.  Notice in the distribution histogram that most parts of the hippocampus change only slightly or not at all – a general pattern that persists throughout the nervous system (Bolender, 2004-2005). 

 

 

Tab 10 – Discussion: Click on <Read>.

Tab 11 – Resources, etc: Click on <Read>.

 

3.      Citation: The citation screen is one of several support screens displaying data from the Stereology Literature Database.  It can be used to find references using a variety of approaches.  For example, to view the effects of follicle stimulating hormone (FSH), the following query finds all the FSH papers stored in the database.

 

 

Click on the view data button to browse the result set one paper at a time. 

 

 

4.     Citation List: This table includes a comprehensive list of publications for biological stereology.  To run a search, type a word or citation number into a data entry field and press Enter.  To read an abstract or the full text of a paper online, click on the abstract button and follow the instructions that appear.

 

 

5.     Method: Use this screen to find all those papers using one or more methods.

 

 

 

6.     Find papers from data: Papers can be found from numerical data and their attributes.  Notice that individual screens are provided for control and experimental data.

 

 

 

7.     Catalogue of original data:  The data catalogue (Stereology Literature Database) included with the earlier releases of BIOLOGYtabs (Bolender, 2001-2005) now comes with a query by example interface.  This means that all the items in a database table – including both text and numerical data - can be searched at all levels of the hierarchy.  Both control (green) and experimental (blue) tables are included.  Select a screen by clicking on the name of the hierarchy level – listed at the left of the screen.

 

 

8.     Enter new universal data:  Pairs of data sharing similar references, wherein the reference cancels, can be entered into the Universal Biology Database.  For example, data recently acquired in your lab can be entered, assigned to a decimal repertoire equation, analyzed, and compared to data previously published.  Recall that to enter data, we begin by clicking on the add button, enter the data, and finally store the data in the database by clicking on the update button.

 

 

9.     Dictionary:  A collection of definitions offers help with the terminology.  Enter a word – or part thereof – and press the Enter key.

 

 


 

Discussion

 

By building and testing a Universal Biology Database, we begin the process of understanding how technology can produce a single source of information, one common to all biological disciplines and at the same time capable of minimizing methodological bias and animal variability (Bolender, 2004-2005).  Seamless integration of data across disciplines will become a critical asset as we begin to tackle the many problems of complexity associated with reverse engineering biology.  Challenges in the basic and clinical sciences – long considered intractable - will become increasingly manageable as we allow the fundamental design principles of biology to play a larger role in our experimental strategies.  The progress report encourages this approach by suggesting ways of applying reverse engineering techniques to uncover these principles.  

 

Technology can expand our options.  As we begin to describe diseases as stacks of equations, for example, we will have a new and effective way of monitoring their progression.  Moreover, equations derived empirically from the biology literature can serve as pointers to the molecules and genes involved in the onset of pathology and contribute a rigorous approach to evaluating both treatments and the recovery process.  Consider the impact of such a resource.  While following changes in the individual molecules of a few cells may be both interesting and worthwhile, only a broader knowledge of the changes in the relationships of many parts – large and small - can offer a meaningful approach to something as complicated as biology.

 

 

Reverse Engineering – A Research Model for Biology

 

As biology becomes a data-driven science, reverse engineering becomes a standard research model.  An experiment can be run at any hierarchical level, but the results will be automatically connected to previously published data – at all levels above and below.  To explain the underlying causes of a result, we can drill down and look at the behavior of the smaller parts.  Alternatively, to explain the broader consequences, we can look at the behavior of the larger parts in the higher levels.  In effect, our experimental results will seed networks of equations that can take us to wherever we wish to go.

 

Recall that reverse engineering is the process of analyzing a completed structure to identify its parts and their interrelationships.  In other words, it is the practice of figuring out how a product is made by taking it apart.  In general, however, we as biologists tend to focus our attention primarily on taking things apart, rather than on putting them back together.   

 

The Enterprise Biology Software Project is attempting to do both.  By taking biology apart mathematically with equations (reverse engineering), we can use these same equations to put everything back together (forward engineering).  For example, assembling the equations for the hippocampus last year demonstrated the feasibility of predicting the structure of the hippocampus – species by species (Bolender, 2005).  In fact, anyone can now do it.  Run the hippocampus software, enter a single seed value into a network of equations, and you can forward engineer a hippocampus.       
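The idea of forward engineering from a single seed value can be sketched as a walk through a network of linear equations.  The network below is entirely hypothetical: the structure names and coefficients are invented, and the code is not taken from the hippocampus software.  Each entry (X, Y, b) stands for one equation Y = bX.

```python
from collections import deque

# Hypothetical network; names and coefficients are invented for illustration.
EQUATIONS = [
    ("structure_A", "structure_B", 0.3),
    ("structure_B", "structure_C", 2.0),
    ("structure_A", "structure_D", 0.1),
]

def forward_engineer(seed_name, seed_value, equations):
    """Propagate one seed value through Y = b*X equations, in either direction."""
    neighbors = {}
    for x, y, b in equations:
        neighbors.setdefault(x, []).append((y, b))        # Y = b * X
        neighbors.setdefault(y, []).append((x, 1.0 / b))  # X = Y / b
    values = {seed_name: seed_value}
    queue = deque([seed_name])
    while queue:
        node = queue.popleft()
        for other, factor in neighbors.get(node, []):
            if other not in values:   # each structure is predicted once
                values[other] = values[node] * factor
                queue.append(other)
    return values

predicted = forward_engineer("structure_B", 30.0, EQUATIONS)
for name in sorted(predicted):
    print(f"{name}: {predicted[name]:.2f}")
```

Because every equation is linear, it can be read in both directions, so the seed can be any structure in the network.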

Reverse engineering biology makes sense as a research model because it solves many of the problems created by complexity.  By taking things apart and then reassembling them, we can observe the effects of unfolding and refolding.  The advantage of folding mathematically is that we can see when, where, and how complexity forms.  Moreover, this strategy works consistently at every level of complexity.  Recall that all the control and experimental data could be summarized by just two exponential equations, which were assembled from power equations (Bolender, 2003-2004).  In turn, these power equations became linear when expressed as decimal repertoire equations (Bolender, 2005).  In this case, unfolding a global complexity consisted of going from exponential to power to linear – from complex to simple.

A practical example may be helpful.  Let’s see how we can turn a seemingly hard question into an easy one.  For example, “How do we reverse engineer a disease?”  Go to tab 7 of the Universal Biology Database, enter <like %schizo%>, click on the Retrieve button, and a stack of 58 equations appears (in 855 rows).  The equations describe the schizophrenia phenotype(s) mathematically.  The stack contains information about the relationships of the parts, their proportions, and where and when they can change.  By comparing the disease phenotype to its control, we can begin to figure out what went wrong – all across the biological hierarchy of size.  More importantly, such detailed information may eventually allow us to diagnose the onset of a disease at a time when intervention is the most effective.
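Comparing a disease phenotype to its control amounts to comparing two stacks of equations.  A minimal sketch, assuming each stack is stored as a mapping from a data pair (X, Y) to the set of decimal steps observed for it, might look like this.  The data structure, the names, and the values are all hypothetical.

```python
def compare_stacks(control, disease):
    """Each argument maps (x_name, y_name) -> set of decimal steps.
    Returns, per data pair, the steps that are shared or unique to one stack."""
    report = {}
    for pair in sorted(set(control) | set(disease)):
        c = control.get(pair, set())
        d = disease.get(pair, set())
        report[pair] = {
            "shared": sorted(c & d),
            "control_only": sorted(c - d),
            "disease_only": sorted(d - c),
        }
    return report

# Invented stacks for illustration
control = {("structure_A", "structure_B"): {0.5, 1.0}, ("structure_A", "structure_C"): {0.3}}
disease = {("structure_A", "structure_B"): {0.5, 0.7}}

report = compare_stacks(control, disease)
for pair, diff in report.items():
    print(pair, diff)
```

Steps that appear only in the disease stack point to proportions that have shifted and are natural places to start looking for what went wrong.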

Thus far, the reverse engineering effort has progressed only as far as organelles and a few molecules.  We still have to populate the molecular level of the database before we can begin to interact with the genes.  Fortunately, molecular biologists are already actively engaged in reverse engineering projects and the Universal Biology Database may at some point become helpful to them (see the report of the NYAS DREAM Project at http://www.nyas.org/dream).  There is, however, a small problem.  Molecular biologists tend to work with cells in vitro and therefore these phenotypes may or may not resemble their counterparts in vivo.  In fact, we may discover that the results of biological stereology and molecular biology routinely experience different realities.  This creates an interesting opportunity.  Since the data of biological stereology can serve as a gold standard in both in vivo and in vitro settings, it may offer a natural bridge between molecular biology and many other research disciplines.         

 

 

Reverse Engineering – Linear vs. Nonlinear

 

When looking for patterns in biological data with regression analysis, a common finding is a power function (y = bx^a).  Indeed this observation supports the widely held view that biological systems are largely nonlinear.  It would therefore seem helpful to understand how a biological system becomes nonlinear, because in doing so the system becomes far more complex.  What, for example, does biology gain from being nonlinear?

 

Let’s begin by looking at a nonlinear connection between two biological parts.  The figure below identifies the relationship between Golgi (X) and mitochondria (Y).  As expected, the relationship is nonlinear as shown by a power curve (Y = 2.9965X^0.8305) with an R^2 = 0.8214.  To be linear, the exponent a would have to be one, not 0.8305.  Note that this regression curve was generated with the Excel spreadsheet – only for the purpose of illustration.  When the exponent a does not approach 1.0, a statistical method other than the one used in Excel should be considered.
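
For readers who want to reproduce this kind of fit, a minimal sketch follows.  It assumes the usual spreadsheet approach of fitting a straight line to log-transformed data; the function name fit_power and the sample points are illustrative, and the points are generated from the curve quoted above rather than taken from the database.

```python
import math


def fit_power(xs, ys):
    """Fit Y = b * X**a by ordinary least squares on log10(X), log10(Y).

    Returns (b, a, r2); r2 is computed in log-log space.
    """
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    syy = sum((v - my) ** 2 for v in ly)
    a = sxy / sxx                    # slope of the log-log line = exponent
    b = 10.0 ** (my - a * mx)        # intercept of the log-log line = log10(b)
    r2 = (sxy * sxy) / (sxx * syy)   # squared correlation of the log values
    return b, a, r2


# Noise-free check: points generated from Y = 2.9965 * X**0.8305.
xs = [1.0, 2.0, 5.0, 10.0, 50.0]
ys = [2.9965 * x ** 0.8305 for x in xs]
b, a, r2 = fit_power(xs, ys)
print(b, a, r2)
```

With real, scattered data the recovered exponent and R^2 will of course differ; the test for linearity is simply whether a lies close to 1.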

 

 

We now know that such nonlinear relationships often occur because of the inherent complexity of our data set.  When unfolded, this nonlinear equation can be explained in terms of an underlying simplicity consisting of several linear-like equations.  The figure below illustrates this point with a few examples of unfolding the organelle data into repertoire equations.  Notice that the exponent a is now very close to 1.0 – the condition for linearity.

 

 

Although the R^2 values of the four equations illustrated above approach 1, the remaining equations might not because the samples are small and unrepresentative.  To mitigate this sampling deficiency, we can upgrade these repertoire equations to decimal repertoire equations, which draw their support from a larger data set.  This solution appears in Table 4.  Now all the exponents a become effectively equal to one, which means that the proportion of structures Y to X is linear – as defined by the decimal repertoire equations.  Why is this important?  A linear interpretation (Y = bX) of paired data sets in biology becomes a highly desirable feature when we wish to generate reliable and cost-effective software for two of the most useful products of reverse engineering - diagnosis and prediction.

 

Notice in Table 4 that it actually takes twenty-seven equations to explain the complexity of the relationship of Golgi to mitochondria – not just one.   In other words, one nonlinear equation does an excellent job of obscuring twenty-seven linear equations. 
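
The way a single nonlinear curve can hide several linear ones can be shown with a small synthetic sketch.  The three families below are invented for the illustration (they are not database values): each is exactly linear, Y = bX, but each has its own slope b and its own range of X.

```python
import math


def exponent(xs, ys):
    """Slope of the least-squares line through (log10 X, log10 Y)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    sxx = sum((u - mx) ** 2 for u in lx)
    return sxy / sxx


# Synthetic families keyed by slope b; values are X. Illustration only.
families = {
    10.0: [1.0, 2.0, 3.0],
    3.0: [10.0, 20.0, 30.0],
    1.0: [100.0, 200.0, 300.0],
}

# Fitted separately, every family has exponent 1 (it is linear).
per_family = {b: exponent(xs, [b * x for x in xs]) for b, xs in families.items()}

# Pooled into one data set, the same points yield an exponent well below 1.
x_all = [x for xs in families.values() for x in xs]
y_all = [b * x for b, xs in families.items() for x in xs]
pooled = exponent(x_all, y_all)

print(per_family, pooled)
```

Under these assumptions each family returns an exponent of 1, while the pooled fit returns roughly 0.53.  The apparent nonlinearity comes entirely from mixing families with different proportions.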

 

Now, let’s put Table 4 to work.  Why, for example, do Golgi and mitochondria seem to favor the proportions 2 to 5 (equation 2.5) and 1 to 20 (equation 20)?  What is the underlying genetic mechanism of such favoritism?  Can we design an experiment to answer this question?  To get a proportion of 2 to 5, for example, the expression of the genes and gene products defining the Golgi and mitochondria must be guided by a 2 to 5 rule.  Is the rule simply hard coded in our genes or is it the product of several factors?  We already know from the database where and when the 2 to 5 rule is being applied and can look up the molecular composition of these organelles in the biochemistry literature.  What we need is an experimental approach that can assay these molecules and relate their appearance to specific genes and gene products.  In turn, this might tell us something about where the rules come from or how, when, and where they are being applied.  If we can answer this question, then we might be on our way to understanding why similar genomes can produce such different species.

Table 4 Decimal repertoire equations for the relationship of Golgi to mitochondria.

Structure X  Structure Y    DR Equation No.  X:Y    Frequency  DR Equation          R^2
-----------  -------------  ---------------  -----  ---------  ------------------   ------
Golgi        Mitochondrion  0.25             4:1    2          Y = 0.27299X^1.000   0.9999
Golgi        Mitochondrion  0.3              3:1    4          Y = 0.34618X^1.000   0.9999
Golgi        Mitochondrion  0.5              2:1    3          Y = 0.54542X^1.000   0.9999
Golgi        Mitochondrion  0.6              5:3    2          Y = 0.64920X^0.999   0.9999
Golgi        Mitochondrion  0.8              5:4    7          Y = 0.84873X^1.000   0.9999
Golgi        Mitochondrion  0.9              9:8    2          Y = 0.94680X^1.000   0.9999
Golgi        Mitochondrion  1.0              1:1    8          Y = 1.19840X^0.999   0.9999
Golgi        Mitochondrion  1.5              2:3    7          Y = 1.72417X^0.999   0.9999
Golgi        Mitochondrion  2.0              1:2    9          Y = 2.22114X^1.000   0.9999
Golgi        Mitochondrion  2.5              2:5    14         Y = 2.72862X^1.000   0.9999
Golgi        Mitochondrion  3.0              1:3    4          Y = 3.44783X^1.000   0.9999
Golgi        Mitochondrion  4.0              1:4    6          Y = 4.43500X^1.000   0.9999
Golgi        Mitochondrion  5.0              1:5    4          Y = 5.43461X^1.000   0.9999
Golgi        Mitochondrion  6.0              1:6    4          Y = 6.48703X^1.000   0.9999
Golgi        Mitochondrion  7.0              1:7    1          Y = 7.45264X^1.000   0.9999
Golgi        Mitochondrion  8.0              1:8    2          Y = 8.47694X^1.000   0.9999
Golgi        Mitochondrion  9.0              1:9    3          Y = 9.46970X^0.999   0.9999
Golgi        Mitochondrion  10.0             1:10   6          Y = 12.1598X^0.999   0.9999
Golgi        Mitochondrion  15.0             1:15   6          Y = 17.1766X^0.999   0.9999
Golgi        Mitochondrion  20               1:20   14         Y = 24.2499X^0.999   0.9999
Golgi        Mitochondrion  30               1:30   7          Y = 34.5613X^1.000   0.9999
Golgi        Mitochondrion  40               1:40   3          Y = 44.4335X^1.000   0.9999
Golgi        Mitochondrion  50               1:50   1          Y = 54.6769X^0.999   0.9999
Golgi        Mitochondrion  60               1:60   2          Y = 63.7480X^0.999   0.9999
Golgi        Mitochondrion  80               1:80   1          Y = 84.0673X^0.999   0.9999
Golgi        Mitochondrion  100              1:100  2          Y = 121.991X^0.999   0.9999
Golgi        Mitochondrion  200              1:200  1          Y = 244.096X^1.003   0.9999
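
To make the prediction use of these equations concrete, here is a minimal sketch that evaluates two rows of Table 4.  The coefficients are copied from the table; the dictionary, the function name predict_y, and the input value X = 4.0 are illustrative assumptions, with X and Y in whatever units the paired data share.

```python
# (b, a) pairs copied from Table 4, keyed by decimal repertoire equation number.
DR_EQUATIONS = {
    2.5: (2.72862, 1.000),
    20.0: (24.2499, 0.999),
}


def predict_y(equation_number, x):
    """Predict Y (mitochondria) from X (Golgi) using Y = b * X**a."""
    b, a = DR_EQUATIONS[equation_number]
    return b * x ** a


x = 4.0                        # illustrative input, not a measured value
y_linear = predict_y(2.5, x)   # exponent is 1.000, so Y is simply b * X
ratio = y_linear / x           # for a linear equation the ratio Y/X equals b
print(y_linear, ratio)
```

Because the exponent is effectively one, the ratio Y/X stays constant for any X, which is what makes a linear equation convenient for diagnosis and prediction.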

 

 

Why is biology so complicated?  We have just seen that a standard laboratory approach to analyzing research data can turn simple events (linear) into complex ones (nonlinear).  When we elect to avoid complexity by ignoring the underlying order, we may be left with something far more complex – a largely nonlinear and “complicated” biology.  There is an alternative.  We can treat nonlinearity in biology as one of the many symptoms of complexity and attempt to unfold it into well-ordered and more convenient families of linear equations.        

 

Are there examples in the stereology literature wherein linear equations alternate with nonlinear ones?  Yes.  In the developing kidney, for example, such a pattern exists (Bertram et al., 2000); see Table 5.  Growth can be characterized as a stack of equations showing an increase in kidney size.  Nonlinear equations – alternating with linear equations – indicate that the parts are being produced in a coordinated way (R^2 values close to one), but at different rates.  This means that moving up the growth stack – over time – involves transition states (nonlinearity) wherein the new parts are being built and assembled.  When a given growth step is completed, however, the curve becomes linear and parallel to previous steps (Bolender, 2001-2005).

 

Table 5 The developing kidney – alternating linear (exponent a close to one) and nonlinear equations (exponent a not close to one); after Bertram et al., 2000.

Structure X  Structure Y  Equation   Power Equation         R^2
-----------  -----------  ---------  --------------------   ------
Dev day 17   Dev day 18   Nonlinear  Y = 1.74620X^1.0966    0.9996
Dev day 18   Dev day 19   Nonlinear  Y = 1.79790X^0.9619    0.9999
Dev day 19   Dev day 20   Linear     Y = 1.782600X^1.0031   0.9996
Dev day 20   Dev day 21   Linear     Y = 2.174000X^1.003    0.9999
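
The linear and nonlinear labels in Table 5 can be reproduced with a simple tolerance on the exponent.  The sketch below is one way to do it; the cutoff of 0.01 is an assumption chosen for the illustration, since no explicit threshold is stated in this report.

```python
# Exponents a copied from Table 5 (after Bertram et al., 2000).
TABLE_5 = [
    ("Dev day 17", "Dev day 18", 1.0966),
    ("Dev day 18", "Dev day 19", 0.9619),
    ("Dev day 19", "Dev day 20", 1.0031),
    ("Dev day 20", "Dev day 21", 1.003),
]


def classify(a, tol=0.01):
    """Label an exponent 'Linear' if |a - 1| <= tol, else 'Nonlinear'.

    tol is an assumed cutoff for this sketch, not a value from the report.
    """
    return "Linear" if abs(a - 1.0) <= tol else "Nonlinear"


labels = [classify(a) for _, _, a in TABLE_5]
for (x, y, a), label in zip(TABLE_5, labels):
    print(x, "->", y, a, label)
```

A tighter or looser tolerance would move the boundary, so the cutoff should be chosen with the precision of the underlying data in mind.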

 

 

Reverse Engineering – Keeping Score

 

Reverse engineering biology requires large amounts of standardized data that can be integrated across the entire biological hierarchy of size - seamlessly.  By assembling a table of minimum requirements (Table 6), we can put the magnitude of the undertaking into perspective.  The table suggests that biological stereology can meet all the minimum requirements needed to reverse engineer biology from organisms to cell organelles – as the contents of the database allow (Bolender, 2001-2006).  In contrast, biochemistry and molecular biology appear to meet only some of these requirements.  This should not come as a surprise in that reverse engineering biology is largely a structural exercise. 

 

We have, however, at least two options.  Either we can add the experimental data of biochemistry and molecular biology to the Universal Biology Database and let the hierarchical order of biological stereology integrate these data or we can design new methods that mathematically combine the data of several disciplines (e.g., Counting Molecules; Bolender, 2005).  Both approaches should work. 

 

Given my reading of the literature, the task of incorporating molecular data into the Universal Biology Database appears largely a function of the experimental methods.  In general, molecular methods tend to be disruptive in that they diminish or eliminate structural information.  Homogenization, fractionation, isolation, purification, digestion, PCR analysis, immunoassay, western blot, northern blot, immunoblot, microarray analysis, and enzymology all forfeit structural order in exchange for access.  In addition, these methods can have notable limitations - not the least of which includes multiple sources of bias (Fluck, et al., 2005).  Indeed, these limitations may explain why molecular biologists routinely experience difficulty in dealing with biological complexity – both locally and globally (e.g., see the O’Reilly Network: Interview with Dr. Leroy Hood; search on <leroy hood complexity>).  

 

Table 6 Minimum requirements for reverse engineering biology using published data; a preliminary assessment.   Can we fill in the missing dots?

Columns: Requirements for Reverse Engineering; Stereology; Biochemistry; Molecular Biology.  [The dots marking which discipline meets each requirement are not reproduced here.]

In Vivo Data
     Concentration Data
     Average Cell Data
     Absolute Data
     Cell Counts
     Molecule Counts
     Minimize Bias
     Minimize Animal Variability
     Detect Change Unambiguously
     Design Experiments as Equations
     Enforce Unbiased Sampling
     Apply Biological Rules
     Convert 2D Data back to 3D
     Standardize Data
     Generate Biological Blueprints

In Vitro Data
     Concentration Data
     Average Cell Data
     Absolute Data
     Cell Counts
     Molecule Counts
     Minimize Bias
     Minimize Animal & Cell Variability
     Detect Change Unambiguously
     Design Experiments as Equations
     Enforce Unbiased Sampling
     Apply Biological Rules
     Convert 2D Data back to 3D
     Standardize Data
     Generate Biological Blueprints

 

Consider three striking issues.  Point one.  Biochemistry and molecular biology both generate large amounts of in vitro data aimed at exploring the behavior of molecules located in in vitro phenotypes.  Here the problem becomes one of knowing when information gained in an in vitro setting faithfully reflects that of the in vivo setting.  Point two.  They both depend heavily on purification techniques.  Unless strict analytical approaches are followed (de Duve, 1974), demonstrating unbiased sampling becomes problematic.  Point three.  A given molecule may generate a troublesome complexity by appearing at several different intracellular locations, in several different cell types, at different times, and in variable amounts.  In such cases, the structural location(s) of the molecule become(s) the critical factor.   

 

How can we solve this engineering problem for biochemistry and molecular biology?  For the in vivo data of molecular biology, the solution may be quite simple.  Use the Cavalieri method to get an unbiased estimate of the volume of the structure.  Next, apply an unbiased sampling method to estimate the concentration (optical density) of an immunocytochemical label and then relate this concentration to the volume of the structure to get an absolute value (see Counting Molecules; Progress Report 2005).  Such an approach will be far more successful in detecting a molecular change reliably than the risky practice of just comparing raw optical densities.  For the remaining data types, the best current solution would seem to consist of forming data ratios – provided the reference variables cancel. 
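
The arithmetic behind this suggestion can be sketched as follows.  The example assumes the point-counting form of the Cavalieri estimator, V = T x a(p) x (sum of P), and treats the mean optical density as a concentration-like reading in arbitrary units.  All numbers are illustrative, not measured data, and the function names are hypothetical.

```python
def cavalieri_volume(point_counts, area_per_point, section_spacing):
    """Cavalieri estimate V = T * a(p) * sum(P_i).

    point_counts    -- grid points hitting the structure on each sampled section
    area_per_point  -- area associated with one grid point, a(p)
    section_spacing -- distance T between consecutive sampled sections
    """
    return section_spacing * area_per_point * sum(point_counts)


def absolute_label(mean_optical_density, volume):
    """Scale a concentration-like reading (optical density) by the volume."""
    return mean_optical_density * volume


# Illustrative numbers only: spacing in mm, area per point in mm^2.
counts = [12, 30, 41, 28, 9]
volume = cavalieri_volume(counts, area_per_point=0.04, section_spacing=0.5)

control = absolute_label(0.40, volume)
# Same optical density, but the structure has doubled in volume.
treated = absolute_label(0.40, 2 * volume)
print(volume, control, treated)
```

In this sketch the raw optical densities are identical, yet the absolute amount of label has doubled.  A comparison of raw optical densities alone would miss that change.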

 

 

Concluding Comments

 

By obeying the laws of nature, biology demonstrates the presence of a mathematical core.  The challenge for us is to gain access to that core.  Why?  It will allow us to solve problems currently beyond our reach.  How do we gain access?  Reverse engineering seems to be the most direct approach.  Here the challenge is to create a large universal database from the biology literature that we can use to extract both local and global information.  

 

The Universal Biology Database described in this report offers a new research technology.  It translates published research data into equations that can become the building blocks of our understanding.  These equations allow us to unravel complexity and to pursue a well-defined strategy of reverse and forward engineering.  The connection repertoire table, which summarizes the decimal repertoire equations, offers local and global views of how biology is constructed by rule.  It reveals – in considerable detail - the complexity of biology’s structural blueprint and may provide the clues needed to capture the intrinsic order of molecules and genes.     

In biology, everything everywhere is connected by rule.  In experimental biology, everything everywhere can be connected by rule.  Knowing this becomes our ticket to the future.

   


 

References

 

Bertram, J. F., Young, R. J., Spencer, K., and I. Gordon. 2000 Quantitative analysis of the developing rat kidney: absolute and relative volumes and growth curves.  Anat Rec 258: 128-35.

 

Bolender, R. P. 2001a Enterprise Biology Software I. Research (2001) In: Enterprise Biology Software, Version 1.0 © 2001 Robert P. Bolender

 

Bolender, R. P. 2002 Enterprise Biology Software III. Research (2002) In: Enterprise Biology Software, Version 2.0 © 2002 Robert P. Bolender

 

Bolender, R. P. 2003 Enterprise Biology Software IV. Research (2003) In: Enterprise Biology Software, Version 3.0 © 2003 Robert P. Bolender

 

Bolender, R. P. 2004 Enterprise Biology Software V. Research (2004) In: Enterprise Biology Software, Version 4.0 © 2004 Robert P. Bolender

 

Bolender, R. P. 2005 Enterprise Biology Software VI. Research (2005) In: Enterprise Biology Software, Version 5.0 © 2005 Robert P. Bolender

 

De Duve, C. 1974 Nobel Lecture: Exploring cells with a centrifuge.  From Nobel Lectures, Physiology or Medicine 1971-1980, Editor Jan Lindsten, World Publishing Co., Singapore, 1992.

 

Fluck, M., Dapp, C., Schmutz, S., Wit, E., and H. Hoppeler. 2005 Transcriptional profiling of tissue plasticity: role of shifts in gene expression and technical limitations.  J Appl Physiol 99: 397-413.

 

Seecharan, D. J., Kulkarni, A. L., Lu, L., Rosen, G. D., and R. W. Williams. 2003 Genetic control of interconnected neuronal populations in the mouse primary visual system.  J Neurosci 23: 11178-88.