In: Enterprise Biology Software, Version 3.0
Enterprise Biology Software: IV. Research (2003)
Robert P. Bolender
Summary
Among the many challenges facing research biology today, we can identify at least three that promise to accelerate discovery in the near future. These include linking diverse information across a biological hierarchy of size, identifying underlying principles of biology, and transforming biology into an information science. Using the stereology literature as a well-spring, the Enterprise Biology Software Project actively encourages investigators pursuing such challenges by making new information and technologies freely available. The process is simple and direct. First authors of papers applying stereology, as listed each year in PubMed (National Library of Medicine), are invited to submit reprints as candidate papers for the stereology literature database. Once entered into the database, these research data can be accessed as is or used to generate libraries, equations, patterns, platforms, or whatever one might wish. At the beginning of the year, the updated databases, libraries, and software tools are distributed to contributing authors – past and present - on a CD. This release includes the updated stereology literature database (research data through 2002), new libraries (design code, ladder equation), a progress report, and several unexpected findings.
Introduction
Background
What have we learned so far? We now know it is possible to standardize most types of biological data with a relational database, using a mathematical organization based on stereology. In effect, by creating a production database for the literature of biological stereology, we have demonstrated a core facility for research biology and tested the feasibility of producing and distributing an electronic literature (Bolender, 2001a, 2001b, 2002). However, a central challenge of the database exercise was to address – aggressively – the very real problem of complexity in biology. To begin, the problem of understanding complexity was divided into three tasks: (1) organizing published research data by combining three models (qualitative, quantitative, and relational), (2) unfolding complexity into elements by identifying distinct sources (biology, methods), and (3) recombining the elements to look for patterns and underlying principles.
Using this approach, we have seen that our experimental methods of extracting data from biology introduce an uncertainty principle in that all or most of our stereological estimates carry an unknown bias. In turn, we considered the practical implications of bias and showed how to minimize its effects (Bolender, 2002). The exercise using biological algorithms to predict data to and from the genome – from a single seed value – made the indelible point that a believable diagnosis and prediction system would have to be based on equations displaying coefficients of determination equal to one ($R^2 = 1$). This observation was particularly important because it provided the incentive for assembling two new libraries that offer opportunities well beyond those of diagnosis and prediction. For example, we can now look at change from a higher dimension and begin to understand key elements of its complexity. This recent progress with structural data establishes guidelines for the far more difficult task of connecting the data of stereology with those of biochemistry and molecular biology.
Progress
The Enterprise Biology Software Project follows a data-driven route to discovery. Data are taken from research articles, stored in a relational database, and standardized. In turn, these original data are used to generate derived data. The two main products of the stereology literature database thus far are (1) a collection of data libraries and (2) the findings generated by these libraries. The discovery strategy is simple: find order and follow it.
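The step from original to derived data can be illustrated with a small sketch. It is not the project's actual schema: the table name `estimates`, its columns, and the three rows are hypothetical, and percentage change is assumed here to be measured relative to the control value.

```python
import sqlite3

# Hypothetical, in-memory stand-in for the literature database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE estimates ("
    " citation INTEGER, structure TEXT, control REAL, experimental REAL)"
)

# Invented original data: one control and one experimental value per structure.
rows = [
    (1, "mitochondrion volume", 2.0, 3.0),
    (1, "nucleus volume", 4.0, 4.0),
    (2, "lipid droplet volume", 5.0, 2.5),
]
conn.executemany("INSERT INTO estimates VALUES (?, ?, ?, ?)", rows)

# Derived data: percentage change, assumed to be relative to the control.
derived = conn.execute(
    "SELECT citation, structure,"
    " 100.0 * (experimental - control) / control AS pct_change"
    " FROM estimates ORDER BY citation, structure"
).fetchall()

for citation, structure, pct_change in derived:
    print(citation, structure, pct_change)
```

Once the original values sit in one table, a derived library of this kind is a single query away, which is the sense in which the database serves as a resource for derived data.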
Libraries: Libraries serve as discovery platforms (Table 1). They include one or more user interface screens, data, help files, and often worked examples (e.g., Excel scratch sheets; case studies).
Table 1. Enterprise Biology Software Libraries.
| Library | Data | Entries | Applications |
| --- | --- | --- | --- |
| Standardized Stereology Literature | | | |
| · Citation – search | original | 12,853 | Find references |
| · Citation – by paper – contl | original | 1,024 | Print paper – contl data |
| · Citation – by paper – contl + exptl | original | 6,438 | Print paper – contl + exptl data |
| · Methods – search SQL script | original | 1,951 | Find papers by methods |
| · Control Data | original | 14,290 | |
| · Experimental Data | original | 9,386 | |
| · Contl data – by data point | original | 11,051 | Find data by data point; level |
| · Contl+Exptl data – by data point | original | 6,438 | Find data by data point; level |
| · Percentage change data | derived | 7,018 | Find data by change; level |
| · Phenotype data | original | 7,018 | Find data across 14 levels |
| Connection Map | | | |
| · Type 1 (2str/2+points/1level/1paper) | derived | 182 | Find connections/minimize bias |
| · Type 2 (2+str/2+points/1level/1paper) | derived | 81 | Find connections/minimize bias |
| · Type 3 (2+str/2+points/1+levels/1paper) | derived | 323 | Find connections/minimize bias |
| · Type 4 (data pairs) | derived | 21,035 | Find connections/minimize bias |
| Data Replicator | | | |
| · One from one (data from 1 paper) | derived | 702 | Predict data |
| · Many from one (data from 1+ papers) | derived | 27 | Predict data |
| Biological Algorithm | | | |
| · Connections upstream and down | derived | 458 | Predict organs and organisms |
| Design Code | | | |
| · Local (data from 1 paper) | derived | 880 | Identify and predict change |
| · Global (data from 1+ papers) | derived | 58 | Identify and predict change |
| Ladder Equation | | | |
| · Total data pairs | derived | 24 | Generalize structure in biology |
| · Organ | derived | 19 | Generalize structure by organ |
| · Cell | derived | 19 | Generalize structure by cell |
| · Organelle | derived | 22 | Generalize structure by organelle |

Figure 1 indicates that the stereology literature database currently includes 55,000 data entries, of which more than half represent derived data. This resource offers abundant opportunities for finding connections between and among the many parts that define biology.
Figure 1 indicates that the stereology literature database currently includes 55,000 data entries, of which more than half represent derived data. This resource offers abundant opportunities for finding connections between and among the many parts that define biology.

Figure 1. Data in the stereology literature database.
Results: The principal findings of the project are listed below.
· Biological data can be transferred from research papers to a relational database and standardized.
· The production database demonstrates the feasibility of creating an electronic literature for the life sciences.
· When stored in a database, published research data serve as a key resource for producing derived data.
· Since biological data are subject to an uncertainty principle, they carry an unknown experimental bias.
· Libraries can be designed that minimize bias (data pairs, design codes).
· Structures in biology are connected by rule (connection model).
· Algorithms can generate organs and organisms from a single seed value.
· Sources of complexity in research data can be identified by viewing data from a higher dimension.
· Relationships of structure to function can be expressed mathematically.
· Change in biology can be generalized and predicted.
· Twenty thousand connections between structures in biology can be summarized by a single exponential equation.
Design Codes
To apply information technologies to biology, we need to understand how biology manages information. We know that DNA stores information in genes that can be translated with the help of RNA into protein molecules, etc. If we imagine that this process of distributing information continues in an orderly way, gaining richness and complexity in forming all the many parts of a living system, then it seems likely that all the steps of the process are connected. In other words, the information and its expression across the biological hierarchy must be nested hierarchically - according to well defined design principles.
To simplify the task of identifying and quantifying connections in biology, we can design a new library consisting of design codes. A design code can be defined as an equation – or a set of equations – that represents rules for connecting the parts of a structure. Moreover, we can assume – for convenience – that design codes are nested hierarchically everywhere – from molecules to organisms. From this definition it follows that a given design code is part of a larger code, while at the same time it contains many embedded codes.
How can we use design codes? They allow us to observe – in greater detail - the behavior of change in living systems. In addition to supplying local information (qualitative; quantitative), design codes also identify global patterns of change that appear when several codes are combined across publications and animals (complex codes). The design code library even supplies a paradoxical view of change – one in which change becomes a “constant.” In effect, design codes suggest that change operates by rule and behaves in a predictable way.
Last year (Bolender, 2002), we observed that predicting structure and function from a single seed variable required a set of equations with $R^2 = 1$. In the real world, of course, such a requirement becomes impractical. If, however, we relax the requirement only very slightly to $R^2 \geq 0.999$, then we can extract many design codes from the stereology literature. Here the strategy consisted of removing outliers from a data set until $R^2 \geq 0.999$. In this case, an outlier was defined as a point that did not fall on or next to a regression curve. Details of this harvesting process can be seen in the Excel scratch files that are included with the software upgrade.
Ladder Equations
When exploring biology as an information science, one strategy to follow consists of finding order and then tracking the order to its source. Ladder equations serve as another example of this process. If we start with the 20,000 data pairs in the literature database, form ratios (structure y/structure x), sort the ratios (ascending), and collect sets of ratios that give power curves with an $R^2 = 0.999$, we can generate a set of 24 equations describing the 20,000 data pairs. Since the slopes (a) of these power curves tend to be close to one, the y intercept (b) of each equation can serve to identify a unit of order. In turn, when the y intercepts are plotted – as if they were rungs on a ladder – we get a single exponential equation of the form $y = e^{xa}$ – the ladder equation. Repeat this process, but restrict it to organs, cells, or organelles and we discover additional ladder equations. In effect, order in biology appears as equations embedded in equations. This observation will be of interest to us shortly when we turn our attention to making speculations about how genes might be operating.
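The last step, turning a list of rung intercepts into one exponential, can be sketched as a straight-line fit to the logarithms of the intercepts. This is an illustration under stated assumptions rather than the worksheet procedure itself. The exponential is assumed to carry a prefactor (called `c` here, a name of our own), as the fitted ladder equation reported later in the text does. The 24 intercepts are invented so that each rung is exactly twice the one below.

```python
import math

def fit_exponential(xs, ys):
    """Least-squares fit of ln(y) = ln(c) + a*x; returns (c, a, r2)."""
    logs = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_l = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sll = sum((l - mean_l) ** 2 for l in logs)
    sxl = sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, logs))
    a = sxl / sxx                      # exponent (growth per rung)
    ln_c = mean_l - a * mean_x         # log of the prefactor
    r2 = (sxl * sxl) / (sxx * sll)     # R^2 of the log-linear fit
    return math.exp(ln_c), a, r2

# Hypothetical rung intercepts: rung k is twice rung k-1.
rungs = list(range(1, 25))
intercepts = [0.001 * 2 ** (k - 1) for k in rungs]

c, a, r2 = fit_exponential(rungs, intercepts)
print(c, a, r2)
```

For these invented intercepts the fitted exponent equals ln 2, because doubling per rung is exactly what an exponent of ln 2 encodes.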
Methods and Results
Enterprise Biology Software (2003)
The Enterprise Biology Software package for 2003 updates the stereology literature database through 2002, adds the design code and ladder equation libraries, upgrades applications, and continues to explore complexity.
Stereology Literature Database
Database Update: This year, data from about 400 publications were added to the literature database.
Relaxing the Data Entry Rule: Since the stereology literature database is based on a change model, only those papers meeting the requirements of a critical data set were selected for data entry (Bolender, 2001a). Derived data, however, allow us to change the rules. By upgrading the literature database from a change to a connection model, papers reporting only density or mean cell data can now be added to the database and used selectively to hunt for patterns. This means that a much larger proportion of the stereology literature can now be used to generate derived data.
Relaxing the $R^2 = 1$ Requirement: In an ideal world, the equations of prediction models would have $R^2$ values equal to one. If – in our world – we relax this requirement by only one tenth of one percent, then the stereology literature yields many design code equations. The question, of course, remains the same. How good are the equations at predicting change? If 100% represents a perfect outcome, then the observed mean score of 100.4% missed the goal by less than one half of one percent (N.B., the standard deviation was 6.6 and the number of samples 880). The results would be slightly better if data entry had been strictly limited to equations with $R^2 \geq 0.999$. Several entries with $R^2$ values of only 0.99 were allowed for purposes of illustration. In any case, the results suggest that change can be predicted – remarkably well – with the design code equations.
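One way such a score could be computed is sketched below. The text does not spell out the scoring formula, so the definition used here (predicted value as a percentage of the observed value) is an assumption, and the design code coefficients and data points are invented.

```python
import statistics

def predict(x, a, b):
    """Predict Y from X with a design code of the form Y = b * X**a."""
    return b * x ** a

def prediction_scores(pairs, a, b):
    """Assumed score: predicted value as a percentage of the observed value."""
    return [100.0 * predict(x, a, b) / y for x, y in pairs]

# Hypothetical design code (a, b) and hypothetical observed (x, y) pairs.
a, b = 1.0, 2.0
pairs = [(1.0, 2.0), (2.0, 4.1), (4.0, 7.9)]

scores = prediction_scores(pairs, a, b)
mean_score = statistics.mean(scores)
sd_score = statistics.stdev(scores)
print(scores, mean_score, sd_score)
```

Under this definition a score of 100 means the equation reproduced the observed value exactly, so a mean close to 100 with a small standard deviation corresponds to the kind of outcome reported above.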
Libraries
All the previous libraries were updated to include the newly entered data and two new libraries were added (design code and ladder).
New Strategy for Searching Libraries: The data pair and design code libraries offer ready access to equations with $R^2 \geq 0.999$. Recall that they:
· Detect a connection between two structures, two functions, or a structure and a function – by one or more papers.
· Detect connections among structures, functions, and structures and functions – by one or more papers.
· Compare control data across several papers.
· Generate equations for predicting structure and function.
· Detect change quantitatively and qualitatively as connected sets - by one or more papers.
· Identify patterns of change.
· Generate equations for predicting change in structure and function.
To simplify their use, both data pair and design code libraries share a similar interface and method for generating equations with $R^2 \geq 0.999$ (Figure 2).

Figure 2: Examples of screens used for selecting data to be analyzed in Excel worksheets.
The procedure is straightforward, consisting of the following steps.
1. Select a structure x and all related structures y.
2. Calculate the ratio of data values (y/x).
3. Sort the y/x ratios (ascending).
4. Send the contents of the screen to an Excel worksheet.
5. Plot the x and y data as a regression (power) equation and calculate the $R^2$.
6. Change the number of rows until $R^2 \geq 0.999$, removing outliers as needed.
7. Record the results.
Examples of searches and calculations can be called from the viewing screens (7.2, 7.3, 8.1, 8.2) in the BIOLOGYtabs 2003 program.
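For readers who prefer a script to a worksheet, steps 2 to 6 can be approximated as follows. This is a minimal sketch, not the procedure used in the Excel scratch sheets. It assumes the power curve is fitted as a straight line in log-log space, and it swaps the manual choice of rows for an automatic rule that drops the point with the largest log residual. The function names and the sample data are hypothetical.

```python
import math

def fit_power(xs, ys):
    """Fit Y = b * X**a by linear regression on logs; returns (a, b, r2)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    syy = sum((v - my) ** 2 for v in ly)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    a = sxy / sxx
    b = math.exp(my - a * mx)
    r2 = 1.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return a, b, r2

def harvest(pairs, threshold=0.999, min_points=3):
    """Sort pairs by y/x, then drop the worst-fitting point until R^2 >= threshold."""
    rows = sorted(pairs, key=lambda p: p[1] / p[0])   # steps 2 and 3
    while len(rows) >= min_points:
        a, b, r2 = fit_power([x for x, _ in rows], [y for _, y in rows])  # step 5
        if r2 >= threshold:
            return a, b, r2, rows
        # Step 6 (assumed rule): remove the point farthest from the fitted curve.
        worst = max(
            rows,
            key=lambda p: abs(math.log(p[1]) - (math.log(b) + a * math.log(p[0]))),
        )
        rows.remove(worst)
    return None  # no subset of at least min_points reached the threshold

# Hypothetical data: four points on Y = 2X plus one outlier.
pairs = [(1.0, 2.0), (2.0, 4.0), (4.0, 8.0), (8.0, 16.0), (3.0, 30.0)]
result = harvest(pairs)
print(result)
```

The automatic outlier rule is only one possible stand-in for the judgment exercised in the worksheets; a different rule could retain a different subset of points.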
Design Code Library
Types: The design code library (BIOLOGYtabs 2003; 8.1; 8.2) includes sets of images showing design codes as regression curves; they typically illustrate change. The library includes two collections.
- Simple Design Codes: Identify quantitative and qualitative changes one paper at a time, and
- Complex Design Codes: Identify quantitative and qualitative changes several papers at a time.
Properties and Rules: Design codes offer several features, including some restrictions.
· A design code is expressed as a power equation ($Y = bX^{a}$), plotting a set of related X and Y values (data pairs) and carrying an $R^2 \geq 0.999$. Recall that b is the y intercept and a the slope. A design code is interpreted by inspecting the values of a and b. When a is close to one, the curve is parallel to the reference line. When two or more curves are more or less parallel (a ≈ 1.0), b > 1 indicates an increase and b < 1 a decrease.
· A plot of X vs. X – control vs. control – serves as a reference line (X = Y; $R^2 = 1.0$), representing no change (Figure 3). (N.B., one design code can serve as the reference of another.)

Figure 3. Reference line for design codes.
· A design code equation relies on the reference line for its interpretation. It can be parallel to the reference line (qualitatively similar), nonparallel (qualitatively different), above the reference line (quantitatively more), below (quantitatively less), and superimposed (qualitatively and quantitatively the same). A qualitative change signifies a new design code wherein the proportion of the parts is different (Figure 4). In contrast, a quantitative change identifies more or less material in the same proportions (Figure 5). A change often includes both qualitative and quantitative elements (Figure 4).
o A qualitative change

Figure 4. A qualitative change is indicated when the experimental line is not parallel to the reference line (N.B., the experimental line shown above includes both qualitative and quantitative changes).
o A quantitative change

Figure 5. A quantitative change is indicated when the experimental line is parallel to and above or below the reference line.
· A complex design code (BIOLOGYtabs 2003; 8.2) combines data from several simple design code equations. It characterizes change by structure and by event.
Structural data (V, S, L, N) can be used to identify both qualitative and quantitative changes, whereas density (Vv, Sv, Lv, Nv) and mean (mV, mS, mL) data can detect only qualitative changes.
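The interpretation rules above can be condensed into a small decision function. The sketch below is illustrative only: the text gives no numerical cut-offs for "close to one" or "superimposed", so the tolerances `slope_tol` and `level_tol` are assumptions, as is the function name.

```python
def classify_design_code(a, b, slope_tol=0.05, level_tol=0.05):
    """Classify a design code Y = b * X**a against the reference line (a = 1, b = 1).

    slope_tol and level_tol are assumed tolerances, not values from the library.
    """
    parallel = abs(a - 1.0) <= slope_tol
    if not parallel:
        # Nonparallel to the reference: the proportions of the parts differ.
        return "qualitative change"
    if abs(b - 1.0) <= level_tol:
        # Parallel and superimposed on the reference line.
        return "no change"
    # Parallel but shifted above or below the reference line.
    return "quantitative increase" if b > 1.0 else "quantitative decrease"

print(classify_design_code(1.0, 1.0))   # reference line itself
print(classify_design_code(1.02, 1.8))  # parallel, above the reference
print(classify_design_code(0.98, 0.6))  # parallel, below the reference
print(classify_design_code(1.4, 1.0))   # nonparallel
```

A nonparallel curve may also carry a quantitative component, as noted for Figure 4; the function reports only the qualitative label in that case.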
Applications: Tab 8 of BIOLOGYtabs 2003 presents the design codes by topics, each having a separate tab (Figure 6). The collection includes design codes calculated by paper (8.1: control, experimental, development, aging, disease, structure to function) and by papers (8.2: complex).

Figure 6. Simple design codes calculated with data from a single paper.
The reader might begin by scrolling through the total list and noticing that change in biology consists largely of quantitative events (as suggested by the prevalence of nearly parallel curves). Major qualitative events will be found largely in the development and experimental tables, but such change tends to be temporary rather than permanent. In development, for example, the “adult” design code is established early in life with subsequent growth characterized largely by more or less parallel curves (quantitative change).
A design code screen offers several ways of accessing the data and graphs. Use the drop down data window (the button is in the upper left hand corner of the screen) to scroll through a list of y axis structures and make selections by clicking on a highlighted line. Alternatively, type in a key word (or the first few letters thereof) into the field labeled “Search_Structure_Y” and press Enter. For example, <sch> will retrieve all the schizophrenia graphs and <hu> all those from the human. To view data from a specific paper, type in the citation number and press Enter. More advanced searches can be run using the sort and filter buttons (for examples of scripts, see Bolender, 2002). To view the scratch sheets that were used to make the graph - currently shown on the screen - click on the citation number and then on the Excel button. Click on the Abs (abstract) button to read an abstract of the paper online. A help file containing more information can be called from the top page of the folder. The data entry screens used to populate the design code tables can be found in the appendix of the main program (EBS 1.0; 2001) – after the 2003 upgrade has been completed.
Tab 8.2 of BIOLOGYtabs 2003 includes examples of complex design codes (Figure 7). Notice that the results are often grouped according to the data pair ratios. The ratio (Y/X) is reported as a value less than one (<1) or greater than one (>1), where <1 identifies a decrease and >1 an increase. The histogram, which shows the distribution of these Y/X data, identifies the extent to which a data pair can differ. The lung and liver show considerable change, for example, whereas the brain shows relatively little. Note too that most change appears quantitative – not qualitative, as indicated by slopes with values close to 1.0.

Figure 7. Complex design codes calculated with data from several papers.
Ladder Equation Library
Types: The ladder equation library includes a collection of ladder (exponential) and rung (power) equations that together summarize data pairs for total and selected data sets. The summary takes the form of a single exponential equation: $y = e^{xa}$. Usually, this type of equation is used to describe data of the physical sciences (concentrations, radioactive decay, half life, etc.).
Properties: The ladder equation is remarkable in that it can summarize all the structures in the library database - expressed as data pairs – with the single expression:
$y = 0.000134\,e^{0.7498x}$,
where y equals the y intercept of the power (rung) equations and x the number of the rung (e.g., 1 to 24). Figure 8 illustrates the ladder equation for the total data set.

Figure 8. The ladder equation identifies the order of data pairs as an exponential expression.
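The ladder equation given above can be evaluated directly to see how the rung intercepts grow. The short sketch below uses only the two published coefficients; the function and variable names are our own.

```python
import math

LADDER_COEFF = 0.000134   # prefactor of the ladder equation
LADDER_RATE = 0.7498      # exponent per rung

def rung_intercept(rung):
    """y intercept predicted by the ladder equation for a given rung number."""
    return LADDER_COEFF * math.exp(LADDER_RATE * rung)

# Ratio between neighbouring rungs; it is the same for every pair of rungs.
step_ratio = rung_intercept(2) / rung_intercept(1)

for rung in (1, 2, 3, 24):
    print(rung, rung_intercept(rung))
print("ratio between neighbouring rungs:", step_ratio)
```

Because the equation is exponential, the ratio between neighbouring rungs is constant and equal to e^0.7498, which is a little above two.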
Ladder equations display several properties.
· They can trace the origin of nonparallel curves in data pairs and design codes to data existing on different rungs or to data moving up or down the rungs of the ladder.
· Data taken from one rung of the ladder tend to produce a power curve parallel to the reference curve.
· The y intercept of a rung equation is twice as large as that of the rung below and one half as large as that of the rung above. In other words, moving from one rung to another suggests a quantum (unit) difference. There is either twice as much or half as much. When the requirements of a critical data set are satisfied, the amount of change attributed to the structures x and y can be determined. In effect, we will need to know the ratios (Y/X) that change, those that do not, and the absolute amounts of each structure X and Y. Can you imagine where this type of sleuthing might take us?
· Structures - all across the biological hierarchy - can be summarized explicitly by a set of connected equations.
· The rung equations can mitigate the effects of data clustering in controls, wherein points tend to cluster about a point rather than distribute along a line.
· Sets of data characterizing a given structure tend to be positioned on the rungs in a similar order. However, these similar data sets can assume different positions on the ladder (see Table 2). Table 2 illustrates the relationship of the nucleus to organelles – across several species. If these animals have similar genomes, how do we explain these results?
Table 2. Leydig Cell Rungs (nucleus vs. structure i)
| Rung | Cit 344 Human | Cit 1403 Human | Cit 220 Mouse | Cit 1365 Mouse | Cit 1208 Guinea Pig | Cit 2405 Rat |
| --- | --- | --- | --- | --- | --- | --- |
| 2 | Golgi | | Multivesicular body | | | |
| 3 | | | | | Multivesicular body | |
| 4 | | | | Multivesicular body | | |
| 5 | | | Golgi, Peroxisome | Golgi | Peroxisome | |
| 6 | Reinke Crystal | | Lysosome | | Golgi, Lysosome | |
| 7 | Lipid Droplet | | | | | |
| 8 | | Reinke Crystal | | | | Ribosome |
| 9 | Lipofuscin | Mitochondrion, Lysosome | Mitochondrion, Lipid Droplet | Lipid Droplet | Mitochondrion, Lipid Droplet | Peroxisome, Lipid Droplet |
| 10 | Mitochondrion | | Cytoplasmic Matrix | Mitochondrion | | Lysosome |
| 11 | | | | | | Golgi |
| 12 | | | | | Cytoplasmic Matrix | |
| 13 | Cytoplasm | Cytoplasm | | Cytoplasm | Cytoplasm | Mitochondrion |
| 14 | | | | | | Cytoplasmic Matrix |
· Rung equations display order as a set of parallel regression curves, having an $R^2 = 0.999$ (see Figure 9). Such order seems to be a general phenomenon in biology in that it appears in organs, cells, organelles, etc. Rung equations can tell us something about how structures are constructed and how they change. For example, a quantitative change can be explained as the movement of a data pair from one rung to another – or as no movement at all. If you understand why, then you can use this ladder paradox to explore yet another level of biological complexity.

Figure 9. Rung equations for mitochondrion.
Discussion Science
Biology and Information Science
How does one explore biology as an information science? A curious, yet reassuring pattern that seems to be emerging from the project is that the process of discovery in biology resembles dynamical systems, as described in chaos theory (Waldrop, 1992):
Order → Complexity → Chaos.
However, the Enterprise Biology Software Project reverses the directions of the arrows. Starting with research data in chaos (scientific journals stored on library shelves), complexity reappears by entering research data into a relational database, and order emerges as equations in derived data libraries. In time, this order may lead us to the laws of nature.
Chaos → Complexity → Order → Physical Laws
Dimensions of Information
Biological stereology is a first-rate stepping stone into information science. It allows us to access research data reliably and move them from one dimension to another. These dimensional shifts are basic to reliable data interpretations. Working in different dimensions, however, can be problematic. Recall the following text from the Introduction to Dimensions given in Chapter 2 of Data City – A Short Story (Bolender, 2001a):
“Our familiar world looks very different when viewed from within a given dimension or from different dimensions. In 0 space (space is being used here as a synonym for dimension) you can see points, but not lines, planes, or volumes. In 1 space you can see points and lines, but not planes or volumes, etc. Each dimension therefore has its own set of rules for viewing and interpreting data. Notice that as we move to a higher dimension, the information space becomes enormously richer than the previous one. Grasp the abject poverty of 0 space in contrast to 3 space and it becomes far easier to imagine the astonishing richness of n-dimensional space – especially as n becomes greater than 3”
While these comments were originally directed at biological data, they can be applied as well to the derived data of an information system. Imagine information as a platform from which we can view the same stereological data from different dimensions (0, 1, 2, 3, …, n). A 0 dimension view would be expected to suffer from “abject poverty,” whereas each higher dimension would reveal a greater wealth of information. Does this mean that we can advance our understanding of biology by merely moving our information viewing platform to a higher dimension?
Yes, of course, but first we need to take a hard look at research biology – as it relates to dimensions. Recall the rule given earlier (Bolender, 2001a):
“Always interpret data of a given dimension with rules appropriate to that dimension.”
From an information standpoint, the published research data of experimental biology are being interpreted largely from a platform of 0 dimensions – the familiar change model (Figure 10). To wit: Does the structure or function change, yes or no?

Figure 10. Standard interpretation of research data.
The limitation of this model becomes quickly apparent when one wishes to explain these events in terms of gene function. Once again we face the problem of complexity in that more than one explanation exists for these results. In a genetically controlled system, we now know that the change shown in Figure 10 (experimental) could have been produced by at least three different events (quantitative, qualitative, or quantitative and qualitative).
We also know that explaining gene function will require dimensions higher than zero, because gene function is not an isolated event, but rather one that is highly connected. In fact, the data of a change model are simply a highly restricted view of a connection model (compare Figures 11 and 12). Although most of our experimental papers contain connected data, we have learned to “pull a curtain” around each pair of control and experimental points and then look for a significant difference (Figure 11). Indeed, actively looking for types of change (qualitative; quantitative) and connections within and across data sets is not a common practice in biological stereology.


Figure 11. Change Model: Does it change – yes or no?
The consequence of our selecting a 0 dimensional platform is that we as biologists often end up trying to interpret events in dimensions to which we have no access. We forget that only 0 dimensional questions and answers can be explored from a 0 dimension platform. In truth, the price we are paying for maintaining the antiquated change model in research biology is an “abject poverty” of information. To be sure, there is something sadly amiss when the best solution we can find to the problem of complexity in biology is to simply ignore it.
Design Codes
What happens when we move our information platform to a higher dimension? The design code library allows us to view zero dimensional data (points) from a one dimensional platform (lines). The platform is one dimensional because the library consists of lines that describe how a connected set of structures (and/or functions) are related and how they change. Each design code is produced by fitting 0 dimensional data points to a one dimensional line by calculating a regression equation.


Figure 12. Connection Model: How does it change?
Notice what happens. Once we position ourselves in a dimension higher than zero (Figure 12), we can look back at the 0 dimension and see things that were undetectable from the 0 dimensional platform (Figure 11). For example, we can now clearly see that change – as mediated by a genome – has two distinct and separable properties – y intercept and slope. Recall that $Y = bX^{a}$, where b equals the Y intercept and a the slope.
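Why a power equation has a "slope" and an "intercept" becomes clear when logarithms are taken of both sides. The short derivation below is standard algebra rather than a result of the project, and it assumes the curves are viewed on logarithmic axes:

$$
Y = bX^{a} \quad\Longrightarrow\quad \log Y = \log b + a \log X
$$

On log-log axes the design code is therefore a straight line with slope a and intercept log b, and b itself is the value of Y at X = 1. For the reference line (X = Y) this gives a = 1 and b = 1, so a parallel shift changes only b, whereas a change in proportions changes a.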
Design codes, which represent a set of connected data pairs, display three types of change: quantitative, qualitative, and quantitative + qualitative. However, data sets can display points beyond the domain of the design code – the outliers. Nuclei, rough endoplasmic reticulum (RER), lipid droplets, Sertoli cells, and vessels often fall into this category. Currently, the presence of outliers remains unexplained.
Calculating design codes is an exercise in Excel (Figure 13). Data are plotted as regression curves and filtered by eliminating outliers until the power curve displays an $R^2 \geq 0.999$. These worksheets can be called from the design code screens (e.g., see Figure 6).

Figure 13. Excel work sheet for design codes.
Design Codes - by paper (simple)