Taxonomic Databases for Inexperienced Users
by Kenneth L. Bowles, Professor of Computer Science
& Engineering, UCSD, Retired
and Mary Ann Hawke, Ph.D., Department of Botany, San Diego Natural History
Museum
This page last modified on 19 June, 2009
Abstract
Taxonomic Matrix (Multiple-Entry) databases for use in searching species identity generally are rejected by naive or inexperienced users because the missing-data problem is not recognized (either by database author or user), or because tools to resolve that problem are missing. Data either given non-standard labels by the author, or misinterpreted by the user, may also cause a user's search to fail to identify a correct species just as if there were missing-data.
To cope with these problems, we propose that three regimes should be recognized based on the percentage of all species for which good data is present in the database. We assume the database has at least 25 uncorrelated, and independently defined, menus which describe different characters of the species (e.g. plant height, leaf type, inflorescence arrangement, flower appearance, flower color, flower size, fruit shape, etc.). The regimes:
The goal of most users of a Multiple-Entry database is to get an approximate species identification for an unfamiliar specimen. In many cases these users have tried to reach that goal using a traditional dichotomous key - - but they have failed because they lack information essential to resolve one or more couplets (two-way decision points) along the way. While Multiple-Entry databases generally avoid this problem, they can be equally prone to failure unless care has been taken by the database author to resolve the Missing/Misinterpreted Data problem. If that care has in fact been taken, then the successful user must ultimately turn to detailed published data to make a final decision on the identity of the specimen.
We propose that the database design can be optimized for the needs of inexperienced users by guiding those users through 3 simple steps:
We demonstrate this approach to resolving the Missing/Misinterpreted Data problem by using a moderately large test collection of San Diego County wildflowers, and a Multiple-Entry database built using the XID database building tool. We have designed this database to meet the needs of generalists who are concerned about identifying species belonging to a wide variety of plant families - not for specialists concerned with detailed characteristics of a single genus or small group of species. The approach used probably applies to many other taxonomic disciplines that have many families. Our approach emphasizes the genus and species, and avoids strong dependence on assignments to families (which have frequently changed in recent years).
Introduction
Multiple-Entry (also known as Matrix or Polyclave) databases have been tried repeatedly by many authors interested in computer-based replacement of the dichotomous keys that have been used traditionally in fields where taxonomy is important for species identification. Table 1 of Keys and the Crisis in Taxonomy: Extinction or Reinvention? (Walter & Winterton, 2007) provides more than 20 links to websites which describe Multiple-Entry database systems that are now in use. The paper Principles of Interactive Keys (Dallwitz, 2006) gives a good general review of the design issues involved with these databases, though it is mainly focussed on one such system: DELTA-Intkey.
Note re. terminology: A general problem that impedes widespread use of most database systems is that of non-standard terminology. In this paper, we use the XID term "Menu" which means the same as the biological term "Character" used in DELTA-Intkey (Dallwitz, 2006).
Having used four of those systems (MEKA, SLIKS, Lucid, and XID) for Wildflower species identification in San Diego County over more than five years, we have experienced one pervasive problem - - Missing/Misinterpreted Data can eliminate a correct species from consideration by an inexperienced user. Normal logic for Multiple-Entry databases specifies that when a selection is made from a menu, and when some species in the remaining list lack any data at all for all attributes in that menu, then those species are removed from the remaining species list.
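The elimination rule just described can be sketched in a few lines of Python. This is a minimal illustration only, not the code of MEKA, SLIKS, Lucid, or XID; the matrix layout, the function name `select_strict`, and the three placeholder species are assumptions made for the example.

```python
# Toy species-by-menu matrix (hypothetical data, for illustration only).
# A menu missing from a species' record means no attribute data at all.
MATRIX = {
    "species_a": {"flower_color": {"white"}, "sepals": {"reflexed"}},
    "species_b": {"flower_color": {"yellow"}, "sepals": {"reflexed"}},
    "species_c": {"flower_color": {"white"}},  # no sepal data recorded
}


def select_strict(remaining, menu, attribute):
    """Keep species marked with `attribute`; drop mismatches and also
    species that have no data for `menu` (the normal logic above)."""
    return [s for s in remaining if attribute in MATRIX[s].get(menu, set())]


after_color = select_strict(list(MATRIX), "flower_color", "white")
after_sepals = select_strict(after_color, "sepals", "spreading")
print(after_color)   # species_a and species_c survive
print(after_sepals)  # empty list: species_c is dropped for lack of data
```

If the specimen in hand were really species_c, the second selection would remove the correct answer without any warning, which is the failure this paper is concerned with.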
Generally our main source of data for these databases has been The Jepson Manual (Hickman editor, 1993), recently supplemented by Flora of North America (http://www.efloras.org/index.aspx) for some families. Each of these sources is a compendium of botanic family, genus, species and subspecies descriptions and dichotomous keys collected from many different specialists acting as editors. Some data has been left missing for each genus or species in these sources because editing standards have stressed space limitations, or the editors' differing ideas of utility for identification - - not source-wide standards of uniformity. The data left out varies from editor to editor.
Our definition of a "naive" user or author includes anyone not familiar with, and taking steps to correct for, the Missing/Misinterpreted Data problem. That definition seems to include the vast majority of both authors and users of the 20+ cited database systems. Virtually none of the cited Multiple-Entry databases displays information emphasizing for either author or user that data is missing for some of the still-remaining species referenced by any given menu.
We define an "inexpert" or "inexperienced" user to be one who is not familiar enough with the family, genus or even species of a specimen to know where to find detailed description information in available reference literature.
The developers of Lucid (http://www.lucidcentral.org/) attempted to resolve the Missing-Data problem, in recent versions of the program, by offering a mode which would not display a menu if at least one member of the Species-Remaining list has no attribute data. The menu would re-appear only after the last such species was removed from the list as a result of selections in other menus. This mode proved to be of little value to naive or inexpert users because it makes the overall menu structure difficult to understand.
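The menu-hiding mode described above can be pictured roughly as follows. This is a sketch under stated assumptions, not Lucid's implementation: the function `visible_menus`, the dictionary layout, and the sample species are all hypothetical.

```python
def visible_menus(matrix, remaining, menus):
    """Return only the menus for which every remaining species has data."""
    return [m for m in menus if all(matrix[s].get(m) for s in remaining)]


# Hypothetical two-species matrix; species_b has no stem_surface data.
matrix = {
    "species_a": {"flower_color": {"white"}, "stem_surface": {"hairy"}},
    "species_b": {"flower_color": {"white"}},
}
menus = ["flower_color", "stem_surface"]

both = visible_menus(matrix, ["species_a", "species_b"], menus)
only_a = visible_menus(matrix, ["species_a"], menus)
print(both)    # stem_surface is hidden while species_b remains
print(only_a)  # stem_surface re-appears once species_b is gone
```

The sketch also suggests why the mode is hard to follow: the set of menus on screen changes as a side effect of selections made elsewhere.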
Richard Old, originator of XID (Old, 2008), advises XID database authors to be careful to provide meaningful data for all species referenced by a menu. His sample database covering broadleaf weeds has been developed with care to assure this rule is followed throughout. In recent private communications, Old has suggested that it would be best for the database playback program to retain all species that have Missing-Data during selection. His rationale is based on the assumption that data is likely to be missing from only a small percentage of the Species-Remaining for any given menu. Moreover, he assumes that a search (in a large database - - typically 1000 or more species) will normally be terminated by the user when roughly 10 species are still remaining - - and that final identity of the user's plant will later be resolved by referring to information separate from the database proper, such as comparison with photographs. Old's logic assumes that data will not be missing from other menus used to reach the 10 species goal, thereby usually removing irrelevant species from the search because the menus involved are not correlated. The developers of the Intkey playback program for use with DELTA databases (Dallwitz, 2006) use a similar logic regarding Missing-Data (which they call "Unknowns").
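The retention policy suggested by Old, and the similar treatment of "Unknowns" in Intkey, can be sketched as a small change to the selection rule. The code below is an illustrative assumption, not taken from either program; `select_retaining_unknowns` and the sample data are hypothetical.

```python
def select_retaining_unknowns(matrix, remaining, menu, attribute):
    """Keep species marked with `attribute`, plus species with no data
    for `menu`; drop only species whose recorded data contradicts it."""
    kept = []
    for species in remaining:
        marked = matrix[species].get(menu)
        if not marked or attribute in marked:
            kept.append(species)
    return kept


# Hypothetical matrix; species_c has no data for the sepals menu.
matrix = {
    "species_a": {"sepals": {"reflexed"}},
    "species_b": {"sepals": {"spreading"}},
    "species_c": {},
}
result = select_retaining_unknowns(matrix, list(matrix), "sepals", "reflexed")
print(result)  # species_b is removed; species_c is retained
```

Under this rule a menu with much missing data removes few species, so the approach relies on other, uncorrelated menus to shorten the list.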
Importance of Low-Probability Regime
For identification of wildflowers, we have observed that an expert can usually identify the species of a plant from just a few good photographs - - i.e. without bothering to check all the plant properties in detailed reference literature. Avid bird watchers will testify that the same appears to be true for experts in bird identification. We assume that the same phenomenon also applies in entomology and other fields where the identification of species is very important.
In assembling his collection of photographs of over 500 wildflower species in San Diego County (Bowles, 2008), the first author of this paper often had to seek assistance from experts familiar with the local flora to get an approximate identification. With dichotomous keys providing the main tool of The Jepson Manual for finding the family and genus within a universe of roughly 10,000 species, he had to find an alternate way to begin identifying each plant. To summarize: For an inexpert user, The Jepson Manual and similar compendium references are of no use during the early stages when trying to find the identity of a plant.
Over a period of several years, it became apparent that the assisting experts were using relatively little information (e.g. from a small set of photographs of a plant specimen) to offer a close approximation to the correct species identification. When additional photographs showing further details of the subject plants were checked against The Jepson Manual, the expert's suggested plant identity almost invariably turned out to be correct. How were the experts able to do this? Ultimately we conclude that, in most cases, the experts noticed a small number (typically 5 or fewer) of plant characteristics not generally shared with very many other plants in the universe of the database. Figure 1 shows an example of a characteristic shared only among a few genera in the Onagraceae family:
Figure 1. Oenothera deltoides sepals.
As is apparent, the sepals of this flower are reflexed. According to The Jepson Manual descriptions of genera in the Onagraceae family, this is a characteristic shared among species in the Camissonia, Clarkia, and Oenothera genera but not in Epilobium or Gaura or Ludwigia that occur in San Diego County. It is also found in Ranunculus californicus of the Ranunculaceae family.
What's important here is that (in a database covering a 259 species subset of the nearly 500 for which detailed photos have been collected) only 10 out of 259 species have the sepals reflexed character, while The Jepson Manual makes no mention of whether sepals are reflexed, or not-reflexed in all the other species. Complicating the menu structure is the fact that a substantial minority fraction of the 259 species (e.g. 58 of the Asteraceae species) don't even have sepals. So there is a Low Probability of finding plants with the reflexed or similar characteristics of sepals. A menu provided to cover these similar plant characteristics would increase the overall database complexity, and thus make the Low-Probability characteristics unlikely to be noticed by an inexpert user while browsing the more complex database menus for appropriate descriptors.
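The proportions behind this argument are easy to work out from the counts quoted above. The short calculation below uses only those three numbers; the variable names are ours, and the final figure is an upper bound because species outside the Asteraceae may also lack sepals.

```python
total_species = 259        # size of the test database subset
marked_reflexed = 10       # species with the sepals-reflexed character
asteraceae_no_sepals = 58  # Asteraceae species, which have no sepals

pct_marked = 100 * marked_reflexed / total_species
pct_no_sepals = 100 * asteraceae_no_sepals / total_species
# Species that may have sepals but for which nothing is recorded about
# whether they are reflexed (an upper bound, see the note above).
unrecorded_at_most = total_species - marked_reflexed - asteraceae_no_sepals

print(f"marked reflexed: {pct_marked:.1f}%")
print(f"no sepals (Asteraceae): {pct_no_sepals:.1f}%")
print(f"no statement either way: at most {unrecorded_at_most} species")
```

Fewer than 4% of the species carry the character, while roughly three quarters of the database would have Missing-Data in a sepals-reflexed menu.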
In Principles of Interactive Keys (Dallwitz, 2006) it is stated that the Differentiating Attributes (i.e. the Low Probability attributes) should not be used for identification. But the need to include these attributes for final checking of a tentatively identified species is clearly recognized.
We believe that it is important that this Low Probability attributes list should be presented to the user in a strong and obvious way flowing naturally from the first two search steps. Currently authors of keys for both XID and DELTA seem to regard this requirement as adequately covered with text in the Descriptive Text section associated with some species. Unfortunately, present practice seems to treat this section as optional and only to be referenced by users who care a lot about details. Even if edited diligently and with a standard way of presenting the data for each species, the extra user-interface steps an inexperienced user must take make it likely that the final Step 3 reality check will not be taken.
Use High-Probability Regime for Initial Search
Data is given on some plant characteristics for practically all flowering species in The Jepson Manual. A Menu in a Multiple-Entry database constructed entirely from attributes that belong to one of these characteristics can easily be made to describe virtually all species in the database. In our 259 species test database there are 11 menus in this High-Probability category - - 10 of them describing plant characteristics that would be easy for an inexpert user to observe. One of the 11, Plant Duration, covers a range of attributes that an inexpert user would need to estimate. XID allows such a user to combine several attributes such as annual or perennial using OR, thereby preventing removal of a species from the remaining list even if the user's guess about the correct attribute is not quite correct (e.g. the Oenothera certainly is not a shrub or a tree). Intkey treats the OR mode as a default that a user must turn off to get exclusive treatment of attributes. In Lucid either the OR or AND treatment can be used, but must be set as a Matching Method which applies to all menus from which selections have been made so far. The OR treatment is applied by default - - i.e. unless changed by the user.
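The difference between the OR and AND treatments within a single menu can be shown with ordinary set operations. This is a sketch of the general idea, not the matching code of XID, Intkey, or Lucid; the function `matches` and the sample attribute sets are assumptions.

```python
def matches(marked, chosen, mode="or"):
    """Decide whether a species stays on the remaining list.

    marked: attributes recorded for the species in one menu
    chosen: attributes the user selected in that menu
    """
    if mode == "or":
        return bool(marked & chosen)  # any chosen attribute is enough
    if mode == "and":
        return chosen <= marked       # every chosen attribute is required
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical Plant Duration record for one species.
marked = {"perennial"}
# An unsure user selects both likely values.
chosen = {"annual", "perennial"}

print(matches(marked, chosen, "or"))   # True: the species is kept
print(matches(marked, chosen, "and"))  # False: the species would be removed
```

With OR, a hedged guess costs the user only a slightly longer remaining list; with AND, the same guess can remove the correct species.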
We recommend that a user should generally start by selecting attributes from a collection of High-Probability menus, thereby reducing the Species-remaining list to 25 or fewer. For example with the Oenothera photos in hand, we selected 3 attributes (Stem Length 2 - 4 dm, 4 Petals (or Sepals), and Color: White) resulting in 7 remaining species, as shown in Figure 2.
Figure 2. Three High-Probability attributes selected.
A quick check of the Gallery shows one photograph resembling our specimen, as shown in Figure 3.
Figure 3. Gallery
The small box at bottom of this figure results from hovering the mouse pointer over the place where the arrowhead points - - i.e. over the photo that matches our plant. None of the other photos in this Gallery even comes close to matching the specimen plant. Thus Oenothera deltoides becomes the single candidate species for the final Step 3 reality check.
After Initial Search, Check the Marked Low-Probability Attributes
In XID, to display the species Description one highlights (or double-clicks) the Oenothera deltoides item in the Species-remaining list seen in Figure 2. The Description includes the list of Attributes sampled in Figure 4.
Figure 4. Description
The (sub)menu names are shown in red. Only the Attributes marked for this species are listed; they are shown in blue. We can easily see that the four marked attributes are all appropriate for this plant. All submenu names are marked "nrc$" to allow a user to recognize which menus found in the XID Analyze and Distinctive Attributes lists are from the Notable Recognition Clues menu section.
This quick check allows an inexpert user to approximate what an expert does from memory. In this Test Database we have designed the Notable Recognition Clues menu section of the key primarily to support this form of display as the basis for the final Step 3 reality check of the search process. It is placed first in the complete Interactive Identification menu of XID, not so that it forms part of the attribute-browsing stage of the search, but so that it appears first in the Species Description list of XID.
Medium-Probability Data of Only One Special Kind Should be Included
On most menu categories, sources like The Jepson Manual provide data for only a subset of the species they describe. This Medium-Probability data, such as the surface condition of stems (glabrous, glandular, tiny hairs, hairy, bristly, etc.) is considered by some contributing editors to be useful for identification, but not significant by others. Even worse, different contributing authors often use different terms to mean virtually the same thing. An author trying to assemble a Multiple-Entry database that includes menus for these categories usually has to leave data missing for the species that lack data in the published source. That database author may also wish to convert differing but similar descriptive terms to just a single "standard" for the database in order to avoid leading inexpert users to a misidentification because of unfamiliarity with terms. But the database author is prone to make errors in this process.
We conclude that it is a mistake to include this Medium-Probability data in a Multiple-Entry database intended to be used by inexpert users for species identification purposes.
There is one situation that justifies violating this rule: A large subset of species in the database might be characterized by some attribute that applies only to that subset. For example, in a general collection of wildflowers that includes species in the Asteraceae family, characteristics of the phyllaries are very often cited for identification purposes. Phyllaries are similar to Sepals in most other families, but they are different enough to make it important to separate these dimensions in different menus. The only way to avoid the Missing-Data problem in this situation is to create a special section of menus devoted only to members of the Asteraceae family. There needs to be a way to prevent the Inexperienced user from browsing in this special section of menus until it has been established that all remaining species are in the Asteraceae. XID currently provides no mechanism for the database author to enforce this rule. DELTA does provide for some such dependencies, but is at best awkward to use for the situation described here. Lucid also provides for similar dependencies.
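The kind of dependency rule asked for here can be stated very simply. The check below is a hypothetical sketch of what a playback program might do; it is not a feature of XID, and it does not reproduce the dependency mechanisms of DELTA or Lucid. The function name and the family table are assumptions.

```python
def section_unlocked(families, remaining, required_family="Asteraceae"):
    """Allow browsing a family-specific menu section only when every
    remaining species belongs to that family."""
    return bool(remaining) and all(
        families[s] == required_family for s in remaining
    )


# Hypothetical family assignments for three placeholder species.
families = {
    "species_a": "Asteraceae",
    "species_b": "Asteraceae",
    "species_c": "Onagraceae",
}

print(section_unlocked(families, ["species_a", "species_b", "species_c"]))
print(section_unlocked(families, ["species_a", "species_b"]))
```

The first call reports that the section is still locked; the second reports that it may be opened, because only Asteraceae species remain.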
In our test database, we have tried to solve this problem for the special-situation menus that apply to the Asteraceae species by adding a marker attribute Unknown (no data) for the species with Missing-Data. For example, Figure 5 shows one of the Asteraceae group submenus.
Figure 5. No attributes yet selected.
When the added marker attribute is highlighted, the XID right side pane displays the warning shown in Figure 6.

Figure 6. Warning pane
The numbers displayed by XID in the little boxes to left of the attribute names show the number of species still remaining that correspond to the attributes. This is an important feature of XID which is missing from the Intkey playback program of DELTA, and missing in Lucid keys. In Figure 5, no attributes have yet been selected. Figure 6 shows a menu sample from a search in which the warning shown for the Unknown (no data) attribute is from a stage in the selection process when none of the species remaining has Missing-Data.
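The per-attribute counts, including a count for the Unknown (no data) marker, can be sketched as follows. This is an illustration of the idea only and not XID's code; the function `attribute_counts`, the menu name, and the sample records are hypothetical.

```python
from collections import Counter

UNKNOWN = "Unknown (no data)"


def attribute_counts(matrix, remaining, menu):
    """Count remaining species per attribute of `menu`; species with no
    data for the menu are counted under the UNKNOWN marker."""
    counts = Counter()
    for species in remaining:
        marked = matrix[species].get(menu) or {UNKNOWN}
        counts.update(marked)
    return counts


# Hypothetical records; species_c has no data for this menu.
matrix = {
    "species_a": {"phyllary_shape": {"lanceolate"}},
    "species_b": {"phyllary_shape": {"lanceolate", "ovate"}},
    "species_c": {},
}
counts = attribute_counts(matrix, list(matrix), "phyllary_shape")
for attribute, n in sorted(counts.items()):
    print(n, attribute)
```

A non-zero count beside the marker tells the user, before any selection is made, that choosing from this menu could remove species for lack of data.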
Special Menu or Description Text for Notable Recognition Clues?
We have discussed the use of Low-Probability attributes to provide quick verification of the identity of a plant species. In our test database of San Diego area wildflowers, we have separated these attributes into a menu devoted to Notable Recognition Clues. None of the submenus of this menu even approaches the goal of presenting useful data covering every species in the database (i.e. without the Missing-Data complication).
Among those who have reviewed this database two alternate methods have been suggested for presenting the list of Differentiating Low-Probability attributes as Notable Recognition Clues.
One of those methods regards the Description text (edited as text by the database author) associated with each species to be the preferred location of this list. Presenting the list in this way would certainly avoid having various menus containing extensive Missing-Data in the database. But doing it that way implies that the author must apply a discipline across the entire set of species in order to present the Notable Recognition Clues list in a way that has (for the inexpert user) an obvious relationship with the first-stage High-Probability menus. It would also make it far more difficult for a user to discover how many species, and how many different genera, share any Differentiating attribute that may be of interest.
In developing our test database, we have chosen instead to take advantage of the existing features of XID in order to present the list shown in Figure 4. This has required leaving an attribute, which might be called Unknown (no data), out of each submenu. That attribute might otherwise have marked every species with missing-data in range of the submenu. Had we added such an attribute to every submenu, the generated Description list of marked attributes would have been visually cluttered by a large number of blue items reading "Unknown (no data)" - - making the Notable Recognition Clues much less obvious.
XID has no feature today whereby an author can arrange to prevent an inexpert user from making selections in the Notable Recognition Clues menus, all of which have many species with missing-data. So as a first line of defense, when the user expands the top level Notable Recognition Clues menu, the right display panel shows the warning seen in Figure 7. All submenu names are marked "nrc$" to allow a user to recognize which menus found in the XID Analyze and Distinctive Attributes lists are from this menu section.
Figure 7.
The Description field of this menu explanation panel could contain text to explain the warning, but most users would just ignore that text.
Figure 7 also illustrates the Attributes marked list for the Notable Recognition Clues menu - - though the list shown with the warning almost always belongs to a Species other than the species currently highlighted in the lower left panel of the XID screen. It is debatable whether the Description text field (in black) at top of this panel (possibly in a smaller font, and not in bold text) should explain that this list is shown for illustration only. Highlighting (or double-clicking) any species shown in the lower left panel will display this Attributes marked list for that species. We have placed the Notable Clues menu before all menus intended for selection use in the test database - - because the resulting Attributes marked list is placed near top of the Description panel thereby encouraging use for visual reality checks by the user.
The Best Characters a.k.a. Analyze Lists of Suggested Characters
The Intkey program of DELTA places heavy reliance on a "Best Characters" list similar to the "Analyze" list of XID. Both lists suggest the best Menus (Characters) to use in making subsequent Attribute choices. Users of Intkey are expected to rely on this list whereas users of XID have the option to use the Analyze list when convenient. But as implemented we find that about a third of the top 15 suggestions of both lists offer characters likely to contain error-prone attributes for some remaining species. Intkey does place lower in the list those Menus (Characters) having Missing-Data (Unknowns) for a substantial fraction of the remaining species (taxa). With newly released versions of XID already displaying missing-data counts in its menus, we believe that a similar provision can be added to the Analyze list - thus removing the Notable Clues menus from consideration as best to use for subsequent search steps. The Lucid "Find Best Feature" command works from a similar list, but allows selecting from only one menu at a time. Multiple uses of "Find Next Best" may be required to find a menu suitable in context of a current search - - i.e. the user cannot exercise judgement by quickly scanning the first dozen or so offered "best features".
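One way such a provision might work is to sort candidate menus by the fraction of remaining species with missing data before applying any other measure of usefulness. The ranking below is a hypothetical sketch; it is not the Best Characters algorithm of Intkey, the Analyze algorithm of XID, or Lucid's Find Best Feature, and the tie-breaking rule (number of distinct attributes) is an assumption chosen for simplicity.

```python
def rank_menus(matrix, remaining, menus):
    """Order menus so that those with less missing data come first;
    ties go to the menu with more distinct attributes among the
    remaining species."""
    def sort_key(menu):
        missing = sum(1 for s in remaining if not matrix[s].get(menu))
        distinct = {a for s in remaining for a in matrix[s].get(menu, ())}
        return (missing / len(remaining), -len(distinct))
    return sorted(menus, key=sort_key)


# Hypothetical records; only species_a has a notable clue recorded.
matrix = {
    "species_a": {"flower_color": {"white"}, "nrc$ sepals": {"reflexed"}},
    "species_b": {"flower_color": {"yellow"}},
    "species_c": {"flower_color": {"white"}},
}
ranked = rank_menus(matrix, list(matrix), ["nrc$ sepals", "flower_color"])
print(ranked)  # flower_color first; the nrc$ menu is pushed down
```

Because the missing-data fraction is the primary sort key, a menu such as the hypothetical "nrc$ sepals" falls to the bottom of the list however distinctive its attributes are.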
References
Bowles, Kenneth L., 2008, http://www.kenbowles.net/sdwildflowers/
Dallwitz, M.J., T.A. Paine, E.J. Zurcher, 2006, Principles of Interactive Keys, http://delta-intkey.com/www/interactivekeys.htm
Flora of North America Editorial Committee, eds. 1993+, Flora of North America, Oxford University Press, available online at http://www.efloras.org/
Hickman, James C. (ed), 1993, The Jepson Manual - Higher Plants of California, University of California Press.
Old, Richard R., 2008, http://www.xidservices.com
Walter, David Evans and Winterton, Shaun, 2007, Keys and the Crisis in Taxonomy: Extinction or Reinvention?, Annual Review of Entomology.