Friday, July 27, 2012

Talking Taxonomies

When most people hear the word taxonomy, they probably have a flashback to the high school biology class where they were forced to memorize the taxonomic hierarchy used to classify plants, animals, and other natural objects. This Linnaean classification system (named after Carl Linnaeus, the Swedish scientist who created it) uses a multilevel taxonomy of domain, kingdom, phylum, and so forth to determine where, for instance, a domestic cat should be classified.

The Linnaean system is very useful in certain contexts, but it is far too heavily structured for business use. When classifying documents, we generally need only a one- or two-level hierarchy to place a given item in the correct category. Business taxonomies can also be a bit more fluid, with categories changing over time as requirements shift. Categories may come and go, while more formal taxonomies are rigid and strict.

So, what does a business taxonomy look like? The answer really depends on the organization and what its needs are. At a basic level, a generic taxonomy might look something like this:
  • Accounting and Finance
  • Operations
  • Human Resources
  • Engineering
  • Legal and Regulatory
  • Executive
But this is just a starting point, and your business's needs might be totally different. A medical office, for example, would need to divide its documents in a completely different manner, based on patient type, visit type, specialty, or other criteria. Or it might need both a business taxonomy (as above) and a medical taxonomy to fulfill the organization's needs.
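
As a rough illustration, a one- or two-level taxonomy like the ones described above could be held in a plain dictionary. This is only a sketch: the top-level business categories come from the list above, while the medical subcategories, the `category_paths` helper, and the `document_labels` layout are hypothetical names made up for the example.

```python
# Top-level business categories are taken from the list above.
BUSINESS_TAXONOMY = {
    "Accounting and Finance": [],
    "Operations": [],
    "Human Resources": [],
    "Engineering": [],
    "Legal and Regulatory": [],
    "Executive": [],
}

# Hypothetical medical taxonomy; the subcategories are invented for illustration.
MEDICAL_TAXONOMY = {
    "Visit Type": ["New Patient", "Follow-up"],
    "Specialty": ["Oncology", "Pediatrics"],
}


def category_paths(taxonomy):
    """Flatten a one- or two-level taxonomy into 'Parent / Child' labels."""
    paths = []
    for parent, children in taxonomy.items():
        if children:
            paths.extend(f"{parent} / {child}" for child in children)
        else:
            paths.append(parent)
    return paths


# A single document can carry one label from each taxonomy.
document_labels = {
    "business": "Accounting and Finance",
    "medical": "Specialty / Oncology",
}

print(category_paths(MEDICAL_TAXONOMY))
```

Keeping the two taxonomies separate, rather than merging them into one tree, lets each one change on its own schedule.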

Reality-based

Taxonomies for use with automated document classification software need to be based on the characteristics of documents that fall into the proposed categories. Document classification software must be able to examine a sample set of pre-categorized documents for each taxonomy entry in order to identify key concepts that will allow it to then correctly identify subsequent documents. This is the basic idea behind machine learning, which is the fundamental methodology behind automated content classification.
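
To make the idea of learning from pre-categorized samples concrete, here is a minimal sketch of a word-count (Naive Bayes style) classifier in plain Python. It is not how any particular classification product works; the sample documents, the function names, and the choice of Naive Bayes itself are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(samples):
    """Count word occurrences per category from (category, text) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for category, text in samples:
        word_counts[category].update(tokenize(text))
        doc_counts[category] += 1
    return word_counts, doc_counts


def classify(text, word_counts, doc_counts):
    """Return the category with the highest Naive Bayes log-score."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, float("-inf")
    for category, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[category] / total_docs)
        for word in tokenize(text):
            # Add-one smoothing so an unseen word does not rule out a category.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical pre-categorized sample documents, two per taxonomy entry.
SAMPLES = [
    ("Accounting and Finance", "invoice payment due balance ledger"),
    ("Accounting and Finance", "quarterly budget expense report invoice"),
    ("Human Resources", "employee benefits vacation policy handbook"),
    ("Human Resources", "new hire onboarding employee payroll form"),
]

word_counts, doc_counts = train(SAMPLES)
print(classify("overdue invoice and remaining balance", word_counts, doc_counts))
```

With only two samples per category the result is fragile, which is the point: the software knows a category only through the examples it has been shown.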

You can't, for instance, simply establish a taxonomy and expect the software to know what is meant by "accounting and finance" or "oncology" without giving it a context. I use the imagery of training a factory worker to understand the difference between boxes, bottles, and cans in order to separate them into the correct containers as they come down the assembly line. If you hire a worker who (improbably) has never seen a bottle before, they won't be very adept at sorting bottles from the other two categories.

Likewise, basic instructions about what a "can" is (maybe something like "6 inches tall, made of metal, flat top and bottom, and contains liquid") might not work in all cases. What about objects that fit 3 of 4 criteria (maybe they're only 3 inches tall)? How should the worker handle this case? Managing this sort of ambiguous situation is one of the reasons that taxonomy training in automated classification is generally an iterative process; you can't just do it once and expect perfection.
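
The "3 of 4 criteria" problem can be sketched in a few lines. The four checks below come straight from the can description above; the `Item` class and the function names are hypothetical, and a real classifier would learn its criteria rather than have them hard-coded.

```python
from dataclasses import dataclass


@dataclass
class Item:
    height_in: float
    material: str
    flat_ends: bool
    holds_liquid: bool


def can_criteria(item):
    """Evaluate the four 'can' rules from the worker's instructions."""
    return [
        item.height_in == 6,
        item.material == "metal",
        item.flat_ends,
        item.holds_liquid,
    ]


def is_can_strict(item):
    """A strict rule: every criterion must hold."""
    return all(can_criteria(item))


def can_match_fraction(item):
    """A softer view: what fraction of the criteria hold?"""
    checks = can_criteria(item)
    return sum(checks) / len(checks)


# A 3-inch can meets three of the four criteria.
short_can = Item(height_in=3, material="metal", flat_ends=True, holds_liquid=True)
print(is_can_strict(short_can))       # False
print(can_match_fraction(short_can))  # 0.75
```

The strict rule rejects the short can outright, while the fractional score shows it is a near miss. Deciding where to draw that line is the kind of adjustment that gets revisited on each training pass.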

For that matter, what happens if the worker achieves a high success rate but someone changes the characteristics or type of objects that can potentially appear on the line? For example, the company now starts selling "juice boxes"; this means our worker will now see containers that have box-like characteristics (made of cardboard, are rectangular, etc.) but are also bottle-like (contain liquid and are 6" tall). How should they handle this ambiguity?
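
One common way to cope with this kind of ambiguity is to score an item against every category and send it to a human when the top two scores are too close. The sketch below assumes that approach; the feature names are lifted from the juice box description above, and `PROFILES`, `classify_or_flag`, and the `margin` threshold are hypothetical.

```python
# Category profiles built from the characteristics named in the text.
PROFILES = {
    "box": {"cardboard", "rectangular"},
    "bottle": {"contains liquid", "6 inches tall"},
}


def score(features, profile):
    """Fraction of a category's profile features that the item exhibits."""
    return len(features & profile) / len(profile)


def classify_or_flag(features, margin=0.25):
    """Classify an item, or flag it when the top two categories are too close."""
    ranked = sorted(
        ((score(features, profile), name) for name, profile in PROFILES.items()),
        reverse=True,
    )
    (best_score, best), (second_score, second) = ranked[0], ranked[1]
    if best_score - second_score < margin:
        return ("needs review", best, second)
    return ("classified", best, None)


juice_box = {"cardboard", "rectangular", "contains liquid", "6 inches tall"}
print(classify_or_flag(juice_box))
print(classify_or_flag({"contains liquid", "6 inches tall"}))
```

The juice box matches both profiles equally well, so it is flagged rather than forced into a category. A flagged item is a signal that a subject matter expert may need to add or redefine a category.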

This is one of the reasons business taxonomies are so fluid. New products appear, others vanish, and styles change. Consider the problem of classifying automobiles if you have sedans, coupes, trucks, station wagons (remember those?) and vans as your taxonomy categories. How do you handle the appearance of the SUV, or more recently the "crossover" vehicle?

Overall, taxonomies are relatively simple but require actual forethought and ongoing maintenance by subject matter experts who understand the overall direction of the company and what the ramifications of correct or incorrect classification of a given document might be. We'll talk about them in more depth in subsequent blog entries.

1 comment:

  1. I believe Carl Linnaeus was Swedish, not Dutch. His 'Systema Naturae' was published in the Netherlands, where he lived and studied for several years.