The Linnaean system is very useful in certain contexts, but it is far too heavily structured for business use. When classifying documents, we generally need only a one- or two-level hierarchy to place a given item in the correct category. Business taxonomies can also be a bit more fluid, with categories coming and going as requirements shift, while more formal taxonomies are rigid and strict.
So, what does a business taxonomy look like? The answer really depends on the organization and what its needs are. At a basic level, a generic taxonomy might look something like this:
- Accounting and Finance
- Human Resources
- Legal and Regulatory
Reality based
Taxonomies for use with automated document classification software need to be based on the characteristics of documents that fall into the proposed categories. Document classification software must be able to examine a sample set of pre-categorized documents for each taxonomy entry in order to identify key concepts that will allow it to then correctly identify subsequent documents. This is the basic idea behind machine learning, which is the fundamental methodology behind automated content classification.
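To make the idea of learning from pre-categorized samples concrete, here is a minimal Python sketch. It is not how any particular classification product works: the sample documents are invented, the function names (`tokenize`, `train`, `classify`) are hypothetical, and simple word counting stands in for the far more sophisticated concept extraction that real software performs.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def train(samples):
    """Build one word-frequency profile per category from labelled samples.

    samples maps a category name to a list of example document texts.
    """
    profiles = {}
    for category, documents in samples.items():
        counts = Counter()
        for doc in documents:
            counts.update(tokenize(doc))
        profiles[category] = counts
    return profiles


def classify(profiles, text):
    """Return the category whose profile overlaps most with the text."""
    words = tokenize(text)
    scores = {
        category: sum(counts[word] for word in words)
        for category, counts in profiles.items()
    }
    return max(scores, key=scores.get)


# Hypothetical training documents, invented purely for illustration.
SAMPLES = {
    "Accounting and Finance": [
        "invoice for quarterly audit fees and ledger reconciliation",
        "budget forecast and expense report for the fiscal year",
    ],
    "Human Resources": [
        "employee onboarding checklist and benefits enrollment",
        "performance review schedule and vacation policy",
    ],
    "Legal and Regulatory": [
        "contract amendment and compliance filing with the regulator",
        "nondisclosure agreement and litigation hold notice",
    ],
}

profiles = train(SAMPLES)
print(classify(profiles, "Please approve the attached expense report and invoice"))
# Accounting and Finance
```

Without the sample documents, the category names are just labels; the software has nothing to compare a new document against. A production system would also weight terms (so that common words like "the" count for less) and train on far more than two documents per category.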
You can't, for instance, simply establish a taxonomy and expect the software to know what is meant by "accounting and finance" or "oncology" without giving it a context. I use the imagery of training a factory worker to understand the difference between boxes, bottles, and cans in order to separate them into the correct containers as they come down the assembly line. If you hire a worker who (improbably) has never seen a bottle before, they won't be very adept at sorting bottles from the other two categories.
Likewise, basic instructions about what a "can" is (maybe something like: 6" tall, made of metal, flat top and bottom, and contains liquid) might not work in all cases. What about objects that fit 3 of 4 criteria (maybe they're only 3" tall)? How should the worker handle this case? Managing this sort of ambiguous situation is one of the reasons that taxonomy training in automated classification is generally an iterative process; you can't just do it once and expect perfection.
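The "3 of 4 criteria" problem can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: the `Item` fields, the half-inch height tolerance, and the `accept` and `review` thresholds are all hypothetical choices, not rules from any real product.

```python
from dataclasses import dataclass


@dataclass
class Item:
    height_in: float
    material: str
    flat_ends: bool
    holds_liquid: bool


# The four "can" criteria from the example, written as named checks.
# The 0.5 inch tolerance on height is an assumed value.
CAN_CRITERIA = {
    "about 6 inches tall": lambda item: abs(item.height_in - 6.0) <= 0.5,
    "made of metal": lambda item: item.material == "metal",
    "flat top and bottom": lambda item: item.flat_ends,
    "contains liquid": lambda item: item.holds_liquid,
}


def can_score(item):
    """Return the fraction of "can" criteria the item satisfies (0.0 to 1.0)."""
    matched = sum(1 for check in CAN_CRITERIA.values() if check(item))
    return matched / len(CAN_CRITERIA)


def sort_item(item, accept=1.0, review=0.75):
    """Accept full matches, flag near misses for a person, reject the rest."""
    score = can_score(item)
    if score >= accept:
        return "can"
    if score >= review:
        return "needs review"
    return "not a can"


standard_can = Item(height_in=6.0, material="metal", flat_ends=True, holds_liquid=True)
short_can = Item(height_in=3.0, material="metal", flat_ends=True, holds_liquid=True)

print(sort_item(standard_can))  # can
print(sort_item(short_can))     # needs review
```

Routing near misses to a person rather than forcing a yes-or-no answer is one way to feed the iterative process: each reviewed item becomes a new training example, and the thresholds can be adjusted as the results come in.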
For that matter, what happens if the worker achieves a high success rate but someone changes the characteristics or type of objects that can potentially appear on the line? For example, the company now starts selling "juice boxes"; this means our worker will now see containers that have box-like characteristics (are made of cardboard, are rectangular, etc.) but are also bottle-like (contain liquid and are 6" tall). How should they handle this ambiguity?
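The juice box case can be shown the same way. In the rough sketch below, each category is described by a set of features and an item is ranked by how many features it shares with each one. The feature lists and the `rank` function are hypothetical, made up for this illustration.

```python
# Hypothetical feature sets for the categories on the line.
TAXONOMY = {
    "box": {"cardboard", "rectangular", "holds solids"},
    "bottle": {"glass", "cylindrical", "holds liquid", "6 inches tall"},
    "can": {"metal", "cylindrical", "holds liquid", "6 inches tall"},
}


def rank(taxonomy, features):
    """Rank categories by how many of the item's features each one shares."""
    scores = {name: len(traits & features) for name, traits in taxonomy.items()}
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


juice_box = {"cardboard", "rectangular", "holds liquid", "6 inches tall"}

# Every existing category shares exactly two features with the juice box,
# so the current taxonomy gives no basis for choosing among them.
print(rank(TAXONOMY, juice_box))
# [('box', 2), ('bottle', 2), ('can', 2)]

# Maintaining the taxonomy: add a category for the new product.
updated = dict(TAXONOMY)
updated["juice box"] = {"cardboard", "rectangular", "holds liquid", "6 inches tall"}
print(rank(updated, juice_box)[0])
# ('juice box', 4)
```

The tie is the signal that the taxonomy, not the worker, needs attention: no amount of extra training on the old categories resolves it until someone decides where the new product belongs.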
This is one of the reasons business taxonomies are so fluid. New products appear, others vanish, and styles change. Consider the problem of classifying automobiles if you have sedans, coupes, trucks, station wagons (remember those?) and vans as your taxonomy categories. How do you handle the appearance of the SUV, or more recently the "crossover" vehicle?
Overall, taxonomies are relatively simple but require actual forethought and ongoing maintenance by subject matter experts who understand the overall direction of the company and what the ramifications of correct or incorrect classification of a given document might be. We'll talk about them in more depth in subsequent blog entries.