Tuesday, March 19, 2013

The Problem of Expansion

A while back, I posted an article discussing the issue of content normalization and how it could help (or hinder) the success of a classification effort. You could effectively think of normalization as a means of sanitizing and generally cleaning up content in order to achieve better results and analytics in general. However, there's another issue that's related to this topic. It's similar to the problem of query expansion in search engines, which is intended to improve recall and generally push more of the correct results to the user.

Consider the following scenario.

You have three categories in your taxonomy: aircraft, automobiles, and boats. You train your statistical classifier using a document corpus as usual, and achieve generally good results. However, you're getting poor results in certain areas and discover (after the requisite research) that it's because there are certain words or phrases in documents that were never used in your training corpus. Let's say the word "oleo" is one problem: you know it's related to aircraft because it's a specific type of landing gear, but it appeared in none of your training documents so you're getting poor results in the proper category when this phrase is present.

The solution, at least in this situation, is probably to enhance the classifier's training by introducing additional training documents containing the salient phrase into the "aircraft" category. That patches this hole, but it doesn't address hundreds of others that might exist in the same small taxonomy. The bigger issue involves terms like hyponyms, hypernyms, meronyms, and others you may never have heard of. Let's talk about this a bit, in order to make the problem a bit clearer.
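One way to find these holes before they hurt you is to compare the vocabulary of your training corpus against the words in an incoming document. What follows is only a minimal sketch: the function names, the tiny training corpus, and the simple tokenizer are all assumptions made for illustration, not part of any particular classifier.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into simple alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def unseen_terms(training_docs, new_doc):
    """Return the terms in new_doc that occur in none of the training documents."""
    vocabulary = set()
    for doc in training_docs:
        vocabulary.update(tokenize(doc))
    return sorted(set(tokenize(new_doc)) - vocabulary)

# Hypothetical miniature training corpus for the "aircraft" category.
aircraft_training = [
    "The aircraft landing gear retracted after takeoff.",
    "Inspect the wing flaps and the landing gear before each flight.",
]
new_document = "The oleo strut on the landing gear was leaking."

# Lists every word the classifier has never seen, including "oleo".
print(unseen_terms(aircraft_training, new_document))
```

A report like this tells you which terms are missing, but not which ones matter. Someone who knows the content still has to decide that "oleo" deserves new training documents while "was" does not.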

Given a certain term or word (let's say "red"), one can become either more general or more specific. On the general side of the equation, "red" is a member of the set of all colors. Therefore, the hypernym of red is color (i.e. "red is a color"). On the more specific side, there are subclasses of the color red: crimson, scarlet, carmine, and so forth. These are hyponyms (more specific cases) of the color red.

Likewise, there's the idea of meronyms, which are components or parts of a whole. For instance, finger, thumb, palm, and so forth are meronyms of the word "hand." And let's not forget troponyms, which are more precise ways of describing something: stride, amble, and saunter are example troponyms of the word "walk."

All these things can affect classification results. If one of your categories is named injuries to the hand, and your training documents don't contain all the meronyms for parts of the hand, then you might not get all the results you really want.

The overall problem, however, is how far to go with any of this. Do you need to make sure your training sets contain all relevant terms related to a given topic (possibly difficult at best), or do you need to perform some method of query expansion for specific terms occurring in new documents to be classified? Or does the solution vary by use case?

I suspect the answer is the latter: "it depends" is a great all-encompassing reply when dealing with any issue related to linguistics.

Some search engines already contain this type of functionality, generally in the form of "did you mean?" or "we also searched for" expansion of given queries based on built-in ontologies or other sources. However, deciding which words in a to-be-classified document will benefit from expansion is not easy. One could easily remove a fixed list of stopwords from consideration, but that could still leave hundreds or thousands of others to consider. Expanding all of them is a performance-killing option, and one that probably would not produce any discernible benefit to the classification attempt. In all likelihood, the only way to deal with this situation is by knowing the content and taxonomy involved, using the latter in order to define a fixed set of synonyms and other-nyms (I'm not going to write out the whole list again...you get the picture) that will assist classification.
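To make the "fixed set" idea concrete, here is a rough sketch of targeted expansion. The expansion map, the stopword list, and the function name are hypothetical; in practice the map would be built by people who know the taxonomy, and only the listed terms are ever expanded.

```python
import re

# Hypothetical, hand-curated expansion map tied to the
# aircraft / automobiles / boats taxonomy used above.
EXPANSIONS = {
    "oleo": ["landing gear", "aircraft"],
    "keel": ["boats"],
}

# A deliberately tiny stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "on", "was", "of", "and"}

def expand_document(text):
    """Drop stopwords, then append the curated expansions for known terms."""
    expanded = []
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        if token in STOPWORDS:
            continue
        expanded.append(token)
        # Terms without an entry in the map are passed through untouched.
        expanded.extend(EXPANSIONS.get(token, []))
    return expanded

print(expand_document("The oleo was leaking"))
# ['oleo', 'landing gear', 'aircraft', 'leaking']
```

Because only terms in the map trigger any work, the cost stays proportional to the size of the curated list rather than to the size of the language.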

Any activity involving classification of unstructured content is by definition difficult and often ambiguous. Definitions and concepts that are obvious to humans must be taught to machines using rules of logic and statistics that are not intuitive or easy to implement. We're making progress, but it will take time. Until that happens,

Friday, February 1, 2013

Medical Content Classification

Over the last year or so I've been spending a lot of time trying to understand the various medical taxonomies and ontologies that are in common use across the world. This happened because a number of clients and business partners started asking about automating classification of things like ICD-10 ("International Classification of Diseases") and SNOMED codes using a machine learning algorithm. Until that work started, I had no idea how many different coding systems were in common use in the medical industry. Even worse is that many have been "extended" by various national medical systems in order to meet the needs of local providers and/or insurance requirements. This makes creating a machine learning style classifier even more difficult!

The base ICD-10 data set starts off looking like this:

4;T;X;01;A00;A00.0;A00.0;A000;Cholera due to Vibrio cholerae 01, biovar cholerae;001;4-002;3-003;2-001;1-002
4;T;X;01;A00;A00.1;A00.1;A001;Cholera due to Vibrio cholerae 01, biovar eltor;001;4-002;3-003;2-001;1-002
4;T;X;01;A00;A00.9;A00.9;A009;Cholera, unspecified;001;4-002;3-003;2-001;1-002
3;N;X;01;A00;A01.-;A01;A01;Typhoid and paratyphoid fevers;002;4-002;3-003;2-003;1-004
4;T;X;01;A00;A01.0;A01.0;A010;Typhoid fever;002;4-002;3-003;2-003;1-004

Effectively, Cholera comes under family A (A00 through B99 being "Certain infectious and parasitic diseases") and has entry 00. Subtypes like biovar cholerae and biovar eltor have their own entries, A00.0 and A00.1 respectively. You can see the whole arrangement of chapters/families at the Wikipedia entry for ICD-10. There are also extensions for areas like OB/GYN and cancer, which are not part of the base system but use the same algorithm for taxonomy entries.
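Pulling the codes and descriptions out of records like these is straightforward. The sketch below assumes, based only on the sample lines shown above, that the sixth semicolon-separated field holds the dotted code and the ninth holds the description. The other fields are ignored because their meaning isn't described here, and the names IcdEntry and parse_icd_line are my own.

```python
from typing import NamedTuple

class IcdEntry(NamedTuple):
    code: str   # dotted code, e.g. "A00.0"
    title: str  # human-readable description

def parse_icd_line(line):
    """Split one semicolon-delimited record into a code and a title.

    Field positions are inferred from the sample data, not from a spec.
    """
    fields = line.strip().split(";")
    return IcdEntry(code=fields[5], title=fields[8])

SAMPLE = """\
4;T;X;01;A00;A00.0;A00.0;A000;Cholera due to Vibrio cholerae 01, biovar cholerae;001;4-002;3-003;2-001;1-002
4;T;X;01;A00;A00.1;A00.1;A001;Cholera due to Vibrio cholerae 01, biovar eltor;001;4-002;3-003;2-001;1-002
4;T;X;01;A00;A00.9;A00.9;A009;Cholera, unspecified;001;4-002;3-003;2-001;1-002
"""

entries = [parse_icd_line(line) for line in SAMPLE.splitlines() if line.strip()]
for entry in entries:
    print(entry.code, "->", entry.title)
```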

The real problem comes in when it becomes understood that several countries (the US, France ("CIM-10"), Canada ("ICD-10-CA"), and others) have created their own versions of the standard. These use the same base codes, but with extensions. The system has also been translated into multiple languages.

In addition, the US has what it calls ICD-10-PCS, where "PCS" stands for "Procedure Coding System." This adds even more complexity to the problem of automating classification, since its coding system is totally different from other ICD-style solutions. Whereas the base standards use what is effectively a 1:1 match between disease and code, the PCS variant relies on a hierarchical 7-character coding system that is positional in nature. Each valid code is exactly 7 characters in length. Each character represents an axis of classification that further refines the exact procedure performed. The first character denotes the primary axis, where 0 = “medical and surgical,” 1 = “obstetrics,” and so forth. The “0” axis is by far the largest: of the 72,081 codes in ICD-10-PCS, 62,022 are in this section.
For an example, consider the following:
      <title>Abdominal aortic plexus</title>
      <use>Nerve, Abdominal Sympathetic</use>
In this case, a procedure involving the “Abdominal aortic plexus” has a <use> tag indicating it must be coded using the preferred (“master”) term “Nerve, Abdominal Sympathetic.” We then look to that section of the coding standard and find:
        <term level="2">
          <title>Abdominal Sympathetic</title>
In this case, the <codes> tag shows the 4-character base “015M” along with a list of 3 possible 5th characters. The proper code depends on the procedure performed. The final 2 letters are denoted as “ZZ”. Thus, depending on circumstances, a procedure involving the Abdominal aortic plexus may be coded as 015M0ZZ, 015M3ZZ, or 015M4ZZ. 
This type of structure means an automated coding solution needs to make multiple decisions in order to populate each axis (character) of the final code.
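The assembly step for the example above can be sketched in a few lines. This is illustrative only: the function name is hypothetical, and the hard part (deciding which 5th character applies to the procedure actually performed) is exactly what the sketch leaves out.

```python
def pcs_candidates(base, fifth_characters, suffix="ZZ"):
    """Combine a 4-character base, each possible 5th character, and a
    2-character suffix into complete 7-character candidate codes."""
    codes = [base + fifth + suffix for fifth in fifth_characters]
    for code in codes:
        # Every valid ICD-10-PCS code is exactly 7 characters long.
        if len(code) != 7:
            raise ValueError(f"not a 7-character code: {code!r}")
    return codes

print(pcs_candidates("015M", ["0", "3", "4"]))
# ['015M0ZZ', '015M3ZZ', '015M4ZZ']
```

An automated coder would still have to pick one candidate from that list based on what the document says about the procedure.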
All this said, the US still does not use ICD-10. Currently, it uses ICD-9, which is a much older standard with far fewer codes. Version 9 had in the neighborhood of 6,000 codes, while version 10 (US edition) includes 68,000. Dictionaries exist to allow providers to migrate existing records from ICD-9 to ICD-10. Some automated coding solutions for ICD-9 exist, but none have yet been marketed (as of this blog entry, at least) for ICD-10. The complexity is fairly significant.
The problems in terms of automated classification involve either the creation of one or more rule sets that will successfully classify a given document based on identified words and phrases, or locating sufficient documents to train a machine learning classifier to differentiate among thousands of categories. Given that most classifiers need 20 or more pre-identified sample documents per proposed category, and that the taxonomy contains thousands of categories (codes)...well, just do the math.
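For anyone who'd rather not do the math by hand, here it is using the figures quoted above (roughly 6,000 codes for ICD-9, 68,000 for the US edition of ICD-10, and a floor of 20 samples per category). The function name is just for illustration.

```python
MIN_SAMPLES_PER_CATEGORY = 20

# Approximate code counts quoted in this post.
CATEGORY_COUNTS = {
    "ICD-9": 6_000,
    "ICD-10 (US edition)": 68_000,
}

def training_documents_needed(categories, samples_per_category=MIN_SAMPLES_PER_CATEGORY):
    """Minimum number of pre-identified documents for a full training set."""
    return categories * samples_per_category

for name, count in CATEGORY_COUNTS.items():
    print(f"{name}: {training_documents_needed(count):,} documents")
# ICD-9: 120,000 documents
# ICD-10 (US edition): 1,360,000 documents
```

That's well over a million correctly hand-coded documents as a bare minimum, before any tuning.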

Other Coding Systems

In addition to ICD-10, providers must also deal with (at least) RxNorm -- a pharmaceutical coding and normalization system, as well as SNOMED -- an ontology for normalizing clinical terms in order to standardize recording. Both systems are designed to make analytics more accurate, as well as to render shared records more readable by other providers. Eliminating local usages in favor of standardized terms is also necessary for successful implementation of national electronic medical records systems that (if ever adopted) will help eliminate records duplication and other "Balkanization of data" issues that presently plague the medical system.
All these systems are useful, but they're all also difficult to implement programmatically due to the complexities involved as well as the need for a high degree of accuracy. Hand coding of medical records already consumes vast resources within health care systems, and is plagued by errors in patient diagnostic or other records. A solution will emerge, but not without a lot of work on the part of developers and architects.

Monday, December 3, 2012

The Case for Normalization

One of the issues surrounding the topic of content analytics involves the handling of synonyms and preferred terms in unstructured content. For the uninitiated, a preferred term is a word or phrase established by an organization as being desirable in its official communications. As an example, someone writing colloquially about monetary policy might refer to "the Fed", but official communications would always use the term "United States Federal Reserve" or some other formal form.

Preferred terms are common, and are used in a number of contexts. The preferred term for "Big Blue" could be either IBM or "International Business Machines", depending on context and who the author is. Corporations often use preferred terms as a means of protecting intellectual property, of projecting a common message to the public, or of avoiding ambiguity in legal documents. But how do they relate to content classification and content analytics?

Normalizing for Analytics

Think of this use case: you're interested in mining your existing content for references to specific projects, products, or phrases. As a simple example, you want to find occurrences of  "brake master cylinder" in a series of Microsoft Office documents archived as part of your company's ongoing service offering. You run a query against this term, either via a simple keyword/phrase match (e.g. "grep") or using a complex tool like IBM Content Analytics. The resulting set of matches seems suspiciously low, so you decide to read through some of the content set under analysis. The end result is that you find a mishmash of usages that actually match the desired "brake master cylinder" phrase, but many use colloquial phrases such as "master cylinder" (no "brake"), "brake MC" (acronym rather than full term) and so forth. This means you'd need to follow one of several possible courses of action.
  1. Look for synonyms during your query. This means you might need to construct a complex query term, like "brake master cylinder" OR "brake MC" OR "master cylinder" (in SQL-speak) in order to capture all the results that are interesting in this context. This can become both unwieldy and annoying over time, and could be inaccurate since some unusual phrasings could be missed. It also means you need to maintain a list of valid synonyms and ensure that anyone running individual queries is informed of the current list options.
  2. Execute multiple queries and concatenate the results. In this case, you'd query for the preferred term, then for each synonym individually, adding up the results as they're found. This is both annoying and inaccurate since errors can (and will) occur during the process. If your results are being graphed, it also means the possible difficulty of adjusting the graph by hand.
  3. Normalizing content during queries. If you're aware of the preferred term and list of synonyms, it will be possible to write rules that can rewrite synonyms into preferred terms while the content is being queried. IBM Content Classification's Decision Plan (DP) capability is ideal for this purpose, since both rules and external scripts can be called during DP processing. In this instance, "brake MC" could be rewritten into the document as "brake master cylinder," thus matching the base term without any need for additional user input or awareness. In this case, the actual content in the repository is left unaltered.
  4. Normalizing during ingestion. Depending on the content archive or repository in use, rules or scripts could be used to actually alter the content during ingestion, i.e. when a new document is added to the repository. This may or may not be desirable, since requirements might exist mandating preservation of the original text and phrasing for later use.
  5. Normalizing during creation. Obviously another solution is to instruct users in correct use of the organization's list of preferred terms. This means content will be created with these terms in place, meaning no complex rules or post processing is required. This is far easier in theory than practice, however, and does not account for informal documents such as email messages for which use of preferred terms is not mandated.
There is no single answer or solution to this problem, but the use of synonym resolution during the query phase is common practice for search engines and aggregation use cases. Indeed, IBM Content Analytics offers synonym expansion using dictionaries in order to help with this problem.
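As a rough illustration of the first option (synonyms resolved at query time), the sketch below keeps the synonym list in one place and builds the OR-style query from it. The dictionary, function names, and plain substring matching are assumptions for the example; this is not how IBM Content Analytics implements its dictionaries.

```python
# Hypothetical synonym dictionary: preferred term -> known colloquial forms.
PREFERRED_TERMS = {
    "brake master cylinder": ["brake MC", "master cylinder"],
}

def phrases_for(preferred):
    """The preferred term followed by all of its registered synonyms."""
    return [preferred] + PREFERRED_TERMS.get(preferred, [])

def build_or_query(preferred):
    """Build a quoted OR query covering the preferred term and its synonyms."""
    return " OR ".join(f'"{phrase}"' for phrase in phrases_for(preferred))

def matches(preferred, document):
    """Case-insensitive check: does the document use any form of the term?"""
    text = document.lower()
    return any(phrase.lower() in text for phrase in phrases_for(preferred))

print(build_or_query("brake master cylinder"))
# "brake master cylinder" OR "brake MC" OR "master cylinder"
```

Keeping the list in a single shared dictionary at least addresses the maintenance complaint: users query for the preferred term and never need to know the current synonyms.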

Classification and Normalization

In the case of Content Classification, it may be desirable to implement such capability programmatically using rules or external scripts in order to more correctly classify documents  containing known synonyms. This capability can also be used to perform rewriting of documents during the ingestion phase (i.e. permanent alteration of the content) if this is desired. This option may be hazardous, since the rules used to perform such tasks need to be sufficiently sensitive to avoid transforming parts of a document into nonsense. For example, a company would not want to transform all instances of "brake" into "brake master cylinder" since the word following "brake" might be "pad" or "fluid."
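Here is one hedged sketch of what "sufficiently sensitive" rules might look like, written as plain Python regular expressions rather than actual Decision Plan rules. The rule list and function name are hypothetical. Each rule matches a whole colloquial phrase, never the bare word "brake", and the second rule skips text that already carries the preferred term.

```python
import re

PREFERRED = "brake master cylinder"

# Each rule matches a complete colloquial phrase, not a single word.
RULES = [
    # "brake MC" -> preferred term
    re.compile(r"\bbrake\s+MC\b", re.IGNORECASE),
    # "master cylinder" -> preferred term, unless "brake " already precedes it
    re.compile(r"(?<!brake )\bmaster\s+cylinder\b", re.IGNORECASE),
]

def normalize(text):
    """Rewrite known synonyms into the preferred term, leaving other text alone."""
    for pattern in RULES:
        text = pattern.sub(PREFERRED, text)
    return text

print(normalize("Replace the brake MC and the brake pad."))
# Replace the brake master cylinder and the brake pad.
print(normalize("The master cylinder leaks brake fluid."))
# The brake master cylinder leaks brake fluid.
```

Even this small example has gaps (it doesn't preserve capitalization, and the lookbehind only recognizes a single space after "brake"), which is a fair picture of why such rules need testing against real content before being allowed to alter documents permanently.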

This said, the use of external scripts and other normalization techniques should be considered as part of every classification strategy. Organizations generally have some sort of document standards available in place as part of their normal business strategy. These can be leveraged as a starting point for efforts to normalize terms in documents being classified. The end result will be better classification accuracy, better query capability, and improved analytics. And that's what we all want, isn't it?

Tuesday, October 23, 2012

Value Propositions and Uses

One of the unfortunate aspects of the term "classification" is that many people seem to see it as little more than filing documents. Filing is boring, so they don't see the value of the classification process aside from perhaps making a given document slightly easier to find. While this is useful, it's not necessarily enough to make an organization champ at the bit to implement automated classification.

The sad thing about this is that the value proposition for classification is really much larger overall than most folks perceive. Let's create a few examples to show how valuable the process really can be.

Example 1: your company has a habit of being a pack rat in terms of content. Users save all their old email. No one cleans out document repositories. You're maintaining file shares and other content that was pulled into the organization as part of a merger six years earlier. No one knows who uses any of this content. But, at the same time, you're paying for it. Maintaining the servers, disks, networking, facilities, power, and other components of the overall IT profile cost money, and you may not be deriving any real benefit from that expenditure.

In this scenario, it's entirely possible automated (or even semi-automated) classification could help you dispose of (or at least reorganize) a significant percentage of this content. How much money would this save in terms of hardware support over the course of even a year, if you found that half that hardware could be shut down and eliminated?

Example 2: your company has been hit with a lawsuit, and needs to run through a legal 'discovery' process in order to identify all documents and materials related to the suit. This might involve thousands, or even millions of documents located in multiple repositories. How much money could you save by automating the discovery process? How many errors could be eliminated, potentially saving fines or other negative consequences of having missed documents during the discovery process?

Example 3: you'd really like to make your content more easily accessible by moving it into a folder structure related to its general topic. Under your current storage model, everything sits in a file share, with a folder structure related to long-forgotten project names. You also have multiple file shares, and would like to consolidate everything into a single folder structure. IBM Content Classification, with its Classification Center for files functionality, could do this for you. The most difficult task in this scenario involves establishment of a company-wide taxonomy (list of categories and what each one means) that will be used as the final folder structure. Again, automated classification can save significant time and resources, while offering consistency and reliability far above that achievable by most personnel.

Example 4: there's a need to parse through millions of documents in order to redact or otherwise alter a specific element that may or may not be present (let's say a US-style Social Security Number). An IBM Content Classification Decision Plan could be constructed, with one of its tasks involving searches for a specific regular expression (in this case \d\d\d-\d\d-\d\d\d\d). If this pattern is located, the document could be altered, with the characters composing the expression translated to all "X"s in order to eliminate the personally sensitive information contained in the document.
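Outside of a Decision Plan, the same idea can be sketched in a few lines of Python using the pattern quoted above. Two details here are my own assumptions: the word boundaries added around the pattern (so longer digit runs aren't partially matched) and the choice to keep the hyphens while replacing only the digits.

```python
import re

# The pattern from the text, with word boundaries added as an assumption.
SSN_PATTERN = re.compile(r"\b\d\d\d-\d\d-\d\d\d\d\b")

def redact_ssns(text):
    """Replace every digit of each SSN-shaped match with 'X'."""
    return SSN_PATTERN.sub(lambda match: re.sub(r"\d", "X", match.group()), text)

print(redact_ssns("Employee SSN 123-45-6789 is on file."))
# Employee SSN XXX-XX-XXXX is on file.
```

A pattern this simple will also redact anything else that happens to be formatted like an SSN, so a real deployment would want to review matches before altering documents in bulk.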

Example 5: a requirement exists to find customer phone numbers within documents, then extract them and make them into an additional metadata element to be consumed by an external database. IBM Content Classification rules can also handle this, by identifying the required pattern and extracting it to a separate metadata element. This is then returned to the caller for insertion into the database or content repository.
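A comparable sketch for extraction follows. It assumes US-style numbers written as three groups separated by hyphens or periods; the pattern, the function name, and the "phone_numbers" metadata key are all hypothetical stand-ins for whatever the calling system actually expects.

```python
import re

# Assumed format: NNN-NNN-NNNN or NNN.NNN.NNNN (US-style, no country code).
PHONE_PATTERN = re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")

def extract_phone_metadata(text):
    """Return the phone numbers found in the text as a metadata element.

    The document text itself is not modified.
    """
    return {"phone_numbers": PHONE_PATTERN.findall(text)}

document = "Customer called from 555-123-4567; alternate is 555.987.6543."
print(extract_phone_metadata(document))
# {'phone_numbers': ['555-123-4567', '555.987.6543']}
```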

Example 6: as an ancillary result of ongoing automated classification, statistics about which categories are being discovered at any given time can be retained and exported to other tools. So, for example, a company could track incoming email or other documents to help spot trends in category frequency, potentially identifying the presence of a product defect or increase in customer complaints (presuming their categories allow for this).
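To give a feel for how little is needed to start tracking trends, here is a minimal sketch that tallies classification results per day and flags days where a category at least doubles over the previous day. The input format, function names, and the doubling threshold are assumptions chosen purely for illustration.

```python
from collections import Counter

def category_counts_by_day(results):
    """Tally (day, category) classification results into per-day counters."""
    counts = {}
    for day, category in results:
        counts.setdefault(day, Counter())[category] += 1
    return counts

def spikes(counts, category, factor=2.0):
    """Return the days on which a category grew by `factor` over the prior day."""
    days = sorted(counts)
    flagged = []
    for previous, current in zip(days, days[1:]):
        before = counts[previous][category]
        after = counts[current][category]
        if before > 0 and after >= factor * before:
            flagged.append(current)
    return flagged

# Hypothetical classifier output: (ISO date, assigned category).
results = [
    ("2012-10-01", "complaint"),
    ("2012-10-01", "billing"),
    ("2012-10-02", "complaint"),
    ("2012-10-02", "complaint"),
    ("2012-10-02", "complaint"),
]

print(spikes(category_counts_by_day(results), "complaint"))
# ['2012-10-02']
```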

That's right folks, it's not just filing. It's not boring, and it's extremely useful.

Monday, September 10, 2012

Taxonomy Design Pitfalls

In my experience, the biggest enemy of any content classification project involving machine learning software is the taxonomy creation phase. Writing a solid taxonomy requires a solid understanding of the content to be classified, specific intents and phrases that can be used to place a given document reliably in its proper category, and -- probably above all -- agreement among all (or most) involved parties regarding just what each category means.

This last point can't be emphasized strongly enough. I once worked on a classification project that was being managed by an internal I.T. group at the customer site. First off, they had no clear taxonomy in mind and asked for ideas. This was a sign of things to come, and taught me the first lesson of enterprise taxonomy projects: if the customer doesn't know their content, the project is doomed to failure. 

Persevering, I suggested a few generic taxonomies to give them an idea what I meant. They looked at one, said "this one sounds good," and started picking documents they thought fit the categories. I warned them that they shouldn't just choose based on document title or existing folder, and that we needed between thirty and fifty high-value sample documents (i.e. those containing a solid sample of intents and keywords related to the category) per taxonomy entry. What they gave me was over 2GB of documents, often hundreds per category (meaning they hadn't examined the content and had ignored instructions), to use as a training set for the IBM Content Classification software.

Following several months of training and tweaking, we finally had a working Knowledge Base that performed reasonably well with their content. Most categories returned correct results at or above a 75% accuracy rate, and the managing organization decided it was time to involve other departments by offering a demo of what we'd done so far. This is when I learned lesson two of enterprise taxonomy projects: all stakeholders must be involved in the taxonomy creation phase, because they might have differing views of each category's meaning.

During the demo, I displayed several of the training documents ("corpus") the managing organization supplied us, along with the category the group had placed them into. A minute later, one of the other department heads asked "why did you classify that document as 'Accounting and Finance'? It should be in 'Engineering.'"

Another guy said he was wondering the same thing, but thought it would be better classified as  "Operations." The conference call then descended into an hours-long free-for-all, with each group demanding to know who'd created the taxonomy in the first place and why they hadn't been consulted. No one could agree on even the most basic list of actual taxonomy terms, since they each had their own view into how the content was used. The end result was a six-month delay in the project while the customer organized a cross-disciplinary taxonomy team, which they should have done in the first place, to sort it all out.

The lessons to be learned from this fire drill are pretty clear.

First, get stakeholders involved early on, and make sure there's organizational agreement on the taxonomy's definitions and use cases before attempting a serious classification project. This is not to say that experimentation isn't useful, since playing with different classification models might tell an organization a lot about its data. Dedicate a few content managers or records personnel to investigation of possible taxonomies. Hire a consultant to help. Check into regulatory requirements that might impact naming conventions, document disposition, or other factors. Spend some time on investigation, and proceed based on the results. Good taxonomies aren't dreamed up overnight, and most that are won't perform very well.

Second (a corollary), don't expect your I.T. group to have the skills to create a taxonomy on its own. It may be true that this group will run the project, but employees who actually create and manage content for specific departments must be involved from the beginning. They know the data and how it needs to be used. Content managers (and even administrative assistants) are far more likely to be aware of how specific document types are to be "filed" than folks who manage the systems and networks.

In a future blog, we'll talk in more detail about procedures and processes that can help create successful, useful taxonomies. Stay tuned!

Thursday, August 16, 2012

Use Cases for Classification

One of the primary issues in today's cash-strapped data centers is how best to allocate ever-tightening IT budget requirements, while simultaneously optimizing service and showing a positive value proposition. As someone once said, "you want fries with that?"

In this environment, some folks might ask whether an application like content classification actually provides sufficient value to satisfy the inevitable question: "what's in it for me?" The answer may surprise even the most die-hard skeptic.

The main problem is that many people see classification as little more than an automated filing system that can reorganize files into a more logical structure. While this is true, and such systems can indeed organize files into a predefined hierarchy, such operations are really only a small part of the content classification value proposition.

Consider the following scenario: your organization has been maintaining a record of all email messages sent through your Exchange server for the last decade. At last count, you had 1 billion email messages in the archive. Yeah, really. Maybe you're proud of that fact, because it means you're able to track old conversations and find records if a regulatory agency or lawyer comes strolling through the door. But is that really true? There are two things you may not have thought of.
  • First, how much is all the useless data (which, to be honest, may be most of it) in that archive costing your company?
  • Second, in reality (regulators just walked in, and are demanding all files related to the 2002 FooCorp merger by close of business today!) how quickly could you lay your hands on the right content?
Let's say your 1 billion email messages average 10 KB in size. That works out to about 9,537 gigabytes (roughly 10 terabytes) of data. So you're paying good money for all the physical disks needed to store 10 terabytes. Times two, or maybe even three, if your Exchange database is replicated--and if it's an enterprise solution, it had better be. Maybe it's all in RAID 1 arrays, so multiply by 2 again. Let's also assert that 70% (just to keep the numbers even) of the archived messages are non-business content, like old invitations to company picnics or messages congratulating Mildred in accounting on the arrival of her first grandchild.
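The size arithmetic is easy to check. This sketch assumes binary units (1 GB = 1024 x 1024 KB), which is what produces the 9,537 figure; the function names are just for illustration.

```python
KB_PER_GB = 1024 ** 2  # binary units: 1 GB = 1,048,576 KB

def archive_size_gb(messages, average_kb):
    """Total archive size in gigabytes for one copy of the data."""
    return messages * average_kb / KB_PER_GB

def raw_storage_gb(messages, average_kb, replicas=1, mirrored=False):
    """Raw disk space once replication and RAID 1 mirroring are included."""
    mirror_factor = 2 if mirrored else 1
    return archive_size_gb(messages, average_kb) * replicas * mirror_factor

one_copy = archive_size_gb(1_000_000_000, 10)
print(round(one_copy))  # 9537

# Two replicas, each on RAID 1: four times the raw space of a single copy.
print(round(raw_storage_gb(1_000_000_000, 10, replicas=2, mirrored=True)))  # 38147
```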

Going further, we'll say each disk costs $200 since you're smart enough not to be using cheap consumer-grade hardware for your Enterprise arrays. So each copy of that 10 terabyte archive (given 10 1-terabyte drives) costs $2000. "Cheap," you say. "Big deal!" But also consider ancillary costs like...
  • Electrical supply
  • Associated hardware (redundant power supplies, cables, rack space)
  • Air conditioning & filtering
  • Physical security
  • Physical space
  • Employee management time
  • Backups
  • Offsite storage
  • DR planning
Thus, our duplicated 10 terabytes of array space ($4000 in disks, $8000 or more if RAID is involved) could cost many times that much per year in associated support costs. Let's say the overall cost of supporting this archive works out to $20,000/year. Why spend all that money maintaining a burgeoning archive of old email when you already know 70% of it is useless? If 7 terabytes of archive data could be eliminated using automated classification with an enterprise taxonomy, all that space becomes available for other (presumably more useful) content. Or the storage could be retired altogether, saving support and other costs for more productive projects.
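The dollar figures above can be reproduced with a few lines of arithmetic. One assumption in this sketch goes beyond the text: the savings estimate treats support cost as scaling linearly with the amount of data kept, which is a simplification since some costs are fixed no matter how much you store.

```python
DISK_PRICE = 200             # dollars per 1-terabyte enterprise drive
DISKS_PER_COPY = 10          # drives needed for one 10-terabyte copy
ANNUAL_SUPPORT_COST = 20_000 # dollars per year, as estimated above
USELESS_PERCENT = 70         # share of the archive that is non-business content

def disk_cost(copies, mirrored=False):
    """Purchase cost of the drives for the given number of archive copies."""
    mirror_factor = 2 if mirrored else 1
    return DISK_PRICE * DISKS_PER_COPY * copies * mirror_factor

def potential_annual_savings():
    """Support cost attributable to useless content, assuming linear scaling."""
    return ANNUAL_SUPPORT_COST * USELESS_PERCENT // 100

print(disk_cost(1))                 # 2000
print(disk_cost(2))                 # 4000
print(disk_cost(2, mirrored=True))  # 8000
print(potential_annual_savings())   # 14000
```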

As a side benefit, eliminating that 70% also means that searching and utilizing the remainder becomes faster and more efficient. If you'd like to run reports against that email archive, why crawl a preponderance of data that will just contaminate your analytics efforts?

Classification pays. It can free up storage currently allocated to useless content, saving hardware and support costs while simultaneously streamlining access to the remainder. The value proposition is clear, especially considering today's accelerating accumulation of content. That 10 terabytes of data might double in the next few years. Where does it stop? Why spend good money storing content you'll never need?

Friday, July 27, 2012

Talking Taxonomies

When most people hear the word taxonomy, they probably have a flashback to the high school biology class where they were forced to memorize the taxonomic hierarchy used to classify plants, animals, and other natural objects. This Linnaean classification system (named after Carl Linnaeus, the Swedish scientist who created it) uses a multilevel taxonomy of domain, kingdom, phylum, and so forth to determine where, for instance, a domestic cat should be classified.

The Linnaean system is very useful in certain contexts, but is far too heavily structured for business use. When classifying documents, we generally need only a one- or two-level hierarchy in order to place a given item in the correct category. Business taxonomies can also be a bit more fluid, with categories changing over time as requirements shift. Categories may come and go, while more formal taxonomies are rigid and strict.

So, what does a business taxonomy look like? The answer really depends on the organization and what its needs are. At a basic level, a generic taxonomy might look something like this:
  • Accounting and Finance
  • Operations
  • Human Resources
  • Engineering
  • Legal and Regulatory
  • Executive
But this is just a starting point, and your business' needs might be totally different. A medical office, for example, would need to divide their documents in a totally different manner based on patient type, visit type, specialty, or other criteria. Or they might need both a business (as above) and a medical taxonomy in order to fulfill the organization's needs.

Reality based

Taxonomies for use with automated document classification software need to be based on the characteristics of documents that fall into the proposed categories. Document classification software must be able to examine a sample set of pre-categorized documents for each taxonomy entry in order to identify key concepts that will allow it to then correctly identify subsequent documents. This is the basic idea behind machine learning, which is the fundamental methodology behind automated content classification.

You can't, for instance, simply establish a taxonomy and expect the software to know what is meant by "accounting and finance" or "oncology" without giving it a context. I use the imagery of training a factory worker to understand the difference between boxes, bottles, and cans in order to separate them into the correct containers as they come down the assembly line. If you hire a worker who (improbably) has never seen a bottle before, they won't be very adept at sorting them from the other two categories.

Likewise, basic instructions about what a "can" is (maybe something like "6 inches tall, made of metal, flat top and bottom, and contains liquid") might not work in all cases. What about objects that fit 3 of 4 criteria (maybe they're only 3 inches tall)? How should the worker handle this case? Managing this sort of ambiguous situation is one of the reasons that taxonomy training in automated classification is generally an iterative process; you can't just do it once and expect perfection.
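The "3 of 4 criteria" problem is easy to see in code. The sketch below scores an object against the four hand-written "can" rules instead of demanding all of them. The field names, the half-inch height tolerance, and the function name are assumptions for the example, and a real statistical classifier learns its criteria from samples rather than from rules like these.

```python
# Four hand-written criteria for a "can"; field names are hypothetical.
CAN_RULES = {
    "about 6 inches tall": lambda item: abs(item["height_in"] - 6) <= 0.5,
    "made of metal": lambda item: item["material"] == "metal",
    "flat top and bottom": lambda item: item["flat_ends"],
    "contains liquid": lambda item: item["contains_liquid"],
}

def can_score(item):
    """Fraction of the 'can' criteria that the item satisfies (0.0 to 1.0)."""
    satisfied = sum(1 for rule in CAN_RULES.values() if rule(item))
    return satisfied / len(CAN_RULES)

standard_can = {"height_in": 6, "material": "metal",
                "flat_ends": True, "contains_liquid": True}
short_can = {"height_in": 3, "material": "metal",
             "flat_ends": True, "contains_liquid": True}

print(can_score(standard_can))  # 1.0
print(can_score(short_can))     # 0.75
```

A strict all-or-nothing rule would reject the short can outright. A score at least exposes the ambiguity, leaving someone to decide (and later revise) where the cutoff belongs.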

For that matter, what happens if the worker achieves a high success rate but someone changes the characteristics or type of objects that can potentially appear on the line? For example, the company now starts selling "juice boxes"; this means our worker will now see containers that have box-like characteristics (made of cardboard, are rectangular, etc.) but are also bottle-like (contain liquid and 6" tall). How should they handle this ambiguity?

This is one of the reasons business taxonomies are so fluid. New products appear, others vanish, and styles change. Consider the problem of classifying automobiles if you have sedans, coupes, trucks, station wagons (remember those?) and vans as your taxonomy categories. How do you handle the appearance of the SUV, or more recently the "crossover" vehicle?

Overall, taxonomies are relatively simple but require actual forethought and ongoing maintenance by subject matter experts who understand the overall direction of the company and what the ramifications of correct or incorrect classification of a given document might be. We'll talk about them in more depth in subsequent blog entries.