By Heather Hedden

July 2021

Starting a Taxonomy

A taxonomy can provide many benefits related to organizing information and making it easier to find, but working out how to create one can be an obstacle to its implementation. A good taxonomy cannot be fully generated automatically, and, in most cases, a pre-built taxonomy is not suitable. Manual effort is needed to design and build an appropriate taxonomy.

A taxonomy is a kind of controlled vocabulary. A controlled vocabulary is a set of terms (words or phrases) that represent unique concepts, things, or ideas. The terms serve as tags when they are applied to content items (documents, web pages, database records, images, or other multimedia files). A taxonomy is a specific kind of controlled vocabulary that has structure. The structure can be either a hierarchical tree or a set of facets, which organize terms by attribute type to serve as search filters or refinements. Automatic methods can generate lists of terms, but they cannot create the logical structure.
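To make the distinction between a flat term list and a structured taxonomy concrete, here is a minimal sketch of a hierarchical taxonomy in Python. The `Term` class, its field names, and the sample terms are all assumptions made for illustration; they do not come from any particular taxonomy tool or standard.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Term:
    """One concept in a controlled vocabulary (hypothetical structure)."""

    label: str                                           # preferred label
    synonyms: list[str] = field(default_factory=list)    # alternative labels
    narrower: list[Term] = field(default_factory=list)   # child terms

    def add(self, child: Term) -> Term:
        """Attach a narrower (child) term and return it for chaining."""
        self.narrower.append(child)
        return child

    def walk(self, depth: int = 0):
        """Yield (depth, term) pairs for this term and everything below it."""
        yield depth, self
        for child in self.narrower:
            yield from child.walk(depth + 1)


def find(root: Term, text: str) -> Term | None:
    """Look up a term by preferred label or synonym, ignoring case."""
    wanted = text.lower()
    for _depth, term in root.walk():
        names = [term.label] + term.synonyms
        if wanted in (name.lower() for name in names):
            return term
    return None


# Illustrative terms only.
root = Term("Products")
software = root.add(Term("Software", synonyms=["Applications", "Apps"]))
software.add(Term("Databases"))
root.add(Term("Hardware"))

# Print the hierarchy as an indented tree.
for depth, term in root.walk():
    print("  " * depth + term.label)
```

The synonyms are what allow a search string such as "apps" to resolve to the single concept "Software", while the `narrower` links carry the hierarchy that a generated term list would lack.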

There are many uses and benefits of taxonomies, including:

  • Consistent tagging or indexing
  • Topic/category browsing
  • Improved search results by matching search strings to concepts
  • Discovery of relevant information that was not directly searched for
  • Filtering or sorting results by taxonomy terms
  • Content workflow management
  • Consistent metadata for identification, comparison, analysis
  • Curated content in feeds or info boxes
  • Automatic linking of relevant topics for personalization or recommendation
  • Enhancing knowledge graphs for better data analysis

Each implementation of a taxonomy is unique and involves a specific set of content and a specific set of users. Thus, the taxonomy should be built to suit the content and the users. A pre-built taxonomy would not be ideal, as it may lack needed terms, have too many detailed terms that are not needed, or not use the users' preferred terminology. Building a taxonomy involves steps that analyze the content and steps that consider user needs.

Content analysis

The first step in building a taxonomy is to determine the scope of content that will be tagged and retrieved with the taxonomy. For externally published content this may be obvious, but for internal content it may not be so clear. Some content may be stored in database applications, such as customer relationship management systems and human resources management systems, which should be searched separately and not integrated with other systems, because they hold confidential information and have different user privileges. Once the scope is determined, the different types of content and different subject areas of the content should be identified.

The next step is to analyze a representative sample of content of each of the different types and subject areas of content that will be tagged and retrieved, to identify topics and named entities relevant to the content. This form of content analysis is similar to indexing without a controlled vocabulary.  The taxonomist assumes the role of an indexer or someone tagging the content and notes what index terms or tags would best describe the content. This is done for a significant sample of content of each type and each subject area. The actual number of content items will depend on the number of types and subject areas, the total volume of content, and whether automatic term extraction will also be used.

Automatic term extraction involves using text analytics software to extract candidate taxonomy terms based on their frequency and relevance within a body of text content. This step should be done after an initial taxonomy has been drafted that includes some structure and synonyms for terms. This way, the text analytics can correctly supplement the existing taxonomy, extending it with both new concepts and synonyms for existing concepts, rather than suggesting terms that may duplicate each other in meaning.
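The idea of extracting candidates against a draft taxonomy can be sketched in a few lines of Python. This is a deliberately crude stand-in for text analytics software: it counts one- and two-word phrases by raw frequency only and does not measure relevance. The draft taxonomy, stopword list, sample texts, and function name are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical draft taxonomy: preferred label -> known synonyms.
draft_taxonomy = {
    "machine learning": ["ml"],
    "knowledge graph": [],
}

# A tiny illustrative stopword list; real tools use much larger ones.
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to", "with"}


def candidate_terms(texts, taxonomy, min_count=2):
    """Return frequent one- and two-word phrases not already in the taxonomy."""
    # Everything the draft taxonomy already covers, labels and synonyms alike.
    known = set(taxonomy)
    for synonyms in taxonomy.values():
        known.update(synonyms)

    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        bigrams = [f"{a} {b}" for a, b in zip(words, words[1:])]
        for phrase in words + bigrams:
            parts = phrase.split()
            # Skip phrases that start or end with a stopword.
            if parts[0] in STOPWORDS or parts[-1] in STOPWORDS:
                continue
            counts[phrase] += 1

    return [
        (phrase, n)
        for phrase, n in counts.most_common()
        if n >= min_count and phrase not in known
    ]


# Illustrative sample content.
docs = [
    "Text analytics supports machine learning and ML pipelines.",
    "Text analytics can enrich a knowledge graph.",
]

for phrase, n in candidate_terms(docs, draft_taxonomy):
    print(n, phrase)
```

Because the existing labels and synonyms are filtered out, "machine learning" and "ml" are never proposed again, which is the benefit of drafting the taxonomy before running extraction. Deciding whether a surviving candidate is a new concept or a synonym for an existing one remains a manual judgment.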

User input

It’s important to gather suggestions from users when creating a taxonomy in order to customize the taxonomy to user needs. There are different kinds of taxonomy users, though, and all of them should be considered. Taxonomy users include those who try to find content and those who upload or publish content and either manually tag content or edit auto-tagging. There are other people who are not direct users of the taxonomy but may have input based on their roles as user experience experts, taxonomy project managers, subject matter experts, or people who deal with external users in customer relations roles.

The primary method of gaining user input on a taxonomy is through interviews and questionnaires, ideally both in combination, where a conversation follows up on a list of questions sent to the person being interviewed. It’s important to ask different kinds of questions tailored to the different kinds of users: questions dealing with tagging vs. questions dealing with retrieval of content. The input gathered from users in these interviews and questionnaires can be used to better design the taxonomy and its user interface, to obtain use cases for later testing of the taxonomy, to identify possible facets for a faceted taxonomy, and also to collect some concepts for the taxonomy.

Another method of obtaining input from users is through a brainstorming session. This method is particularly useful for internal enterprise taxonomies. Representative users from different departments can contribute their ideas by suggesting sample terms, which are written down on a whiteboard, flipchart, or sticky notes. Then, working with a facilitator, the brainstorming group can remove outliers, bring together synonyms and similar terms, and come up with categories or facets to group the terms.

Obtaining user input for taxonomy terms can also be more direct, especially in cases where there are subject matter experts for certain sections of the taxonomy. The subject matter experts can be asked to provide a list of suggested terms in their subject area, which then can be reviewed, edited, and incorporated into the larger taxonomy.

Finally, an indirect way of obtaining user input is to look at the search logs, which show what users have been typing into the search box. Search log reports can be sorted by search string frequency, so that the most frequently used search strings are considered for inclusion in the taxonomy. The search strings should be edited to conform to taxonomy style and policy, but the exact search strings should be included as synonyms/alternative labels to support future searches.
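The search log step can be sketched as follows: count the search strings, rank them by frequency, and keep each raw string as an alternative label of the edited preferred term. The log entries, the `preferred` mapping (which stands in for the taxonomist's editorial decisions), and the function names are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical search log: one raw query string per search.
search_log = [
    "HR policies", "hr policies", "expense report", "Expense Reports",
    "hr policies", "vacation", "expense report",
]


def top_search_strings(log, limit=10):
    """Rank search strings by frequency, ignoring case and outer whitespace."""
    counts = Counter(q.strip().lower() for q in log if q.strip())
    return counts.most_common(limit)


# Hypothetical editorial decisions: raw search string -> preferred label
# that conforms to the taxonomy's style and policy.
preferred = {
    "hr policies": "Human resources policies",
    "expense report": "Expense reports",
    "expense reports": "Expense reports",
}


def build_alt_labels(top, mapping):
    """Group raw search strings under their preferred label as alt labels."""
    alt_labels = {}
    for query, _count in top:
        label = mapping.get(query)
        if label is None:
            continue  # not yet reviewed by the taxonomist
        variants = alt_labels.setdefault(label, [])
        if query != label.lower():
            variants.append(query)
    return alt_labels


top = top_search_strings(search_log)
for query, count in top:
    print(count, query)

print(build_alt_labels(top, preferred))
```

Strings without a mapping, such as "vacation" here, are simply left out until a taxonomist decides whether they belong in the taxonomy.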

Taxonomy building

A taxonomy is built by a combination of top-down and bottom-up approaches. Top-down refers to developing the top-level terms or facets first and then adding more detailed terms. Bottom-up refers to first identifying the specific terms to include and then developing categories by grouping those terms. Top-down methods tend to rely more on user input, whereas bottom-up methods tend to rely more on content analysis. The two methods depend on each other, so they can be done as overlapping tasks rather than consecutively.
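One way to picture how the two approaches meet is the sketch below. It places specific terms gathered bottom-up under top-level categories drafted top-down, and sets aside the terms whose proposed category does not exist yet. The category names, terms, and the `merge` function are hypothetical and only illustrate the interplay; they are not a prescribed procedure.

```python
# Top-down: hypothetical top-level categories drafted from user input.
top_level = ["People", "Places", "Topics"]

# Bottom-up: hypothetical specific terms from content analysis, each with
# the category a taxonomist proposed while grouping similar terms.
proposed = [
    ("Berlin", "Places"),
    ("Taxonomists", "People"),
    ("Invoices", "Documents"),
    ("Paris", "Places"),
]


def merge(categories, term_proposals):
    """Place terms under existing categories; collect the ones that do not fit."""
    tree = {category: [] for category in categories}
    unplaced = {}
    for term, category in term_proposals:
        if category in tree:
            tree[category].append(term)
        else:
            # A bottom-up grouping with no top-level home suggests the
            # top-down draft may need a new category or facet.
            unplaced.setdefault(category, []).append(term)
    return tree, unplaced


tree, unplaced = merge(top_level, proposed)
print(tree)
print(unplaced)
```

An empty category (here "Topics") and an unplaced grouping (here "Documents") are both signals to revisit the draft, which is why the two approaches work better as overlapping tasks.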