by Steve Hoberman

November 2008

The Power of Abstraction

I remember meandering through the large galleries of a modern art museum, and stopping in my tracks in front of a very large canvas painted completely red with a single white dot in the center. The title was “City Skyline”, and as hard as a squinted or twisted my head, I just couldn’t see anything but a large red canvas with a white dot in the center. This was an abstract painting of a detailed city skyline yet was so generic and vague it could have been a painting of a tiger, a summer beach scene, or anything else for that matter. Abstraction is a tool that lets the artist efficiently capture and represent complex topics. As data modelers, we are all artists to an extent, and abstraction is an efficient tool at our disposal as well.

The data modeler is responsible for not only correctly representing the requirements of an application, but also having the foresight to design flexible structures in areas where requirements might change in the near future. There are certain designs we create such as for the data warehouse, where flexibility can make the difference between success and failure of the entire project. The data warehouse is a living design, one that is constantly being expected to easily accommodate new requirements, yet at the same time minimize design changes that impact existing downstream data marts. A very good design tactic I have used over the years to accommodate this need for flexibility is abstraction.

What is Abstraction?

Abstraction is a method to bring flexibility to your logical and physical data models by redefining and combining some of the data elements, entities, and relationships within the model into more generic terms. Abstraction is the removal of details in such a way as to make applicable to a wide class of situations, while preserving the important properties and essential nature from concepts or subjects. By removing these details, we remove differences and therefore change the way we view these concepts or subjects, including seeing similarities that were not apparent or even existent before. For example, instead of having separate Customer and Employee entities, there might be situations where it would be valuable to have the generic Person entity, which groups both customers and employees together.

Let’s go through a more detailed example. Say you are creating a data model for the owner of a tuxedo shop. One of the main entities on your design is Customer. The Customer entity on your data model includes a number of data elements including the following:

  • Customer waist size
  • Customer leg length
  • Customer neck size
  • Customer arm length
  • Customer shoulder width

The owner of the tuxedo shop is very happy with your design and his application is developed based on this model. Two months later you receive a phone call from the owner of the shop, asking where he can store the customer’s shoe size. You explain to him there is no place to put the shoe size, as that was not part of the original requirements. There is a long silence. The owner of the shop explains that he is expanding his business and many of his customers who rent tuxedos will also require shoes. You explain how much money and time it would take to make this change to his application. Again there is a long silence. And then you hear a few clicks as the owner hangs up the phone in disgust.

To paint a brighter picture, let’s say you realize that all of the data elements around leg, waist, neck, arm and shoulder measurements are really just a person’s apparel measurements. Instead of listing each measurement as a separate data element, you abstract this concept in your data model to just two data elements:

  • Customer Measurement Number
  • Customer Measurement Type Name

The Customer Measurement Number data element includes any size value such as:

  • 42
  • 17 1/2
  • 34

And each of these values has a type, captured in Customer Measurement Type Name. Such as:

  • 42 = Arm length
  • 17 1/2 = Neck size
  • 34 = Leg length

A Customer can have one or more of these measurements. Then when we get the phone call from the tuxedo shop owner requesting where shoe size should be stored, we simply add a new row in the Measurement Type table called “Shoe size” and the owner can now enter his shoe size values with this shoe size type. This flexibility in our design has led to effortlessly supporting this new requirement.

Benefits Of Abstraction

Abstraction offers the following benefits:
Provides greater flexibility. Flexibility is the primary benefit of abstraction. We are able to design our structures to better handle the unknown. New situations can arise that we did not originally anticipate, and if we use abstraction these situations may require little if any changes to our data model and therefore keep the database and code intact.

Reduces analysis and design time. Once we are familiar with abstraction, we can apply it to areas of our model before completing a thorough analysis of the requirements. In the example mentioned previously, instead of taking the time to capture all of the additional types of measurements for a tuxedo fitting, we can simply add the measurement number and type data elements and have all possible measurements automatically covered within your model.

Costs Of Abstraction

Abstraction might sound like the cure for everything from the common cold to world peace (slight exaggeration). But its benefits come with a price:
Hides business concepts on the model. When we just have Measurement Number and Measurement Type Name in our data model, we can no longer see the individual tuxedo fitting measurements. Customer Arm Length is no longer a data element we can clearly identify and explain, but a value within the Measurement Number data element. When we abstract we lose the actual business data element name. The data model however, still contains the same information. These measurements become row values instead of individual data elements. Six months after we create the tuxedo data model with abstraction, someone might ask us, “Hey, what does Measurement Type mean?” What sounded like a great idea when we were doing the data model, six months later might no longer make sense. So there is a task we need to do even more rigorously than before. We need to create very good definitions.

Leads to indulgence. Too much abstraction can give us a design totally meaningless in understanding and impossible to develop. Let’s say we continue to abstract the tuxedo data model. Eventually, we will wind up with a single entity called “Thing” with a recursive relationship, as shown in Figure 1. Imagine the power of this entity! We can represent anything with this entity. In fact, we can represent an entire Customer Relationship Management (CRM) application using this single entity! However imagine explaining this model to someone in our business area. Imagine trying to write the code that uses this entity. Now the “Thing” concept is a bit of an exaggeration, but I have seen an item data model where abstraction was taken to the point where it was difficult to find anything business related on the model. It was also very difficult to walk a business user through this model.

Figure 1 – The “Thing” abstraction

Lengthens development time. Although we can decrease analysis and design time using abstraction, we increase development time. It takes longer to write code and logic against row values than against columns. For example, it would be more complex and take longer to write the code to populate or select information out of the Measurement Number and Measurement Type Name data elements than it would be if we had the specific measurement data elements at our disposal. We would need to have more complicated query statements.

Increases reporting difficulty and complexity. There are some reporting tools on the market today that have a very difficult if not impossible time extracting data from abstract structures. There have been several cases where I introduced abstraction into the logical and physical data models, and then had to reverse this decision because the reporting tool, and the query language in general, had difficulty retrieving information from these structures.