Select Page

Akeneo – A Heuristic and Statistical Approach to Product Attribution using Legacy Data

December 21, 2021

There’s this romantic notion that, data modeling, used as an Akeneo term, is something that needs to be done manually. Quite the contrary, I argue it should be done in an automated fashion using a set of rules and a small set of statistics based on the data that will be managed within the Akeneo PIM. If you’ve decided you need to use a product information manager (PIM), then you’re already managing a large enough set of data, that you can use that data to help you decide on how to attribute that data in the PIM.

Statistics

Two statistics about the nature of the our legacy data adds a great deal of value in manual or automated decision making:

  • A histogram of each data point (datum)
  • The maximum number of the grouped values from a histogram of a datum

A histogram is simply a count of number of times a each value occurs when grouped by the datum value itself. For example, if I have the color values:

Red, Blue, Red, Orange, Green, Green, Green, Blue, Red, Orange, Black, White, Blue, Red

A histogram of those values, sorted by the number of times each value occurs is:

4 Red
3 Green
3 Blue
2 Orange
1 White
1 Black

What can we infer from this histogram?

  • All are character values
  • This data set seems to have a fixed number (domain) of values: 6
  • This data could probably be used in a Simple Select or Multi Select

In case you’re interested, I have two open-source Node applications:

that I use to examine: create histograms from SQL Server or MySQL. Ones for Oracle and PostgreSQL are in the works. They’re free to use if you are so inclined.

With our statistics ready, we can start to make manual or automated decisions about the nature of our data. Let’s first examine the attribute types supported by Akeneo PIM.

Attribute (Data) Types

Akeneo PIM allows you to store:

  • Boolean Values
  • Characters
  • Dates
  • Multimedia
  • Numbers

Now let’s talk about each in detail, in the context of an available histogram for the underlying data.

Boolean Values

Boolean values are false or true. Sometimes specified as an integer value of 0 or 1, or character values like no or yes. In the PIM you can decide to store these values in a Simple Select or a Yes/No attribute. A Simple Select can have option values like 0 and 1, no and yes. Whereas a Yes/No attribute actually stores a boolean false or true, and the slider button displayed reflects the values appropriately.

If you choose to use a Simple Select, you can manage more that just the two states, false and true. With the Yes/No attribute, you cannot.

Which ever you choose, use it consistently, so enrichers and data consumers can expect it represented consistently.

Characters

Character values, or text, in Akeneo terms, contain a string of one or more characters. You can have:

  • a Multi Select attribute, which has a code, label, and option values
  • a Simple Select attribute, which has a code, label, and option values
  • Text attribute
  • Textarea attribute

A Multi Select is an HTML Select list that supports the selection of zero, one, or more values. Using this attribute type is one of the three ways you can support cardinality, one to many, relationships with information in the PIM.

A Simple Select is an HTML Select list that supports the selection of zero, or one value.

A Text is an HTML Text field that supports a limited number of characters. I say limited, because the underlying technology only supports 255 characters (actually bytes), and only about the first 60 characters display on the screen. So I would only use Text, or free form text as I like to call it, for characters values, with an unlimited domain of values, that are 60 or fewer characters in length.

A Text Area is an HTML Textarea that supports a larger, but still limited number of characters. The user interface for Akeneo PIM will only allow you to specify a maximum of 65,535 characters (actually bytes). You can choose to make it Plain or Rich Text. With Rich Text supporting a subset of HTML formatting tags.

If you use a Text Area, and you also intend to use the CSV format to export data, keep in mind that you must prevent, or programmatically remove afterward, any CSV: field or line separator characters in the data, for example: line-feeds (\n), otherwise it will corrupt your CSV files.

Another issue to consider, is if you export to a Microsoft Excel format. That format only allows up to 32,767 characters per cell. So you may want to limit your text are values to that length.

Without the use of line-feeds, it’s hard to format plain text. Fortunately, you can add line breaks in rich text with a line break <br> or a paragraph <p> tag. So I would consider using rich text areas over plain text areas.

Dates

A Date is a HTML Text field where the underlying date, stored as characters in the ISO-8601 date format, is displayed according to the enricher’s locale settings. It does not support a time component.

Multimedia

Media support comes in two forms.

  • Community Edition: File and Image attributes
  • Enterprise Edition: adds Asset Collections

You can use a File, or Image attribute type, with the only major difference between the two being additional support for image preview with supported image types. However, these are stored partly in the underlying filesystem and database of the PIM. If you need better media management, then use the Enterprise Edition’s Asset Management subsystem.

The Asset Management subsystem is to multi-media what the PIM is to data. It is a well implemented basic content management system (CMS) where you can store any kind of multi-media: data, images, PDFs, videos, etc., and use rules to automate transformations of images for supported image types. In addition, if you have an external CMS, you can use the URL data type to preview images from the CMS and connect them to products using an Assets Collection attribute.

The Assets Collection attribute is like a multi select for media, but much better.

Numbers

Support for numeric data types comes in four forms:

  • Decimal
  • Integer
  • Measurement
  • Price

A Number attribute can be a Decimal or Integer value. A decimal or integer value has a maximum value of 9,223,372,036,854,775,807.

A Measurement (metric) is the combination of a numeric value paired with a unit of measure family, and a default unit of measure. If you’re going to maintain data for more than one language or geographic location, i.e. locale, and you have measurements to store, you should use this attribute type.

A Price (monetary value) is the combination of a numeric value paired with a currency, like a metric for money. Once again, if you’re going to maintain data for more than one locale, you should use this attribute.

Reference Entities

And, last, but not least, we have the generic do it all of cardinality, the Reference Entity. A Reference Entity allows you to create a separate entity in the PIM, with it’s own set of attributes, that can then be connected to a given product using a simple or multi select. This is only available in the Enterprise Edition.

Now that we have discussed the statistics we need to make decisions about data modeling, and reviewed the attribute types supported by Akeneo PIM, let’s decide upon a set of rules.

Rules

Boolean Values

Do we want to consistently store our binary decisions, false or true data, as two-state values? If yes, then we should use a Yes/No attribute, where we covert any legacy data in the form of false, true, or 0, 1, or no, yes, etc, to a boolean false or true value in the PIM, which is visualized as a Yes/No slider control.

If no, then consistently storing that data as two, or a list of optional values, using a Simple Select attribute is a better choice. So we can implement one of those two rules.

Characters

Do we want to have higher quality data, based on a fixed domain of option values? Then we examine the histogram for the datum in question. If it has a fixed domain of values, and that fixed domain’s size is less than a threshold value of 100, 250, or 1000 values, then we can map that character data to a Simple Select’s options. If the threshold is set at 100, we don’t have to specify a minimum length for autocompletion, because the user interface, the application, will perform nicely. However, if the threshold is 250, we should probably specify a minimum length for autocompletion of 1, and 2 for 1000. Somewhere beyond 1000 options, the performance of the user interface degrades, along with a human’s ability to choose between so many options. At that point, we should use a “free form” Text attribute instead.

If the datum in question has an infinite domain of values, and they are all under 255 characters in length (I would recommend 60), we can use a Text attribute.

If our datum is longer than 255, but less that 65,535 characters, we can use a text area.

If it’s longer that that, we can move the characters into a file, and use a File attribute.

If we intend on using a spreadsheet for exports, we had better limit the maximum number of characters in a text area to 32,767 characters.

If we intend on using CSV for exports, we had better programmatically prevent any line-feeds in our Text or Text Areas.

If we need to support choosing more than one value, then a Simple Select should become a Multi Select.

Dates

Date values are straight forward, they are converted to an ISO-8601 format when sent to the PIM. If that is not sufficient, you’ll need to use a text attribute, and enforce the format using a Validation rule in the form of a regular expression.

Multimedia

Are we going to use the Community Edition? Then our images and other multimedia will have to be stored in separate File or Image attributes in our products. If we use the Enterprise Edition, we can use its asset management subsystem to create different asset entities to hold different types of media, and then link them to our products using Asset Collections. Or, we can use the File or Image attributes as in the Community Edition.

Numbers

Using our histograms, it’s quite easy to distinguish between decimal (they have a decimal separator) and integer (they don’t have a decimal separator) values. But if the domain of values is finite, then these values may be a better candidate for a Simple Select attribute.

If we have units of measure (UOM), and we are going to use the internationalization (I18N) capabilities of the PIM, we should use Measurement (metric) attributes. Otherwise, a Number attribute that denotes the unit of measure in the attribute’s label is fine.

Like measurements, if we have monetary values and I18N, we should use a Price attribute. Otherwise, a Number attribute that denotes the currency in the attribute’s label is appropriate.

Implementation

During a PIM implementation, we can use these rules to determine how to map our legacy data into the PIM, either manually or programmatically (i.e. automatically).

If our source is a relational database management system (RDBMS), we can use the INFORMATION_SCHEMA.COLUMNS table to advise us on the data types as they exist in the RDBMS.

If the source is a JSON document, we can use the data types of the source data.

I use the same set of histograms from the data to programmatically create:

  • Attributes, including the attribute code and label, and options if the attribute is a Simple or Multi Select
  • Families of attributes
  • Categories
  • Products

By using statistics, in the form of histograms, and heuristics (rules), you can implement the use of Akeneo PIM in a consistent manner. Why use automation?

Let’s say you have 500 data fields, and 100,000 products. A somewhat average implementation from my experience. To manually examine the data for one piece of data (datum), it may take:

  • 30 minutes to scan the raw data to guess its data type
  • 30 minutes to sort it to see if it has a fixed domain, and how many values are in that domain
  • 10 minutes to go into the PIM and create the attribute
  • 20 minutes to go into the PIM and create a family and add the attribute to the family
  • And then there are categories to add…

So even without importing any data into the PIM, you roughly spend 90 minutes per attribute times 500 attributes, which is about 20 weeks of work setting up your PIM. That’s ridiculous.

What if you have more than one source of data? Add even more time…

On the other hand, you can automate the process programmatically for one data source, in as little as 4 – 6 weeks.

Now you know why I argue that mapping data into a PIM, or Data Modeling as Akeneo calls it, is better done by a computer following a set of rules decided on by humans, than humans trying to follow a set of rules.

In case you’re interested, I have an open-source Node framework for data migration and integration against Akeneo PIM’s Web API: node-akeneo. I use it for all my work on Akeneo PIM. It’s free to use, if you choose. I improve it after each project, as I learn more from doing more.

You May Also Like…