Master Thesis Topic

Similarity Analysis of Product Customization Artefacts

Many companies developing software increasingly have to devote resources to customizing their products for specific customer needs. Since software product customizations (SPCs) are more frequent and involve more limited changes than typical releases of the software, there is frequent overlap both among SPCs and between SPCs and the normal evolution of the software.

This thesis aims to analyse and test how text similarity measures, such as the Normalized Compression Distance (NCD) and Normalized Google Distance (NGD), can be used to help detect overlap between SPC artefacts. Potentially this could decrease costs considerably by detecting overlap in early phases of SPC handling. The main focus will be on artefacts related to the request itself and the requirements, but later development artefacts can also be interesting targets.
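To make the NGD concrete, here is a minimal sketch in Python. The NGD is defined from occurrence counts: f(x) and f(y) are the numbers of indexed documents containing terms x and y, f(x, y) is the number containing both, and N is the total number of documents. Then NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))). The helper ngd_in_corpus is an assumption made for illustration: it takes the counts from a local collection of artefact texts rather than from a web search engine, and it uses naive whitespace tokenization.

```python
import math


def ngd(fx: int, fy: int, fxy: int, n: int) -> float:
    """Normalized Google Distance computed from occurrence counts.

    fx, fy: number of documents containing term x and term y
    fxy:    number of documents containing both terms
    n:      total number of documents
    """
    if fx <= 0 or fy <= 0:
        raise ValueError("both terms must occur at least once")
    if n <= min(fx, fy):
        raise ValueError("n must exceed the smaller of the two term counts")
    if fxy == 0:
        return math.inf  # the terms never co-occur
    log_fx, log_fy = math.log(fx), math.log(fy)
    numerator = max(log_fx, log_fy) - math.log(fxy)
    denominator = math.log(n) - min(log_fx, log_fy)
    return numerator / denominator


def ngd_in_corpus(term_x: str, term_y: str, documents: list[str]) -> float:
    """Hypothetical helper: take the counts from a local document collection.

    Assumption: lowercased whitespace tokens stand in for "terms"; a real
    study would need proper tokenization and a much larger corpus.
    """
    token_sets = [set(doc.lower().split()) for doc in documents]
    fx = sum(term_x in tokens for tokens in token_sets)
    fy = sum(term_y in tokens for tokens in token_sets)
    fxy = sum(term_x in tokens and term_y in tokens for tokens in token_sets)
    return ngd(fx, fy, fxy, len(token_sets))
```

A value near 0 means the two terms almost always occur together; larger values mean they rarely co-occur.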

The NCD (and related measures) have theoretically pleasing properties, being in a certain sense universally best similarity metrics, and have also shown a lot of promise in practice.
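As an illustration of how the NCD can be computed in practice, here is a minimal sketch in Python. It uses NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length of its argument. The choice of zlib as the compressor, the helper rank_pairs, and the three request texts are assumptions made for this example only; they are not taken from the partner company or from any prescribed method.

```python
import zlib
from itertools import combinations


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input, standing in for C(x)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


def rank_pairs(artefacts: dict[str, str]) -> list[tuple[float, str, str]]:
    """Return all artefact pairs, most similar (lowest NCD) first."""
    return sorted(
        (ncd(artefacts[a], artefacts[b]), a, b)
        for a, b in combinations(artefacts, 2)
    )


# Hypothetical request texts, invented for illustration only.
requests = {
    "SPC-1": "Customer wants the invoice export to support CSV with semicolon separators.",
    "SPC-2": "Customer requests CSV invoice export using semicolon as the separator.",
    "SPC-3": "Add single sign-on login via the customer's corporate identity provider.",
}

for distance, a, b in rank_pairs(requests):
    print(f"{a} vs {b}: NCD = {distance:.3f}")
```

Values close to 0 suggest strong overlap and values close to 1 suggest little shared content. With texts this short, the compressor's fixed overhead distorts the scores, so real artefacts would probably need longer inputs or a different compressor.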


The thesis project will involve

  1. studying and summarizing existing uses of text similarity metrics (TSMs), with a special focus on the NCD and NGD,
  2. describing and analysing the different artefacts involved in SPC handling at a partner company,
  3. testing, through experiments, how TSMs can be used to detect overlap and similarity between SPCs at the company.


Students interested in this topic should preferably have knowledge/experience/interest in:

  1. software engineering (some),
  2. data/text mining (merit).

Links / Input