On rank-ordering very complex datasets

The main idea: the concept of the efficient frontier can be generalized to allow the rank-ordering of extremely complex datasets on a large set of mutually contradicting criteria.

0. Posts on this blog are ranked in decreasing order of how much I like them. This entry was originally posted on 24.01.2022, and the current version may have been updated several times from its original form.

 

1.1 Say you have a list of things you’d like to compare with regard to a number of criteria. Obviously, if the things rank in the same order on every one of these criteria, comparisons are easy. But what do you do when they rank differently on different criteria?

1.2 To make this less esoteric, let’s take a simple example: I wish to purchase a used vehicle, and the only relevant considerations are its price (the lower the better), its production year (the more recent the better) and its mileage (the lower the better). I have this narrowed down to six candidates.

1.3 Some are cheaper but have more miles and/or are older, so the three criteria do not all rank the six options in the same order. Which do you pick? You could run a multiple regression predicting price from mileage and year, and you’d get a pretty good fit, explaining 98% of the price variation! You could then see which vehicles are under-priced relative to the market. In this example, the Honda is under-priced by a whopping $1,874, so a steal, eh?

1.4 What if you have a hundred variables though? Starts becoming less enticing.

1.5 And what about data transformation? Maybe I should have regressed the log of price? In general, regression implies linearity unless you go out of your way to try different assumptions manually. A hundred variables I tell ye, what about then!

1.6 Another, more esoteric issue. Suppose that I include the chassis type, with dummy variables for hatchback, sedan and SUV. I assume that if I run the regression I am going to get an increasing impact on price in that order. What if I prefer a sedan though, and am indifferent between a hatch and an SUV? How can I code that requirement into the regression? Maybe some SUV turns out to be under-priced enough that I have to hold my nose and get it, but how do I get the system to work this out for me?

1.6 Here’s an alternative, inspired by financial theory. You start by finding your efficient frontier, that is, the vehicles that are dominated by no other. One vehicle dominates another if it is at least as good on every criterion and strictly better on at least one. In the example, the Honda dominates the Mazda, as it is cheaper and has fewer miles, with the same build year. Mark the undominated vehicles in green.
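The dominance check is easy to sketch in Python. The original table of six candidates is not reproduced here, so every price, year and mileage below is invented for illustration. Only the makes Honda, Mazda and Mitsubishi come from the text, and Car D, Car E and Car F are hypothetical stand-ins. Requiring a strict improvement on at least one criterion is an assumption, made so that identical vehicles do not dominate each other.

```python
# Hypothetical data: (price in $, build year, mileage). All numbers are invented.
cars = {
    "Mitsubishi": (9000, 2016, 60000),
    "Honda": (8500, 2015, 70000),
    "Mazda": (9500, 2015, 80000),
    "Car D": (9200, 2016, 65000),
    "Car E": (7000, 2012, 120000),
    "Car F": (12000, 2019, 30000),
}


def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere.

    Both arguments are tuples in which a higher value is always better.
    """
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


# Flip the sign of price and mileage so that higher is better on every criterion.
scores = {name: (-price, year, -miles) for name, (price, year, miles) in cars.items()}

# The efficient frontier: vehicles that no other vehicle dominates.
frontier = [
    name
    for name in scores
    if not any(dominates(scores[other], scores[name]) for other in scores)
]

print(frontier)  # ['Mitsubishi', 'Honda', 'Car E', 'Car F']
```

Note that only comparisons are used, never arithmetic on the values themselves, which is why ordinal criteria are enough.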

1.7 I’d be a fool to look at any vehicle but these four. OK, now count how many vehicles dominate each option.

1.8 Next, note how many vehicles each option dominates. 

1.9 Now the ranking becomes rather obvious. The Mitsubishi dominates two options but is itself dominated by none. In this simple example, as no other option matches this feat, the Mitsubishi must be the best option.
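The two counts from 1.7 and 1.8 can be sketched as follows. The data are again invented (the original table is not reproduced here), chosen so that the relations stated in the text hold: the Mitsubishi dominates two options and is dominated by none, and the Honda dominates only the Mazda. Car D, Car E and Car F and the helper name `dominance_counts` are hypothetical.

```python
# Hypothetical data: (price in $, build year, mileage). All numbers are invented.
cars = {
    "Mitsubishi": (9000, 2016, 60000),
    "Honda": (8500, 2015, 70000),
    "Mazda": (9500, 2015, 80000),
    "Car D": (9200, 2016, 65000),
    "Car E": (7000, 2012, 120000),
    "Car F": (12000, 2019, 30000),
}


def dominates(a, b):
    # At least as good on every criterion, strictly better on at least one.
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def dominance_counts(scores):
    """Map each name to (number it dominates, number it is dominated by)."""
    counts = {}
    for name, s in scores.items():
        beats = sum(dominates(s, t) for t in scores.values())
        beaten_by = sum(dominates(t, s) for t in scores.values())
        counts[name] = (beats, beaten_by)
    return counts


# Flip the sign of price and mileage so that higher is always better.
scores = {name: (-price, year, -miles) for name, (price, year, miles) in cars.items()}
counts = dominance_counts(scores)

for name, (beats, beaten_by) in counts.items():
    print(f"{name:<10} dominates {beats}, dominated by {beaten_by}")
```

The pairwise comparison is quadratic in the number of rows, which is fine for six cars but is the obvious cost to watch on large datasets.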

1.10 Our regression in 1.3 indicates the Honda as the steal of the pack, but the Honda only dominates one option (the Mazda), which the Mitsubishi also dominates, plus another. So, the regression was wrong and the Mitsubishi is better than the Honda, despite neither dominating the other.

1.11 Indeed, we can rank-order all alternatives as below (note how three vehicles are equal-ranked at tier 3, there being no information to make a distinction between them).


1.12 Now, this is a simple example to make the point, but often the rankings implied by the two dominance counts are not quite the same. In such cases, you run the same dominance count on the dominance counts themselves, forgetting all about the original dataset! That is, one option now dominates another if it dominates at least as many options and is dominated by no more options than the other. Look, that is a lot of domination in one sentence, but in a nutshell: ignore the original dataset and focus only on how many options dominate each option (this is the new, bad criterion) and how many options it dominates (this is the new, good criterion).

1.13 Heck, sometimes even these second-level counts do not imply the same ranking across the dominates and dominated-by counts. So, of course, you iterate again. And again. And again. Every time you iterate, you come closer to the ultimate count. After some iterations, you will end up with the iterated count being exactly the same as the count on which the iteration was run, and here you are done: you have extracted all information from the dataset.
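Here is a minimal sketch of the iteration in 1.12 and 1.13, under stated assumptions. The six data rows are invented (the original table is not reproduced here), the names `dominance_counts`, `iterate_counts`, `tiers` and the `max_rounds` guard are hypothetical, and reading the final tiers off the sorted count pairs is one possible interpretation rather than a confirmed part of the method.

```python
# Hypothetical data: (price in $, build year, mileage). All numbers are invented.
cars = {
    "Mitsubishi": (9000, 2016, 60000),
    "Honda": (8500, 2015, 70000),
    "Mazda": (9500, 2015, 80000),
    "Car D": (9200, 2016, 65000),
    "Car E": (7000, 2012, 120000),
    "Car F": (12000, 2019, 30000),
}


def dominates(a, b):
    # At least as good on every criterion, strictly better on at least one.
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def dominance_counts(scores):
    """Map each name to (number it dominates, number it is dominated by)."""
    return {
        name: (
            sum(dominates(s, t) for t in scores.values()),
            sum(dominates(t, s) for t in scores.values()),
        )
        for name, s in scores.items()
    }


def iterate_counts(scores, max_rounds=1000):
    """Re-run the count on its own output until it stops changing."""
    counts = dominance_counts(scores)
    for _ in range(max_rounds):
        # New criteria: dominating many is good, being dominated by many is bad.
        derived = {name: (beats, -beaten_by) for name, (beats, beaten_by) in counts.items()}
        new_counts = dominance_counts(derived)
        if new_counts == counts:
            return counts
        counts = new_counts
    raise RuntimeError("no fixed point reached within max_rounds")


def tiers(counts):
    """Tier 1 is best; options with identical counts share a tier."""
    keys = sorted({(-beats, beaten_by) for beats, beaten_by in counts.values()})
    return {name: keys.index((-b, d)) + 1 for name, (b, d) in counts.items()}


scores = {name: (-price, year, -miles) for name, (price, year, miles) in cars.items()}
final_counts = iterate_counts(scores)
ranking = tiers(final_counts)
print(ranking)
```

On this invented data the counts stop changing after one re-count, giving the Mitsubishi tier 1, the Honda tier 2, three vehicles tied at tier 3 and the Mazda tier 4. The `max_rounds` guard is there only because convergence is an empirical observation, not something this sketch proves.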

1.14 I have never seen a dataset that did not converge like this, even if some have taken days to do so. Those had 300K lines of data ranked on a dozen criteria though, many of them contradictory.

1.15 So there you have it: a dirt simple ranking algorithm that can handle nearly limitless data complexity, requires only ordinal criteria, and can be applied to gigantic datasets. Potential uses abound, but AI and sorting by multiple criteria spring to mind, at the high and low ends of the optimism spectrum respectively.
