HEMA Ratings Part 1: Tiered Tournaments and Unrated Fencers

I have worked on this article on and off over the past half year or so. I started it right after my club’s tournament, Revolution Rumble, because I was fascinated by the variance in skill within a tier made up almost entirely of unrated fencers, and by how that shook out in terms of HEMA rating. This led to a bit of a deep dive into how HEMA Ratings works, what it’s good for, and what it isn’t. Because of that, I have decided to make this the first of a three-part series about HEMA Ratings. The second part will be about why HEMA Ratings is not the best system for a world ranking, and the third will be about how the pandemic affected the ratings. I’m sure I will talk more about HEMA Ratings beyond this, but that’s all I have planned in the immediate future.

HEMA Ratings Series:

  • Part 1: Tiered Tournaments and Unrated Fencers
  • Part 2: World Ranking
  • Part 3: Covid Recovery

This past weekend, my club Bucks Historical Longsword held our first big public tournament, which was also the first major tournament in the Philadelphia area in several years. It went very well, and it left me with some things to think about. The closest major tournament in the interim had been King’s Cup in Washington, DC (about a 3-hour drive away), and the second closest Boar’s Tooth in Boston (about 6 hours away). If you didn’t go to either of those, you had to be willing to drive at least 8 hours to Ohio, or longer down south or to other locations. Some of us were willing to do that (not me, though I did go to a few; only 4 since Covid), but many were not. I don’t blame them; if I had started the sport in the last three years, I would not be willing to travel distances like that just to compete. As a result, there were a massive number of unrated fencers in our tournament.

Our longsword tournament was divided into three tiers: A, B, and C. Tier C was generally for HEMA ratings of 1000 and below, B for 1000-1400, and A for 1400+, though some exceptions were made in each tier due to personal requests and/or reputation, because HEMA Ratings is not perfect. If a fencer did not have a rating, they were placed in tier C by default. We made an exception for 5 unrated fencers whom we knew had competed in tournaments that either were not submitted to HEMA Ratings or had not yet been posted by the time of our event. Still, that left us with a massive number of unrated fencers in tier C: out of 43 fencers in the tier, only 5 were rated. The thing is, being unrated does not mean someone is any less good at fencing; it just means they are an unknown quantity. Especially for this tournament, the first in the region in many years, we could have fencers who have been practicing for a long time but have simply never been in a rated tournament before. I know for a fact that is the case with several fencers from my club.

Separate tier for unrated

This got me thinking: maybe it’s best to separate unrated fencers into their own division. Instead of Tier A, Tier B, and Tier C, it would become Rated A, Rated B, and Unrated. In this way, the unrated tournament could become a sorting mechanism for future tournaments. So how would it shake out in terms of HEMA ratings if you did that? We’ll find out about Revolution Rumble (our tournament) when its ratings are added, but we already have a good case study: Socal Swordfight 2023, whose Tier D had 62 fencers, all* of them unrated. Assuming 1400+ is tier A, 1000-1400 is tier B, and below 1000 is tier C, then of the 62 unrated Socal Tier D fencers, 4 (6.5%) ended up with ratings that would put them in tier A, 27 (43.5%) in tier B, and 30 (48.4%) in tier C.

Since I started writing this, the ratings for Revolution Rumble have also come out. Of the 43 participants, 5 (11.6%) now qualify for tier A, 17 (39.5%) for tier B, and 20 (46.5%) for tier C. Revolution Rumble ended up with higher ratings overall than Socal: Socal’s mean rating was 1050 and its median 1011.6, while Revolution Rumble’s mean was 1077.9 and its median 1058.9. The difference comes down to a value in the Glicko-2 algorithm used by HEMA Ratings called “rating deviation,” which led me down a bit of a rabbit hole.

*Three of them turned out to have pre-existing ratings**: they had been to tournaments whose results had not yet been uploaded by the time of Socal. For the purposes of this article, I’ll say close enough.

**Since writing this, more data from tournaments that happened before Revolution Rumble and Socal Swordfight has been uploaded to HEMA Ratings, so if you go on HEMA Ratings now and look at the time periods in question, the numbers will differ from what I analyzed. You will have to take my word for it that this is what it looked like when I wrote it.

What is Rating Deviation?

If you look at any weapon rating page on HEMA Ratings, you will see a colored thermometer on the right side: green, yellow, red, or gray*. If you mouse over it, you will see the rating deviation. Rating deviation (henceforth “RD”) is basically an indicator of how sure the algorithm is that your rating is accurate; this is also referred to as “confidence.” A high RD means the algorithm is unsure; a low RD means it is confident. The “weighted rating” we see as our HEMA rating is actually a base rating minus twice the RD. Every fencer enters the rating system with a base rating of 1500 and an RD of 350, which comes out to a weighted rating of 1500 – (2 x 350) = 800. Once your first match is logged, your RD will never again reach 350.
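
To make the arithmetic concrete, here is the weighted-rating formula as code (a sketch in Python; `weighted_rating` is my own name, not anything from the HEMA Ratings site):

```python
def weighted_rating(base_rating, rd):
    """The rating HEMA Ratings displays: base rating minus twice the RD."""
    return base_rating - 2 * rd

# A brand-new fencer enters at base 1500 with RD 350.
fresh = weighted_rating(1500, 350)
print(fresh)  # 800
```

Note that two fencers with the same base rating can show very different weighted ratings purely because of their RDs; that asymmetry drives most of what follows.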

Going back to the thermometers, a red thermometer indicates an RD above 200, yellow 100-200, and green below 100. In general, the threshold for a yellow thermometer is about 5 bouts, ±2, so not a huge hurdle. At Socal, the pools were smaller, so many fencers ended up with 4 matches or fewer. At Revolution Rumble, pool sizes were a minimum of 6, with everyone advancing to the elimination bracket, which gave everyone at least 6 bouts, barring bouts dropped due to injury. Because of this, many Socal Tier D fencers ended the tournament with RDs above 200 (i.e. a red thermometer), while everyone at Revolution Rumble ended with RDs below 200. As a result, the average weighted rating of Revolution Rumble participants was higher, even though some participants ended with more losses than Socal participants.
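
The color bands translate into a simple lookup (my own sketch; how the site handles the exact boundaries at 100 and 200, and the gray cutoff described in the footnote below, is a guess):

```python
def thermometer(rd, months_since_last_data=0):
    """Thermometer color for a fencer's RD, per the bands described above."""
    if months_since_last_data >= 24:  # no new data in the past 2 years
        return "gray"
    if rd > 200:
        return "red"
    if rd >= 100:
        return "yellow"
    return "green"

print(thermometer(227))  # red: a typical Socal Tier D fencer with few bouts
print(thermometer(156))  # yellow: a typical Revolution Rumble participant
```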

*Rating decay: confidence also decreases over time, because the more time has passed since your latest data, the less sure the algorithm is of where you belong. This corresponds to a slow decay of weighted rating over time. A gray thermometer indicates someone who has not had any data added to their rating in the past 2 years; the gray does not specify how high or low the RD actually is. The RD tends to increase by about 0.5-0.8 points per month, which is why you see weighted ratings decay by about 1-1.6 points in any month a fencer adds no new data. The higher the RD gets, the less it will grow, because it can never reach 350. So if you somehow end up at the top of the ratings while also having a high RD, your rating will decay very slowly, and you will be there for a long time. Your base rating on HEMA Ratings never changes due to inactivity; only your RD, and therefore your weighted rating, does.
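
This decay behavior can be sketched with Glicko’s standard inactivity rule, in which RD drifts back toward its ceiling with the square root of elapsed time. The constant `c` below is my own guess, tuned to the roughly 0.5-0.8 points per month described above; HEMA Ratings does not publish its actual parameter:

```python
import math

def decayed_rd(rd, months, c=14.5):
    """Glicko-style inactivity rule: RD grows toward the 350 ceiling over time.
    c is an assumed constant, not HEMA Ratings' actual parameter."""
    return min(math.sqrt(rd ** 2 + (c ** 2) * months), 350)

# One month of inactivity adds less to an already-high RD:
growth_at_100 = decayed_rd(100, 1) - 100   # roughly one point
growth_at_300 = decayed_rd(300, 1) - 300   # roughly a third of a point
```

The square root is what makes high RDs grow (and therefore weighted ratings decay) more slowly, which matches the observation that a high-RD fencer at the top of the ratings will sit there for a long time.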

Why tiers are good for your rating and HEMA ratings in general, even if you end up losing

  1. Low rated fencers should fence other fencers who already have some matches, not unrated fencers

When I tried to find a pattern for how many matches it takes to get a yellow and a green thermometer, I found that while getting below 200 is fairly straightforward (about 5 matches), there was no simple pattern for the green threshold. Some fencers with fewer than 20 matches have green thermometers (RD under 100), while some with up to 70 matches still have yellow ones (RD 100-200). Why does this happen? The lower your RD goes, the more it depends not just on how many matches you fence, but on who you fence. Imagine a scenario:

  • Fencer A: Weighted Rating = 1000, RD = 150 (medium confidence)
  • Fencer B: Unrated, Weighted Rating = 800, RD = 350 (no confidence)

Fencer B beats Fencer A. What happens to the rating?

The algorithm knows that Fencer A reasonably belongs where they are, but it has no idea where Fencer B belongs. As a result, it assumes the unrated fencer was rated incorrectly and adds a lot to their rating, while assuming the 1000-rated fencer is still mostly correct and decreasing their rating only a small amount.

Rating Change: 

  • Fencer A: small decrease
  • Fencer B: large increase

For RD, the algorithm learns that Fencer B should be rated above Fencer A, but it doesn’t learn much about Fencer A’s rating because it was not confident about where Fencer B belonged in the first place.

Rating Deviation Change:

  • Fencer A: small decrease
  • Fencer B: large decrease
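
HEMA Ratings runs Glicko-2, whose full update is fairly involved, but the older and simpler Glicko-1 single-match update shows the same asymmetry. Below is my own sketch, not HEMA Ratings’ actual code, with Fencer A’s base rating taken as 1300 so that their weighted rating is 1000 at RD 150:

```python
import math

Q = math.log(10) / 400  # Glicko-1 scaling constant

def g(rd):
    """Dampening factor: results against high-RD opponents carry less weight."""
    return 1 / math.sqrt(1 + 3 * (Q ** 2) * (rd ** 2) / math.pi ** 2)

def glicko1_update(rating, rd, opp_rating, opp_rd, score):
    """One-match Glicko-1 update; score is 1 for a win, 0 for a loss."""
    e = 1 / (1 + 10 ** (-g(opp_rd) * (rating - opp_rating) / 400))
    d2 = 1 / ((Q ** 2) * (g(opp_rd) ** 2) * e * (1 - e))
    denom = 1 / rd ** 2 + 1 / d2
    new_rating = rating + (Q / denom) * g(opp_rd) * (score - e)
    new_rd = math.sqrt(1 / denom)
    return new_rating, new_rd

# Fencer A: base 1300, RD 150 (weighted 1000). Fencer B: fresh at 1500/350.
a_rating, a_rd = glicko1_update(1300, 150, 1500, 350, 0)  # A loses the upset
b_rating, b_rd = glicko1_update(1500, 350, 1300, 150, 1)  # B wins the upset
```

Running this, Fencer B gains roughly four times what Fencer A loses, and B’s RD drops by about 75 points while A’s barely moves, which is the pattern the lists above describe.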

At first glance this might seem like an argument for integrating unrated fencers into rated tiers: if there are a bunch of overpowered unrated fencers, fencing them at least won’t tank the rated fencers’ (already low) ratings. While this is true, there is another aspect: it won’t do much to improve the RD (i.e. the algorithm’s confidence that the rating is correct) of the rated fencer. If someone with a rating fences a bunch of people who didn’t have ratings, their confidence does not get much better, so the ratings as a whole remain less accurate, and the individual’s weighted rating also doesn’t get a chance to increase through a decreased RD (remember, weighted rating is base rating minus twice the RD, so this is significant).

On the flip side, the unrated fencer doesn’t benefit much from the better confidence value of the rated fencer, because a) they will still only walk away from the tournament with a handful of matches, and b) if it’s like Revolution Rumble where the vast majority are unrated, then most of their opponents will be unrated anyway, and the boost will be negligible. Remember that it’s a lot easier to go from low to medium than it is to go from medium to high. If you are unrated, every match will give a large decrease in rating deviation, even if it’s against another unrated fencer. 

Conclusion: Keeping unrated fencers in a separate tier from low-rated fencers is better for everyone’s accuracy. Low-rated fencers can build better confidence values, which increases both their ratings and the accuracy of the rating system as a whole.

  2. Matches with Similar Ratings are Good for Confidence

There is another aspect relating to how adding matches changes your confidence value, and that is the difference in weighted rating. New scenario:

  • Fencer C: Weighted Rating = 1700, RD = 150 (medium confidence) 
  • Fencer D: Weighted Rating = 1100, RD = 80 (high confidence)

The expected result is that Fencer C will win, and if that happens, we know the ratings will not shift much. In terms of confidence, what does this match tell the algorithm about how much each fencer belongs where they are? Well, it tells it that Fencer C can beat someone rated 1100, which they did.

Result of Fencer C beating Fencer D:

Weighted Rating:

  • Fencer C: slight increase
  • Fencer D: slight decrease

Rating Deviation:

  • Fencer C: slight decrease
  • Fencer D: slight decrease

What’s going on is that 1700 is much higher than 1100, so winning this match doesn’t say much about how Fencer C would stack up against someone rated 1600 or 1800, even though that’s what a rating of 1700 should indicate. Now imagine the 1700 player wins 30 matches in a row against 1100-rated fencers. Their rating will continue to increase, but the algorithm still doesn’t have much information about how well they would do against someone around their own rating, only that they’re better than 1100-rated players, so the RD will decrease very slowly. Because of this, it’s possible to stack a large back catalog of matches while still maintaining a rating deviation above 100.

Because of this, the algorithm will be more accurate if you fence people who are rated similarly to you. If our friend Fencer C fences someone who is rated 1500 with an RD of 80 and loses, that will result in a major negative swing for them, and not much gain for the 1500 player, because the algorithm had assumed Fencer C’s rating was too high but didn’t know any better, since they only ever played 1100s. If Fencer C fences someone rated 1900 and wins, the opposite will happen (and the algorithm will then potentially run into an issue trying to find a ceiling for their rating). Either way, it will get more confident, because now it sees examples of Fencer C fencing players around their assumed level.
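
Using the same Glicko-1 machinery as before (again my own sketch, not HEMA Ratings’ code, and using the weighted numbers from the scenario directly as ratings for simplicity), we can compare how much confidence a single win buys depending on the rating gap:

```python
import math

Q = math.log(10) / 400  # Glicko-1 scaling constant

def g(rd):
    """Dampening factor for the opponent's RD."""
    return 1 / math.sqrt(1 + 3 * (Q ** 2) * (rd ** 2) / math.pi ** 2)

def rd_after_win(rating, rd, opp_rating, opp_rd):
    """New RD after one win, per the Glicko-1 update (rating change ignored)."""
    e = 1 / (1 + 10 ** (-g(opp_rd) * (rating - opp_rating) / 400))
    d2 = 1 / ((Q ** 2) * (g(opp_rd) ** 2) * e * (1 - e))
    return math.sqrt(1 / (1 / rd ** 2 + 1 / d2))

# A fencer at 1700 with RD 150 beats a confident (RD 80) opponent who is...
rd_vs_1100 = rd_after_win(1700, 150, 1100, 80)  # ...600 points below them
rd_vs_1700 = rd_after_win(1700, 150, 1700, 80)  # ...at exactly their level
```

The lopsided win barely moves the RD (about 148), while the even match pulls it down much further (about 138): a near-certain result carries almost no information.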

Conclusion: Separating a tournament into tiers is a good thing for everyone’s ratings and the rating system as a whole, because more matches between fencers of closer rating will occur, which will improve everyone’s RD. 

What’s the point?

Okay, so I’ve said all this stuff about HEMA ratings and how I think tournaments should be set up in relation to it, but what’s the point of it all? Who cares about HEMA ratings? Ultimately, the point is not the math, but the human factor; we want to give people the most meaningful matches possible. By this I mean matches with people around their skill level, where there is a legitimate chance for either side to win or lose. 

In order to do this, we need accurate tiers. On a small scale, it’s easy enough to accomplish this by hand-selecting everybody by reputation, but if you’re hosting a large tournament with 100+ longsword fencers like Rev Rumble, or 200+ like Socal, there’s no way you can possibly know everybody well enough to make accurate tiers. You need some kind of measurement of relative skill level, and HEMA Ratings is the closest thing we have at this point in time. Therefore, I think it’s important to try to make your HEMA rating as close to an accurate reflection of your skill level as possible. It doesn’t have to be perfect, nothing is, but it’s a tool that we can use to improve the experience of fencers at events if we allow it to.

Appendix: Full Stats

Socal Swordfight 2023 Tier D:

Post-Event Rating | Wins | Losses | Pre-Event Rating | RD
1607 | 9 | 0 | unrated | 155
1480.3 | 7 | 1 | unrated | 162
1456.7 | 6 | 1 | unrated | 164
1421.6 | 7 | 2 | unrated | 152
1397.5 | 5 | 1 | 1247.8 | 120
1390.6 | 5 | 1 | unrated | 181
1313.4 | 4 | 1 | unrated | 183
1313.1 | 4 | 1 | unrated | 187
1264 | 4 | 2 | unrated | 181
1264 | 4 | 2 | unrated | 181
1250.4 | 3 | 1 | unrated | 208
1250.4 | 3 | 1 | unrated | 208
1250.4 | 3 | 1 | unrated | 208
1196.7 | 3 | 2 | unrated | 187
1185.1 | 3 | 2 | unrated | 193
1185.1 | 3 | 2 | unrated | 193
1185.1 | 3 | 2 | unrated | 193
1185.1 | 3 | 2 | unrated | 193
1185.1 | 3 | 2 | unrated | 193
1144.4 | 2 | 1 | unrated | 227
1144.4 | 2 | 1 | unrated | 227
1144.4 | 2 | 1 | unrated | 227
1102 | 2 | 2 | unrated | 201
1091.9 | 2 | 2 | unrated | 195
1082.9 | 2 | 1 | unrated | 208
1082.9 | 2 | 2 | unrated | 208
1082.9 | 2 | 2 | unrated | 208
1082.9 | 2 | 2 | unrated | 208
1082.9 | 2 | 2 | unrated | 208
1082.9 | 2 | 2 | unrated | 208
1011.6 | 1 | 3 | 989.3 | 172
994.8 | 1 | 1 | unrated | 240
993.2 | 1 | 1 | unrated | 253
993.2 | 1 | 1 | unrated | 253
975.9 | 0 | 2 | 1009.4 | 147
971.7 | 1 | 2 | unrated | 211
944.7 | 1 | 2 | unrated | 227
944.7 | 1 | 2 | unrated | 227
944.7 | 1 | 2 | unrated | 227
944.7 | 1 | 2 | unrated | 227
944.7 | 1 | 2 | unrated | 227
944.7 | 1 | 2 | unrated | 227
944.7 | 1 | 2 | unrated | 227
944.7 | 1 | 2 | unrated | 227
944.7 | 1 | 2 | unrated | 227
944 | 1 | 3 | unrated | 195
915.4 | 1 | 3 | unrated | 208
915.4 | 1 | 3 | unrated | 208
915.4 | 1 | 3 | unrated | 208
799.2 | 0 | 3 | unrated | 211
756.6 | 0 | 4 | unrated | 201
747.8 | 0 | 4 | unrated | 208
745.9 | 0 | 3 | unrated | 253
745.9 | 0 | 2 | unrated | 253
744.9 | 0 | 3 | unrated | 227
744.9 | 0 | 3 | unrated | 227
744.9 | 0 | 3 | unrated | 227
744.9 | 0 | 3 | unrated | 227
744.9 | 0 | 3 | unrated | 227
744.9 | 0 | 3 | unrated | 227
744.9 | 0 | 3 | unrated | 227

Revolution Rumble 2023 Tier C

Post-Event Rating | Wins | Losses | Pre-Event Rating | RD
1486.6 | 8 | 1 | unrated | 156
1480.3 | 7 | 1 | unrated | 162
1465.6 | 9 | 2 | unrated | 143
1458.1 | 8 | 2 | unrated | 148
1449.4 | 7 | 1 | unrated | 164
1410.8 | 7 | 1 | unrated | 168
1377.5 | 7 | 2 | unrated | 153
1326.9 | 5 | 2 | unrated | 171
1326.9 | 5 | 2 | unrated | 171
1301.7 | 6 | 2 | unrated | 163
1254.1 | 6 | 2 | unrated | 173
1209.7 | 4 | 3 | unrated | 167
1189.9 | 6 | 2 | 592.6 | 159
1174.2 | 4 | 3 | unrated | 173
1137.3 | 3 | 3 | unrated | 181
1130.4 | 4 | 3 | 985.6 | 135
1130 | 3 | 3 | unrated | 176
1116.5 | 4 | 3 | unrated | 177
1092.5 | 5 | 2 | 834.6 | 124
1082.5 | 4 | 2 | unrated | 186
1059.6 | 3 | 4 | unrated | 172
1058.9 | 3 | 4 | unrated | 173
1010.6 | 2 | 4 | unrated | 181
994.9 | 3 | 4 | unrated | 177
988.2 | 2 | 5 | unrated | 171
980.9 | 3 | 3 | unrated | 190
962.7 | 3 | 5 | 798.4 | 135
952.9 | 0 | 2 | 985.5 | 150
913.9 | 1 | 5 | 914.1 | 100
912 | 2 | 4 | unrated | 183
906.5 | 2 | 5 | unrated | 173
899.8 | 2 | 5 | unrated | 173
896.5 | 1 | 4 | unrated | 193
895.2 | 2 | 5 | unrated | 171
884 | 1 | 5 | unrated | 181
875.3 | 1 | 6 | unrated | 171
875.3 | 1 | 6 | unrated | 171
865.6 | 2 | 5 | unrated | 171
860.4 | 2 | 6 | unrated | 163
813.2 | 3 | 5 | unrated | 169
782 | 1 | 5 | unrated | 183
684.3 | 0 | 6 | 717.5 | 127
675.9 | 0 | 7 | unrated | 173
