Pickled Light: Commodore 64 Blind AB Audio Tests

At the links below, I have prepared blind AB audio tests comparing an original MOS 6581 SID chip with modern SID replacements:

Test 01 - 6581 SID versus ARMSID

Test 02 - 6581 SID versus Nano SwinSID b

Anticipated Questions

What is this?

It's very simple: an AB audio test allows you to directly compare two or more audio samples. For these tests, I've recorded outputs from an ARMSID, a Nano SwinSID b, and an original 6581 SID chip. If you decide to take any of the tests, you will be asked to listen to 5 audio samples and for statistical relevance you must listen to each sample 10 times. Therefore, there are 50 iterations in total. During each iteration you will be presented with two options, A or B.

In each iteration, options A and B are assigned to the two chips at random, shuffled with the Fisher-Yates algorithm. The test is blind, so you will not be told which option is the replacement and which is the original SID. Instead, you simply decide during each iteration which sample you prefer, A or B. At the end of the test you will be presented with a report that summarises your selections for each tune and reveals which chip, the original or the replacement, you indicated a preference for, if any. To start the audio, simply press the round A or B button. You can adjust the volume using the slider at the bottom.
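For the curious, here is a minimal Python sketch of how a Fisher-Yates shuffle could assign the two chips to the A and B buttons for one iteration. It is an illustration only, not the code the tests actually run: the names `CHIPS`, `fisher_yates_shuffle` and `assign_options` are made up for this example, and exactly where the tests apply the shuffle is an assumption.

```python
import random

# Illustrative labels only; the real test pairs the 6581 with one replacement.
CHIPS = ["6581 SID", "replacement"]


def fisher_yates_shuffle(items, rng):
    """Return a shuffled copy of items using the Fisher-Yates algorithm."""
    shuffled = list(items)
    # Walk from the last index down to 1, swapping each element with a
    # randomly chosen element at or before it.
    for i in range(len(shuffled) - 1, 0, -1):
        j = rng.randint(0, i)  # randint is inclusive at both ends
        shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
    return shuffled


def assign_options(rng):
    """Map the on-screen buttons A and B to chips for one iteration (hypothetical helper)."""
    first, second = fisher_yates_shuffle(CHIPS, rng)
    return {"A": first, "B": second}


rng = random.Random()
assignment = assign_options(rng)
print(assignment)
```

With only two items the shuffle is effectively a coin toss, but the same function works unchanged for longer lists, such as the order in which the iterations are presented.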

It goes without saying, but I'm going to flat out say it anyway: absolutely no personal information, at all, is gathered.

Which SID files did you use?

I used 5 PAL SID files:

Master of the Lamps by Russell Lieblich (1985)
Ocean Loader 3 by Peter Clarke (1987)
Spijkerhoek by Edwin van Santen (1989)
We M.U.S.I.C. 1 by Ben Daglish & Antony Crowther (1986)
Wizball by Martin Galway (1987)

There are thousands of SID files to choose from, so it's very likely your favourite isn't in this list.

Why do this?

Because I think it's interesting and, as of now, I'm not aware that there are any blind comparison tests of this nature currently active. Imagine the following scenario:

I have bought, or built, a new-to-me Commodore 64 but it doesn't have a SID chip. If I want audio, I need to either buy an original SID second hand, or a modern replacement. However, I have no idea how they compare to each other. Is a new implementation good enough? How do I know? Well, unless I have some way of directly comparing them, I don't. This test allows anyone to make an informed decision without bias. Having done the tests myself, even knowing what to expect (so for me it wasn't really blind), I was occasionally quite surprised at what I was hearing.

The alternative, asking in an online forum for example, surfaces many opinions, but little in the way of useful, practical audio information. That said, you should still do your research - some modern replacements have hardware limitations the original SID doesn't (such as lacking paddle support), so you should gather all the information you can. These tests merely add to that.

Why select these particular replacements for this comparison?

That's easy, because I already own them, bought with my own money with no affiliation or sponsorship. If I buy one of the other modern replacements, I might do the same test for that but don't hold your breath.

Are you trying to prove something?

I've no agenda. I have no affiliation with the developers of the replacements and MOS Technology have been gone for decades so I've no investment in them. Beyond wanting a fair way to directly compare chips I have, without bias, I'm not trying to prove anything. The results of the test simply indicate a personal preference and allow you to come to a conclusion to aid a buying decision. There is no right or wrong.

This test is flawed because [insert reason here]

Probably.

Audio is a divisive, emotive subject. If we think about it logically, all the modern implementations (bar none) are trying to emulate the original SID chip. Therefore, if we think in terms of "best" and "worst", an original, fault-free 6581 SID should always be the "best" and everything else should be "worse". The student can never beat the master. Or can they? Honestly, I don't think that's for me to decide. I have my opinions, and these will not be the same as yours. Both are valid.

As a technical exercise, however, I have done everything possible to make the test fair. All audio was played and recorded using identical software and hardware. I used my SixtyClone 250466 to play the SID files, and the only things that changed between capture sessions were the time of recording and the SID chip/replacement. The audio samples were all captured in Audacity, aligned to start and end at the same time, and saved as lossless FLAC files of identical length. If the samples were generated and captured with your equipment rather than mine, the audio would likely sound different too.

Another difficult issue is that original 6581 SID chips do not sound the same as each other and so there is no perfect standard. Some might say this makes an objective test impossible. However, assuming the 6581 SID I'm using has no obvious faults, my recorded samples are indicative of a single random example. I do acknowledge though, that if you were to repeat this test with recordings of your 6581 instead of mine, the results would be different, which is less than ideal.

The ARMSID has a utility that allows you to alter the audio filters which in turn affects the audio output. There are 45 possible filter combinations from which to select. There is no way in hell I'm going to prepare 45 separate tests with all possible combinations, so the settings used are as follows (this is what I have it set to normally):

How you choose to listen to the audio in these tests may also affect the outcome. Only you can decide what works for you, but at a minimum I'd suggest a quiet environment with equipment capable of hearing some fine detail.

And finally, it's extremely likely that the particular chip-tunes used here don't push a particular filter to its limit or demonstrate a particular weakness. Unfortunately, a test such as this cannot be all-encompassing. Time and attention spans are limited and, frankly, the current parameters of the tests (5 samples, 10 times each) are already asking a lot. Expanding them to cover more scenarios would be too hard on the listener. It's indicative, not definitive.

So yes, the tests are not perfect. Ultimately, it is your choice to take them or not, and if you do, it's your choice whether you put any stock in the results. You are, of course, 100% free to create your own blind test with better parameters if you feel that strongly.

What do the results mean?

After you have completed the 50 iterations for the 6581 versus ARMSID test, for example, let's say the results indicate you preferred the ARMSID 25 times and the original 6581 25 times. That's a 50/50 split and could indicate you had no real preference, or that you couldn't hear a difference. However, perhaps in 2 of the audio samples you preferred the original SID 100% of the time, in another 2 samples you preferred the ARMSID 100% of the time, and for the 5th sample it was a 50/50 split. That could indicate your preference is for a particular chip when playing a particular chip-tune.
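To make that scenario concrete, here is a small Python sketch that tallies a made-up set of 50 selections both overall and per tune. The data, the tune labels and the `summarise` function are all hypothetical; the real report may be laid out quite differently.

```python
from collections import Counter


def summarise(choices):
    """Tally preferences overall and per tune.

    choices is a list of (tune, preferred_chip) pairs, one per iteration.
    """
    per_tune = {}
    for tune, chip in choices:
        per_tune.setdefault(tune, Counter())[chip] += 1
    overall = Counter(chip for _, chip in choices)
    return overall, per_tune


# Hypothetical selections matching the scenario described above.
choices = (
    [("Tune 1", "6581")] * 10
    + [("Tune 2", "6581")] * 10
    + [("Tune 3", "ARMSID")] * 10
    + [("Tune 4", "ARMSID")] * 10
    + [("Tune 5", "6581")] * 5
    + [("Tune 5", "ARMSID")] * 5
)

overall, per_tune = summarise(choices)
print("Overall:", dict(overall))
for tune, counts in per_tune.items():
    print(tune, dict(counts))
```

The overall tally comes out at 25 each, which on its own looks like no preference at all, while the per-tune tallies show four tunes with a unanimous preference. That is why the summary for each tune is worth reading alongside the total.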

Maybe your results show a clear preference for one chip over the other in all samples. This outcome is a little more clear cut and less open to interpretation.

In all circumstances you may wish to consider the p-value. This number indicates how likely it is that a split at least as lopsided as yours would turn up if your selections were pure chance. A very low p-value means the results are unlikely to be down to chance; a very high p-value means they could easily be down to chance. So, theoretically, if you completed all 50 iterations and simply selected A or B at random, then no pattern would emerge and the p-value would be very high. Thus, the consensus says, the lower the p-value, the more reliable the result. This is not perfect, however: please bear in mind that it is perfectly possible to toss a coin 50 times and get the same result every time - it's not "likely", but it is "possible", and that would still give you a very low p-value (about 8.88 × 10^-16). So, if you find you can't really hear a difference in the samples and end up just choosing A or B at random, you might, by chance, select the same chip every time. You must bear that in mind when you consider your results.
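If you want to check a p-value by hand, the sketch below uses an exact binomial test against a fair 50/50 coin. This is an assumption on my part about a reasonable way to do it, not a statement of how the test report calculates its figure; the function name `binomial_p_value` and the `two_sided` switch are invented for this example.

```python
from math import comb


def binomial_p_value(successes, trials, two_sided=True):
    """Exact binomial test against pure chance (a fair 50/50 coin).

    successes is how many times one particular chip was preferred.
    """
    # For a two-sided test, measure how lopsided the split is in either direction.
    k = max(successes, trials - successes) if two_sided else successes
    # Probability of k or more "heads" in `trials` fair tosses.
    tail = sum(comb(trials, i) for i in range(k, trials + 1)) / 2 ** trials
    if two_sided:
        return min(1.0, 2 * tail)  # cap at 1 for perfectly even splits
    return tail


print(binomial_p_value(25, 50))                   # even split
print(binomial_p_value(35, 50))                   # a fairly clear preference
print(binomial_p_value(50, 50, two_sided=False))  # same chip every time
```

The last call is the coin-toss case from the paragraph above: the one-sided probability of picking the same named chip 50 times out of 50 is 0.5 to the power of 50, or roughly 8.88 × 10^-16. A two-sided test, which asks about a sweep for either chip, would double that.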

At the end of the day, the results are yours to interpret as you please. I've said it already, but it's important so I'll say it again: these tests cannot possibly cover all scenarios the average SID chip is likely to encounter so your real world experience will be different. These tests are indicative, not definitive.

Will you be repeating these tests with an 8580?

Nope.

Where can I find out more about the test method?

Here.

Friday, 13 December 2024
