NSGminer v0.9.2: The Fastest Feathercoin / NeoScrypt GPU Miner
-
About 40K FTC received in donations, that’s about 0.3 BTC. Keep ’em coming ;)
I’ve ordered a Gigabyte GTX 750 Ti (GV-N75TOC-2GI). It arrived today and I’m testing it now. I’m able to get 115KH/s at 1200MHz shaders or 133KH/s at 1400MHz shaders (~70% TDP). There are minor issues, but shares get solved just fine, no HW errors.
-
Sending 800-ish FTC for now; have my withdrawal address set to you as well ;)
Running at 3MH/s currently, hoping to add another 2.4MH/s tonight. Muhaha, lots of work though, holy shhhh
-
This is probably a dumb question, but can you explain the screen output? It’s been a while since I’ve used CGMiner or BFGMiner.
Specifically the hash rate - there seem to be three listed per video card. What does each one represent?
-
@AmDD It’s in the README:
5s: A 5 second exponentially decaying average hash rate
avg: An all time average hash rate
u: An all time average hash rate based on actual accepted shares
The same per GPU.
-
@ghostlander said:
@AmDD It’s in the README:
5s: A 5 second exponentially decaying average hash rate
avg: An all time average hash rate
u: An all time average hash rate based on actual accepted shares
The same per GPU.
I saw where it explained 5s and avg, but didn’t continue reading to where it mentions the columns. Thanks.
-
I used to run my miners with -c pools.conf and had multiple pools with failovers in that file. What’s the syntax for this miner? I can’t seem to get it to work with my pool file.
-
@RIPPEDDRAGON Let it write a config file and edit that: press [S] for settings, then [W] to write the config file.
-
Thanks, I’ll take a look when I get back tonight.
-
Still not quite the fastest - I can reach just under (sometimes slightly above) 430kh/s on a 7950. Ghostlander’s latest result (his v7 Beta kernel) that he posted here: https://bitcointalk.org/index.php?topic=712650.msg13611456#msg13611456 is 400kh/s on a 7990.
Now, there IS a big discrepancy in our clock speeds - I’m running 1100/1500 while he’s running 850/1250 - but there’s also a discrepancy in compute unit count: a 7950’s core has 28 compute units (CUs), while a 7970’s core (of which the 7990 has two) has 32 CUs. Because of this, and the fact that NeoScrypt as used in Feathercoin’s proof-of-work is compute-bound on current GPUs - NOT memory-bound, as is a common misconception, or at least a good implementation should not be - a 7970 (or R9 280X) should see a 14.286% boost in NeoScrypt hashrate over a 7950 (or R9 280).
For fun, before I lower my clocks and test again, I’ll estimate based on core clock speeds: at 850 core my kernel should lose about 22.7% (1 - 850/1100) relative to 1100 (I’ll drop the memclock too, but I don’t think it’ll have much of an effect at all), which means my new hashrate on the 7950 should come out around 332kh/s.
@ghostlander said:
I recall SGminer recommends to use xintensity. It isn’t guaranteed to deliver power of 2 thread numbers which is a must for my kernel. The classic intensity results in (2 ^ intensity) thread numbers which is fine.
I highly disagree that this is “fine” - assuming that means optimal. However, I haven’t patched in any xintensity support yet, so I’m still forced to use the very coarse-grained classic intensity as well.
-
I’m rather surprised - but not totally confused - by the results of my test. The clocks 850/1250 result in a hashrate over 350kh/s on the 7950 with my code. I wondered if there was a reason besides heat (and therefore probably downvolting) that caused Ghostlander to pick those rather low clocks.
-
@Wolf0 350 * 32 / 28 = 400KH/s
That’s what I have now. Maybe a little bit more. The primary reason for downvolting and downclocking is power consumption indeed. HD7990 with the default 1000/1500 clocks @ 1.2V eats too much power (400W+) and air cooling cannot keep it within 85C. Now it’s 250W @ 1.0V and gets within 70C just fine.
Modern high end GPUs have excessive memory bandwidth for NeoScrypt with quality kernel optimisations. That’s not a problem. SHA-256d ASICs don’t consume it at all while Scrypt ASICs may also be compute bound depending on eDRAM size/speed and number of execution units.
I can patch in xintensity and/or rawintensity, though I doubt it’s going to make a big difference.
-
@ghostlander said:
@Wolf0 350 * 32 / 28 = 400KH/s
That’s what I have now. Maybe a little bit more. The primary reason for downvolting and downclocking is power consumption indeed. HD7990 with the default 1000/1500 clocks @ 1.2V eats too much power (400W+) and air cooling cannot keep it within 85C. Now it’s 250W @ 1.0V and gets within 70C just fine.
Modern high end GPUs have excessive memory bandwidth for NeoScrypt with quality kernel optimisations. That’s not a problem. SHA-256d ASICs don’t consume it at all while Scrypt ASICs may also be compute bound depending on eDRAM size/speed and number of execution units.
I can patch in xintensity and/or rawintensity, though I doubt it’s going to make a big difference.
It’s not quite excessive memory bandwidth - the bandwidth isn’t used much, ESPECIALLY not in your kernel (at least the last one I saw). The reason is simple: you can only run a limited number of work-items before the CUs can’t execute them in parallel, and the extra wavefronts get queued up for execution after the current ones complete. Because of this, only the waves currently in flight are accessing memory at all.
NeoScrypt is NOT memory intensive - you can’t even exhaust GPU memory before you exhaust compute if you run NeoScrypt with parallel SMix() calls - at least not on a 7950 (R9 280) or 7970 (R9 280X). I know because I tried. Keeping in mind you can only run as many work-items in parallel as your compute resources will support, factor in that NeoScrypt doesn’t do a ton of lookups relative to its level of computation, AND that said lookups are random rather than sequential, and you get the result that memory bandwidth isn’t going to help you much at all.
In your kernel particularly - not only do you have few waves in flight, but you’re running NeoScrypt in such a way that the SMix() calls are sequential! (Again, this is as of the last time I looked, and GitHub is currently down.) Because of this, each wavefront is doing half the memory lookups it could be at a time, because it does the ChaCha SMix() and the Salsa SMix() one after the other - they aren’t done in parallel, so the memory bus is less utilized.
A smaller issue you have is code size - weighing in at 121,264 bytes, that’s fully 3.7x the GCN code cache. Those extra fetches will hurt you - they hurt me pretty badly in my main NeoScrypt loop until I finally coaxed the compiler into not duplicating code there.
Additionally, your XORs in FastKDF aren’t doing you any favors - the design of FastKDF makes it a bitch to do aligned copies, but not impossible. For now, I’ve just done the XOR into the destination in an aligned fashion, as I’ve yet to make the non-byte-wise copies and other XORs play nice and FINALLY coax the AMD OpenCL compiler to do away with the scratch registers, which I believe are costing me quite a bit.
Your XZ variable is also one ulong16 larger than it needs to be for the main function - the extra is only needed inside FastKDF. I haven’t checked the disassembly to see if this hurts you, but because it isn’t local to the FastKDF function (and the AMD OpenCL compiler is very, very stupid), it could be causing undue register usage - the compiler might hold those registers and never reuse them in the main portion of NeoScrypt.
There are other smaller bits - like the little if/else branch where you XOR into the password, which isn’t needed (and you know how GPUs hate branching, I’m sure) - but I don’t think those are hurting you much, except maybe cumulatively.
EDIT: Oh, almost forgot to respond to the comment about xintensity and rawintensity (hereafter xI and rI, respectively). The options intensity and xI are really convenience options - the only one you REALLY need is rI, as any value passed to the others has an equivalent in rI. Now, the reason I believe xI and/or rI will make a small yet significant difference is scheduling. If you load up the compute units of your card manually, it takes some load off the scheduler, which MAY otherwise do something sub-optimal. For example, 2^14 is 16,384 work-items; with 64 work-items per wavefront and 28 CUs per GPU on a 7950 (and R9 280), if the host code doesn’t enqueue fast enough, or there’s some kind of stall, you could end up running 16,384 / (28 * 64) = 16,384 / 1,792 = 9 full rounds of wavefronts with the remaining ~0.143 of a round scheduled to run alone - leaving most of the CUs idle for that run. It’s not like this is extremely likely to occur often, but I think it does occur, because I’ve seen improvements from using xI instead of regular intensity with NeoScrypt before now. So while I don’t believe the improvement would be major, I do believe it would be worthwhile.
-
I still haven’t patched in more precise intensity, but I have managed to improve upon my 01/17/2016 record by around 2.657% - 425kh/s on 7950 at 1050/1500 now. To compare with the 7990, I also ran a test at 1000/1500 - 410kh/s to 411kh/s. I’ll do some tests on power draw later.
-
@Wolf0 I have optimised the most important XOR in FastKDF already. It was a bottleneck to do it bytewise on GCN. 120K kernel size isn’t very large because Salsa and ChaCha separately fit the code cache and FastKDF has more important issues like memory alignment. I’ll try to optimise it better.
-
I like the idea of optimising for “power efficiency”, not “speed”. ;)
-
@ghostlander said:
@Wolf0 I have optimised the most important XOR in FastKDF already. It was a bottleneck to do it bytewise on GCN. 120K kernel size isn’t very large because Salsa and ChaCha separately fit the code cache and FastKDF has more important issues like memory alignment. I’ll try to optimise it better.
Which XOR would that be? I feel like I’m derping and missing something obvious, but I see the ending XOR with the if/else branch outside the loop, and the XOR inside the loop done with a call to neoscrypt_bxor()… I just looked at your current git again, double-checked this, then read the neoscrypt_bxor() function again - it’s still bytewise. Unless you mean something you’ve not pushed, in which case never mind. If you have, then nice - my trick with aligning the XOR worked out for you.
Anyway, you seem to be working from the outside in, rather than from the inside out, when it comes to optimizing the code - the “outside” being the portions where less time is spent, and the “inside” being the opposite. You might want to look into SMix() - that’s where you can really gain hashrate.
@wrapper said:
I like the idea of optimising for “power efficiency”, not “speed”. ;)
They are almost always one and the same in the GPU arena. If I have shitty, slow code, it leaves portions of the GPU unused, or at least under-utilized, causing the lower power consumption people notice. However, if those resources are used well, the hashrate goes up far more than the power draw does - I actually have records from my really old X11 optimizations to show this, as well as exact percentages taken from runs of the (then) stock X11 shipping with SGMiner versus mine on Freya.
-
@Wolf0 https://github.com/ghostlander/nsgminer/blob/692e2ef2946229cf057dd006c8e85c8674f0342f/neoscrypt.cl#L713
It’s executed 64 times per hash. The final XOR outside the loop is less important.
@Wolf0 said:
Unless you mean something you’ve not pushed, in which case never mind. If you have, then nice - my trick with aligning the XOR worked out for you.
Well, I added it to my beta 10 days ago. You mentioned doing that XOR in uints; I have vectorised it, which is also fine. Not uploaded to GitHub yet, but quite a few people use it right now. It’s well improved over the previous release in performance and compatibility. I see only a 5% decrease when switching from 14.6 to 15.7 drivers. It was much worse before (https://bitcointalk.org/index.php?topic=712650.msg13585416#msg13585416).
-
@ghostlander said:
@Wolf0 https://github.com/ghostlander/nsgminer/blob/692e2ef2946229cf057dd006c8e85c8674f0342f/neoscrypt.cl#L713
It’s executed 64 times per hash. The final XOR outside the loop is less important.
@Wolf0 said:
Unless you mean something you’ve not pushed, in which case never mind. If you have, then nice - my trick with aligning the XOR worked out for you.
Well, I added it to my beta 10 days ago. You mentioned doing that XOR in uints; I have vectorised it, which is also fine. Not uploaded to GitHub yet, but quite a few people use it right now. It’s well improved over the previous release in performance and compatibility. I see only a 5% decrease when switching from 14.6 to 15.7 drivers. It was much worse before (https://bitcointalk.org/index.php?topic=712650.msg13585416#msg13585416).
OH, lol, yes, that is good, but that was not what I meant! This line:
[code]
neoscrypt_bxor(&Bb[bufptr], &T[0], 32);
[/code]
I’m saying I did this operation using uints.
-
@Wolf0 I get it. I’ve also rewritten it. The code quoted is plain bytewise, though old VLIW GPUs like it for some arcane reason.
-
@ghostlander said:
@Wolf0 I get it. I’ve also rewritten it. The code quoted is plain bytewise, though old VLIW GPUs like it for some arcane reason.
Odd. I got my 6970 today, so I should be able to work on Cayman in a while.