Technology

70285 readers

4310 users here now

This is a most excellent place for technology news and articles.

Our Rules

Follow the lemmy.world rules.
Only tech related news or articles.
Be excellent to each other!
Mod approved content bots can post up to 10 articles per day.
Threads asking for personal tech support may be deleted.
Politics threads may be removed.
No memes allowed as posts, OK to post as comments.
Only approved bots from the list below, this includes using AI responses and summaries. To ask if your bot can be added please contact a mod.
Check for duplicates before posting, duplicates may be removed
Accounts 7 days and younger will have their posts automatically removed.

Approved Bots

founded 2 years ago

MODERATORS

[email protected]

675

Mozilla lays off 60 people, wants to build AI into Firefox (arstechnica.com)

submitted 1 year ago by [email protected] to c/[email protected]

316 comments fedilink hide all child comments

you are viewing a single comment's thread
view the rest of the comments

[–] [email protected] 1 points 1 year ago* (last edited 1 year ago)

ONNX Runtime is actually decently well optimized to run on CPUs; even with large models. However, the simple truth is that there’s really no escaping that Billion+parameter models need to be quantized and even pruned heavily to fit in memory and not saturate the CPU cache so inferences/generations don’t take forever. That’s a reduction in accuracy, so the quality of the generations aren’t great.

There is a lot of really interesting research and development being done right now on smart quantization and pruning. Model serving technologies are improving rapidly too—paged attention is a really cool technique (for transformer based models) for effectively leveraging tensor core hardware—I don’t think that’s supported on CPU yet but it’s probably not that far off.

It’s a really active field and there’s just as much interest in running huge models on huge hardware as there is big models on small hardware. I recently heard of layerwise inference for CPUs; load each layer of the network to the CPU cache on demand. That’s typically a bottleneck operation on GPUs but CPU memoery so bloody fast that it might actually work fine. I haven’t played with it myself, or read the paper all that deeply so I can’t really comment more than it’s an interesting idea.