Ask Lemmy
A Fediverse community for open-ended, thought provoking questions
Technically not my industry anymore, but: companies that sell human-generated AI training data to other companies most often are selling data that a) isn't 100% human generated or b) was generated by a group of people pretending to belong to a different demographic to save money.
To give an example, let's say a company wants a training set of 50,000 text utterances of US English for chatbot training. More often than not, this data will be generated using contract workers in a non-US locale who have been told to try and sound as American as possible. The Philippines is a common choice at the moment, where workers are often paid between $1-2 an hour: more than an order of magnitude less than what it would generally cost to use real US English speakers.
In the last year or so, it's also become common to generate all of the utterances using a language model, like ChatGPT. Then, you use the same worker pool to perform a post-edit task (look at what ChatGPT came up with, edit it if it's weird, and then approve it). This reduces the time that the worker needs to spend on the project while also ensuring that each datapoint has "seen a set of eyes".
Obviously, this makes for bad training data -- for one, workers from the wrong locale will not be generating the locale-specific nuance that is desired by this kind of training data. It's much worse when it's actually generated by ChatGPT, since it ends up being a kind of AI feedback loop. But every company I've worked for in that space has done it, and most of them would not be profitable at all if they actually produced the product as intended. The clients know this -- which is perhaps why it ends up being this strange facade of "yep, US English wink wink" on every project.
A couple decades ago I worked for a speech recognition company that developed tools for the telephony industry. Every week or two all the employees would be handed sheets of words or phrases with instructions to call a specific telephone extension and read them off. That’s how they collected training data…
I'm not surprised tbh. I've perused some of the text training datasets and they were pretty bad. The classification is dodgy too. I ended up starting my own dataset because of this.
What do you mean by 'classification'? Sentiment analysis?