this post was submitted on 09 Mar 2024

Technology


Wikifunctions is a new site that has been added to the list of sites operated by WMF. I definitely see uses for it in automating updates on Wikipedia and bots (and also for programmers to reference), but their goal is to translate Wikipedia articles to more languages by writing them in code that has a lot of linguistic information. I have mixed feelings about this, as I don't like existing programs that automatically generate articles (see the Cebuano and Dutch Wikipedias), and I worry that the system will be too complicated for average people.

[–] [email protected] 5 points 8 months ago* (last edited 8 months ago) (11 children)

but their goal is to translate Wikipedia articles to more languages by writing them in code that has a lot of linguistic information

That'll get unruly really fast.

Languages simply don't agree on how to split the usage of words. Or grammatical case. Or if, when and how to do agreement.

Just for the sake of example: how are they going to keep track of case in a way that doesn't break Hindi, or Basque, or English, or Guarani? Or grammatical gender for a word like "milk"? (Not even the Romance languages agree on it.) At a certain point, it simply gets easier to write the article in all those languages than to code something to make it for you.


I think the best use case is automating tidbits of frequently changing data. That's fairly limited, but it could be useful.

[–] [email protected] 5 points 8 months ago (7 children)

Languages simply don’t agree on how to split the usage of words. Or grammatical case. Or if, when and how to do agreement.

Just for the sake of example: how are they going to keep track of case in a way that doesn’t break Hindi, or Basque, or English, or Guarani? Or grammatical gender for a word like “milk”? (Not even the Romance languages agree on it.) At a certain point, it simply gets easier to write the article in all those languages than to code something to make it for you.

I don't know what the WMF is planning here but what you're pointing out is precisely what abstraction would solve.

If you had an abstract way to represent a sentence, you would be independent of any one word order or case or whatever other grammatical feature. In the end you obviously do need actual sentences with these features. To get these, you'd build a mechanism that converts the abstract sentence representation into concrete sentences for specific languages, each correctly constructed according to that language's rules.

Same with gender. What you'd store would not be that, e.g., some German sentence is talking about the feminine milk, but rather that it's talking about the abstract concept of milk. How exactly that abstract concept is represented in words would then be up to individual languages to decide.

I have absolutely no idea whether what I'm talking about here would be practical to implement, but in theory it could work.
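To make the idea concrete, here is a toy sketch of what I mean, not anything the WMF has published. The `Event` type, the concept IDs and the two tiny word lists are all made up for illustration, and the Japanese is romanized and deliberately simplistic. The point is only that the stored representation has no word order, and each language's renderer decides the order and the particles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical language-neutral record: who does what to what."""
    agent: str    # concept ID, e.g. "CAT"
    action: str   # concept ID, e.g. "DRINK"
    patient: str  # concept ID, e.g. "MILK"


# Made-up word lists; a real system would need far richer entries.
WORDS = {
    "en": {"CAT": "the cat", "MILK": "milk", "DRINK": "drinks"},
    "ja": {"CAT": "neko", "MILK": "gyuunyuu", "DRINK": "nomu"},
}


def render_en(event: Event) -> str:
    # English: subject, verb, object.
    w = WORDS["en"]
    sentence = f"{w[event.agent]} {w[event.action]} {w[event.patient]}."
    return sentence[0].upper() + sentence[1:]


def render_ja(event: Event) -> str:
    # Japanese (romanized): subject, object, verb, with particles.
    w = WORDS["ja"]
    return f"{w[event.agent]} wa {w[event.patient]} o {w[event.action]}."


RENDERERS = {"en": render_en, "ja": render_ja}


def render(event: Event, lang: str) -> str:
    """Turn the abstract event into a sentence for one language."""
    return RENDERERS[lang](event)


event = Event(agent="CAT", action="DRINK", patient="MILK")
print(render(event, "en"))  # The cat drinks milk.
print(render(event, "ja"))  # neko wa gyuunyuu o nomu.
```

Everything language-specific lives in the renderers and word lists, so the abstract `Event` itself never changes when a language is added.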

[–] [email protected] -1 points 8 months ago (6 children)

Abstractions are not magic, and they cannot make info appear out of nowhere. Somewhere inside that abstraction you'll need to have the pieces of info that Spanish "leche" [milk] is feminine, that Zulu "ubisi" [milk] is class 11, that the English predicative uses the ACC form, and so on.

And you'll need people to mark, when writing their sentences down, a multitude of distinctions that the abstraction layer would demand for other languages. Such as tagging the "I" in "I see a boy" as "+masculine, +older-person, +informal" so that Japanese correctly conveys it as "ore" instead of "boku", "atashi", "watashi", etc.

Even the idea of an "abstract concept of milk" doesn't work as well as it sounds, because languages split even abstract concepts in different ways. For example, does the abstract concept associated with a living pig include its flesh?

And the language itself cannot decide those things. A language is not an agent; it doesn't "do" something. You'd need people to actively insert those pieces of info for each language. That's perhaps doable for the most spoken ones, but those are the ones that would benefit the least from this.
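A toy sketch of the point about lexical info, with made-up names (`Noun`, `LEXICON`, `milk_is_white`) and only two languages filled in. The gender of "milk" has to be typed in by someone for each language, the renderer needs it for agreement, and a language with no entry yields nothing at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Noun:
    form: str
    gender: str  # "m" or "f"; other languages would need other categories


# Hand-entered data. Spanish and Portuguese disagree on the gender of "milk".
LEXICON = {
    "es": {"MILK": Noun("leche", "f")},
    "pt": {"MILK": Noun("leite", "m")},
}
ARTICLE = {"es": {"m": "El", "f": "La"}, "pt": {"m": "O", "f": "A"}}
WHITE = {
    "es": {"m": "blanco", "f": "blanca"},
    "pt": {"m": "branco", "f": "branca"},
}
COPULA = {"es": "es", "pt": "é"}


class MissingLexicalData(LookupError):
    """Raised when nobody has entered the data a language needs."""


def milk_is_white(lang: str) -> str:
    """Render 'the milk is white' with article and adjective agreement."""
    try:
        noun = LEXICON[lang]["MILK"]
    except KeyError:
        raise MissingLexicalData(f"no entry for 'MILK' in {lang!r}") from None
    g = noun.gender
    return f"{ARTICLE[lang][g]} {noun.form} {COPULA[lang]} {WHITE[lang][g]}."


print(milk_is_white("es"))  # La leche es blanca.
print(milk_is_white("pt"))  # O leite é branco.
# milk_is_white("zu") raises MissingLexicalData until someone adds Zulu data,
# and Zulu would need a noun-class field, not an "m"/"f" gender.
```

Even this two-language toy already bakes a Romance-style gender field into `Noun`, which is exactly the kind of assumption that breaks once a language with noun classes shows up.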

[–] [email protected] 0 points 8 months ago (1 children)

This is an encyclopedia, so there are no pronouns like "I", so this simplifies this issue. The remaining ones are in the third person, and if we link them to data about the person that is referred to it would solve this. A linguist doesn't necessarily need to know a language in order to analyze its grammar, and a lot of the work needed in Wikifunctions is like this.

[–] [email protected] 0 points 8 months ago

This is an encyclopedia, so there are no pronouns like “I”, so this simplifies this issue. The remaining ones are in the third person, and if we link them to data about the person that is referred to it would solve this.

The pronoun is an example. You are confusing the example with the issue.

The issue is that, if some language out there marks a distinction, whoever writes the abstract version of the text will need to mark it, as that info won't "magically" pop out of nowhere. The issue won't appear just in the pronouns, but everywhere.
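As a rough sketch of that burden: the feature sets below are invented and heavily simplified, and `REQUIRED_FEATURES` and both helper functions are hypothetical names. What the writer has to annotate is the union of what every target language needs, not what their own language needs.

```python
# Invented, simplified feature requirements for rendering one referent.
REQUIRED_FEATURES = {
    "en": {"person", "number"},
    "ja": {"person", "number", "gender", "formality"},
}


def features_author_must_mark(langs):
    """Union of the features demanded by every target language."""
    required = set()
    for lang in langs:
        required |= REQUIRED_FEATURES[lang]
    return required


def missing_features(annotation, langs):
    """Features some target language needs that the author left unmarked."""
    return features_author_must_mark(langs) - set(annotation)


# An annotation that is complete from an English point of view.
annotation = {"person": 1, "number": "sg"}

print(missing_features(annotation, ["en"]))        # set()
print(missing_features(annotation, ["en", "ja"]))  # the two Japanese-only features
```

Each language added to the target list can only grow that union, never shrink it.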

A linguist doesn’t necessarily need to know a language in order to analyze its grammar, and a lot of the work needed in Wikifunctions is like this.

Usually, when you aren't proficient in a language but are still describing it, you focus on a single aspect of its grammar (for example, "unergative verbs") and either a single variety or a group of related ones.

What the abstract version of the text would require is nowhere close to that. It's more like demanding that the linguist produce a full, usable grammar of every language found in Wikipedia, just to write a text about some asteroid, in a notation that is cross-linguistically consistent and comprehensible.

Also note that descriptions coming from linguists who are not proficient in the variety in question tend to be poorer.
