Basically a deer with a human face. Despite probably being some sort of magical nature spirit, his interests are primarily in technology and politics and science fiction.

Spent many years on Reddit before joining the Threadiverse as well.

  • 0 Posts
  • 2.25K Comments
Joined 1 year ago
Cake day: March 3rd, 2024

  • That’s the neat thing, you don’t.

    LLM training is primarily about getting the LLM to understand concepts. When you need it to be factual, or are working with it to solve novel problems, you can put a bunch of relevant information into the LLM’s context and it can use that even if it wasn’t explicitly trained on it. It’s called RAG, retrieval-augmented generation. Most of the general-purpose LLMs on the net these days do that, when you ask Copilot or Gemini about stuff it’ll often have footnotes in the response that point to the stuff that it searched up in the background and used as context.

    So for a future Stack Overflow LLM replacement, I’d expect the LLM to be backed up by being able to search through relevant documentation and source code.
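The retrieval-then-generate loop described above can be sketched in a few lines. This is a hedged, minimal illustration, not any vendor's actual pipeline: the retrieval step is naive keyword overlap (real systems use embeddings and a vector index), and the final model call is only indicated in a comment.

```python
# Minimal RAG sketch: retrieve relevant passages, then stuff them
# into the prompt so the model answers from context rather than
# from whatever happened to be in its training data.

def retrieve(query, documents, k=2):
    """Rank documents by crude word overlap with the query."""
    q_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query, documents):
    """Number the retrieved passages so the model can cite them."""
    context = "\n\n".join(f"[{i + 1}] {d}" for i, d in enumerate(documents))
    return (
        "Answer using only the sources below, citing them as [n].\n\n"
        f"{context}\n\nQuestion: {query}"
    )

docs = [
    "The requests library raises Timeout after the timeout elapses.",
    "Paris is the capital of France.",
]
query = "what exception does requests raise on timeout"
top = retrieve(query, docs)
prompt = build_prompt(query, top)
# `prompt` now contains the relevant passage; a real system would send
# it to an LLM, which grounds its answer (and footnotes) in that context.
```

The numbered `[n]` markers are what make the footnote-style citations possible: the model quotes a source index, and the UI maps it back to the original page.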



  • As I said above:

    mobs of angry people ignorant of both the technical details and legal issues involved in it.

    Emphasis added.

    They do not “steal” anything when they train an AI off of something. They don’t even violate copyright when they train an AI off of something, which is what I assume you actually meant when you sloppily and emotively used the word “steal.”

    In order to violate copyright you need to distribute a copy of something. Training isn’t doing that. Models don’t “contain” the training material, and neither do the outputs they produce (unless you try really hard to get it to match something specific, in which case you might as well accuse a photocopier manufacturer of being a thief).

    Training an AI model involves analyzing information. People are free to analyze information using whatever tools they want to. There is no legal restriction that an author can apply to prevent their work from being analyzed. Similarly, “style” cannot be copyrighted.

A world in which a copyright holder could prohibit you from analyzing their work, or could prohibit you from learning and mimicking their style, would be nothing short of a hellish corporate dystopia. I would say it baffles me how many people are clamoring for this supposedly in the name of “the little guy”, but sadly, it doesn’t baffle me at all. I know how selfish and short-sighted people can be, imagining that they’re owed for their hard work of shitposting on social media (which they did at the time for free and for fun) now that someone else is making money off of it. There are a bunch of lawsuits currently churning through courts in various jurisdictions claiming otherwise, but let us hope that they all get thrown out like the garbage they are, because the implications of them succeeding are terrible.

    The world is not all about money. Art is not all about money. It’s disappointing how quickly and easily masses of people started calling for their rights to be taken away in exchange for the sliver of a fraction of a penny that they think they can now somehow extract. The offense they claim to feel over someone else making something valuable out of something that is free. How dare they.

And don’t even get me started on the performative environmental ignorance, the “they’re disintegrating all the water!” and “each image generation uses enough power for a billion homes!” nonsense.


  • It’s a great new technology that unfortunately has become the subject of baying mobs of angry people ignorant of both the technical details and legal issues involved in it.

It has drawn some unwarranted hype, sure. It’s also drawn unwarranted hate. The common refrain of “it’s stealing from artists!” is particularly annoying; it’s just another verse in the never-ending march to further monetize and control every possible scrap of peoples’ thoughts and ideas.

I’m eager to see all the new applications for it unfold, and I hope that the people demanding that it be restricted with draconian new varieties of intellectual property law, or placed solely under the control of gigantic megacorporations, won’t prevail (these are often the same people, though they don’t realize that those demands amount to the same thing).


  • This is an area where synthetic data can be useful. For example, you could scrape the documentation and source code for a Python library and then use an existing LLM to generate questions and answers about the content to train future coding assistants on. As long as the training data gets well curated for quality it’s perfectly useful for this kind of thing, no need for an actual forum.

    AI companies have a lot of clever people working for them, they’re aware of these problems.
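The docs-to-Q&A pipeline described above can be sketched like so. This is a hypothetical illustration, not any company's actual tooling: `ask_llm` is a stub standing in for a real model API call, and the curation step is deliberately crude.

```python
# Sketch of synthetic training-data generation: turn documentation
# chunks into (question, answer) pairs, then filter for quality.

def ask_llm(prompt):
    # Placeholder: a real implementation would call a model API here.
    return "Q: What does json.loads do?\nA: It parses a JSON string."

def make_qa_pairs(doc_chunks):
    """Ask the model to write a Q&A pair about each documentation chunk."""
    pairs = []
    for chunk in doc_chunks:
        raw = ask_llm(
            "Write one question a user might ask about this text, then "
            "the answer, formatted as 'Q: ...' and 'A: ...' lines:\n\n" + chunk
        )
        q, _, a = raw.partition("\nA:")
        pairs.append((q.removeprefix("Q:").strip(), a.strip()))
    return pairs

def curate(pairs, min_len=10):
    """Crude quality gate: drop pairs whose answers are too short.

    Real curation also deduplicates, checks answers against the source
    text, and scores pairs with a separate judge model.
    """
    return [(q, a) for q, a in pairs if len(a) >= min_len]

chunks = [
    "json.loads(s) deserializes a JSON document string s to a Python object."
]
dataset = curate(make_qa_pairs(chunks))
```

The resulting `dataset` is exactly the kind of forum-style question-and-answer material a coding assistant trains on, without a forum ever being involved.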



  • I’m a fan of the Machete Order.

There may be some spoilers in that blog post (it’s been a while since I read it), so here’s the order in summary:

    • A New Hope (4)
    • Empire Strikes Back (5)
    • Attack of the Clones (2)
    • Revenge of the Sith (3)
• Return of the Jedi (6)

    Phantom Menace is omitted because it’s the weakest of the prequel trilogy and everything that happens in it is summarized at the beginning of Attack of the Clones anyway. If you want to be a completionist then watch it between Empire Strikes Back and Attack of the Clones.

There are good reasons for following this order, but it’s hard to describe them without spoiling anything. Basically, Lucas assumed you’d already watched the original trilogy when he made the prequels, so they spoil its big reveals; the Machete Order preserves those reveals quite nicely.






  • So they’re still feeding LLMs their own slop, got it.

    No, you don’t “got it.” You’re clinging hard to an inaccurate understanding of how LLM training works because you really want it to work that way, because you think it means that LLMs are “doomed” somehow.

It’s not the case. The curation and synthetic data generation steps don’t work the way you appear to think they work. Curation of training data has nothing to do with Yahoo’s directories, and I have no idea why you would think that’s a bad thing even if it were like that, aside from the notion that “Yahoo failed, therefore if LLM trainers are doing something similar to Yahoo then they will also fail.”

    I mean that they’re discontinuing search engines in favour of LLM generated slop.

    No they’re not. Bing is discontinuing an API for their search engine, but Copilot still uses it under the hood. Go ahead and ask Copilot to tell you about something, it’ll have footnotes linking to other websites showing the search results it’s summarizing. Similarly with Google, you say it yourself right here that their search results have AI summaries in them.

    No there’s not, that’s not how LLMs work, you have to retrain the whole model to get any new patterns into it.

The problem with your understanding of this situation is that Google’s search summary is not solely from the LLM. What happens is Google does the search, finds the relevant pages, then puts the content of those pages into their LLM’s context and asks the LLM to create a summary of that information relevant to the search that was used to find it. So the LLM doesn’t actually need to have that information trained into it; it’s provided as part of the context of the prompt.

You can experiment a bit with this yourself if you want. Google has a service called NotebookLM, https://notebooklm.google.com/, where you can upload documents and then ask an LLM questions about their contents. Go ahead and upload something that hasn’t been in any LLM training sets and ask it some questions. Not only will it give you answers, it’ll include links that point to the sections of the source documents where it got those answers from.







  • Betteridge’s law of headlines.

Modern LLMs are trained using synthetic data, which is explicitly AI-generated. It’s done so that the data’s format and content can be tailored to optimize its value in the training process. Over the past few years it’s become clear that simply dumping raw data from the Internet into LLM training isn’t a very good approach. It sufficed to bootstrap AI development, but we’re kind of past that point now.

    Even if there was a problem with training new AIs, that just means that they won’t get better until the problem is overcome. It doesn’t mean they’ll perform “increasingly poorly” because the old models still exist, you can just use those.

    But lots of people really don’t like AI and want to hear headlines saying it’s going to get worse or even go away, so this bait will get plenty of clicks and upvotes. Though I give credit to the body of the article, if you read more than halfway down you’ll see it raises these sorts of issues itself.



