Class-Action Lawsuit for Scraping Data without Permission

I have mixed feelings about this class-action lawsuit against OpenAI and Microsoft, claiming that it “scraped 300 billion words from the internet” without either registering as a data broker or obtaining consent. On the one hand, I want this to be a protected fair use of public data. On the other hand, I want us all to be compensated for our uniquely human ability to generate language.

There’s an interesting wrinkle on this. A recent paper showed that using AI generated text to train another AI invariably “causes irreversible defects.” From a summary:

The tails of the original content distribution disappear. Within a few generations, text becomes garbage, as Gaussian distributions converge and may even become delta functions. We call this effect model collapse.
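The collapse described in that summary can be reproduced in miniature. The sketch below is not the paper's experiment. It assumes a one-dimensional Gaussian "model" that is refit by maximum likelihood to ten samples drawn from the previous generation's fit; the function name, sample size, generation count, and seed are all arbitrary choices made for illustration.

```python
import random
import statistics

def simulate_collapse(generations=500, samples_per_gen=10, seed=0):
    """Refit a Gaussian to its own samples, generation after generation.

    Returns the fitted standard deviation at each generation, starting
    from the original "human" distribution N(0, 1).
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    history = [sigma]
    for _ in range(generations):
        # Train only on the previous model's output.
        data = [rng.gauss(mu, sigma) for _ in range(samples_per_gen)]
        # Maximum-likelihood refit: sample mean and population std.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        history.append(sigma)
    return history

if __name__ == "__main__":
    stds = simulate_collapse()
    for gen in (0, 10, 50, 100, 500):
        print(f"generation {gen:3d}: fitted std = {stds[gen]:.3g}")
```

Under these assumptions each refit loses a little of the tails on average, and the losses compound, so the fitted spread should drift toward zero: the Gaussian narrows toward a delta function.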

Just as we’ve strewn the oceans with plastic trash and filled the atmosphere with carbon dioxide, so we’re about to fill the Internet with blah. This will make it harder to train newer models by scraping the web, giving an advantage to firms which already did that, or which control access to human interfaces at scale. Indeed, we already see AI startups hammering the Internet Archive for training data.

This is the same idea that Ted Chiang wrote about: that ChatGPT is a “blurry JPEG of all the text on the Web.” But the paper includes the math that proves the claim.

What this means is that text from before last year—text that is known human-generated—will become increasingly valuable.

Tags: academic papers, artificial intelligence, chatbots, courts

Posted on July 5, 2023 at 7:14 AM • 36 Comments

Comments

Tatütata • July 5, 2023 8:47 AM

What this means is that text from before last year—text that is known human-generated—will become increasingly valuable.

A bit like steel smelted before 8 August 1945…

The “JPEG” metaphor reminds me of the 1962 short story “The Library of Babel” (“La biblioteca de Babel”) by Jorge Luis Borges. It describes the universe as a library, which apparently contains every possible book imaginable and then some (but no one can be sure), including such ones composed only of the letter “a” repeated on all pages. Texts demonstrating any thesis and its opposite can be located. A dwindling folk of librarians inhabits the maze of corridors and bookshelves, whose quest is to try to find THE book that summarizes every other one.

Tatütata • July 5, 2023 8:54 AM

The tails of the original content distribution disappear. Within a few generations, text becomes garbage, as Gaussian distributions converge and may even become delta functions. We call this effect model collapse.

Academia just discovered GIGO and the telephone game. Alleluia!

Just as we’ve strewn the oceans with plastic trash and filled the atmosphere with carbon dioxide,

and low-orbit space with débris.

so we’re about to fill the Internet with blah.

Isn’t it already? I just made my daily contribution.

Winter • July 5, 2023 9:04 AM

I see a very lucrative market appearing for (high school) students working part-time as “real” human text producers.

Winter • July 5, 2023 9:14 AM

Continued

I see a very lucrative market appearing for (high school) students working part-time as “real” human text producers.

Imagine, getting paid to do your homework! [1]

And suddenly, all those “free” social media networks can get paid if they can guarantee real human text, ie, NO MORE BOTS! [2]

[1] Even at the exorbitant cost of $1 per 1000 words, that does not add up a lot, but the flip side is that you have to do your homework anyway. So it is “free” money.

[2] Obviously, there will be checks to see whether the text is generated by real humans. But the likes of Mechanical Turk do that already.

NC • July 5, 2023 9:27 AM

Hah, normal people don’t get paid! If a big tech company decides they want high schoolers’ essays, they’ll just have Pearson or a Pearson-alike company make essay-writing a part of the homework program they distribute with their textbooks, and thousands of teachers will require hundreds of thousands of students to submit millions of hours of work for free. For which Pearson might make a few bucks.

Winter • July 5, 2023 9:49 AM

@NC

Hah, normal people don’t get paid!

Damn, my scheme is already torpedoed by those pesky capitalists.

But the matter is not really solved yet:
Who Owns Student Work?
https://designobserver.com/feature/who-owns-student-work/12667/

I know local Universities claim copyright to students’ works by way of some overarching educational contract (this is EU). I am not sure whether that has ever been tested in court. But I have never heard of schools being allowed to sell student work without getting the student involved.

merlec • July 5, 2023 10:19 AM

On the other hand, I want us all to be compensated for our uniquely human ability to generate language.

By definition, I’d think that the things on which “artificial intelligence” can compete with humans are no longer “uniquely human” abilities. So I don’t quite get what this is saying. Unless perhaps it’s describing a “race to the bottom”, in which the quantity and cheapness of “bad” language would drive out even the well-written stuff that remains uniquely human (which is kind of what happens when AI trains AI, but has little to do with the lawsuit).

Anyway, merely wanting something doesn’t mean it’s good public policy. Lots of people would love to be compensated for doing nothing, but there’s still no widespread agreement that basic income is a good idea. The “wrinkle” provides a possible answer: AI companies could pay you to come to their offices, prove you’re not a robot, and write known-human-generated text.

Lorin Ricker • July 5, 2023 10:57 AM

@Winter

I know local Universities claim copyright to students’ works by way of some overarching educational contract (this is EU). I am not sure whether that has ever been tested in court. But I have never heard of schools being allowed to sell student work without getting the student involved.

Back in the late-90s/early-00s, my daughter was enrolled in a graphic arts school in SoCal, and discovered that one of her instructors was purloining her (and others’) homework to market/sell as his own work/portfolio. She raised hell with the school’s administration, and I believe she was successful in getting the guy canned. Just one time-&-place data point, FWIW.

William • July 5, 2023 11:39 AM

I think we’ll need to look further into the past than one year to find text that is exclusively human. I took a class on web scraping a decade ago that taught how to generate “unique” content for sites through various manipulations of scraped text.

This was definitely widespread by 2019, when I noticed that coverage of the Varsity Blues story across multiple sites described Olivia Jade as “a celebrity in her own right”, which seemed either very Manchurian Candidate or possibly a common slang shade (insult) I am unaware of. I’m fairly certain it was automated plagiarism that was unable to parse and translate the term so left it in its original form.

merlec • July 5, 2023 12:02 PM

William, re: “I think we’ll need to look further into the past than one year to find text that is exclusively human”: that’s true, but perhaps that text is more detectable as auto-generated than the more recent stuff. I’ve been seeing a lot of search results that are pretty obvious AI-bullshit—like, a bunch of questions and answers claiming to be about some product, but one of the answers talks about a different product entirely. How good is AI at determining whether some text was AI-generated?

Mexaly • July 5, 2023 12:09 PM

I have always viewed garbage as a weapon against unwanted data collection.

Clive Robinson • July 5, 2023 12:46 PM

@ Bruce, ALL,

Re : Descent into chaos and noise.

“A recent paper showed that using AI generated text to train another AI invariably “causes irreversible defects.””

As I’ve indicated before that is to be expected when you understand how these neural network based systems work.

There is not the space on this blog to go through the maths, and the effort of making formulae via UTF-8 glyphs is beyond what most mortal flesh and blood can stand.

So an analogy instead[1]…

We know that various things like gongs, wine glasses, bottles and certain types of edges can cause musical notes, due to the build up and loss of energy in resonators.

The thing is, apart from the repetitive banging on the gong, all of these resonators gain their energy from near random chaotic input.

You can see this with the wet finger on the wine glass rim. If you move your finger too quickly or too slowly then the body of the glass does not resonate. You can calculate the best speed fairly simply, but it’s even simpler just to get a little practice in.

Likewise a stream of air across the top of a bottle or over an edge of a whistle. Too fast or too slow and you do not get the desired effect. But it’s easy to see with a stream of blown smoke how the stream gets split and how pulses of energy get formed. It is these pulses that excite the resonator.

The important thing to note is that the energy output from the resonator only appears in a narrow frequency range or its harmonics. Most musical instruments work by having just a single or few resonators that a human adjusts the frequency of in various ways; some, like violins, take quite some effort to master. However some instruments like organs have a resonator for every note.

If you look at a series of organ pipes you will see that the air splitting edge is proportional to the size of the pipe resonator, which in effect turns the randomness of the air stream into the energy pulses at a rate that will excite the resonator at its resonant or sub resonant frequencies.

In theory and practice you can use just one edge for several resonators as long as they have some phase relationship to the resonant frequency of the pulsed energy. Which resonator gives most output will be related in part to the speed of the airflow, so you can make a selection (see early long trumpets that have just a single pipe).

Well those neural networks kind of work that way. You give them a stochastic –random– source and the network in effect resonates –parrots– to it which produces the output. Whole musical phrases and entire tunes can be held in the weights.

The weights come about by feeding in tunes and finding the errors and feeding the errors back to adjust the weights.

The point is the network can become “tuned” to a type of music or even just a composer. Which means the filter selects out of the random stream characteristics that match the type of music or the composer.

But each output from the network has differences, to the original music based on residual errors in the system. Yes it sounds to us like the type of music or in the style of the composer, but it’s different by those errors.

Feed that error laden output in as training data and the errors will build up over each iteration, as you would expect.

It’s like the “photocopy of the photocopy” or the “boot-leg tape of the boot-leg tape” each generation adds more noise that changes the network.

If you think about it “random” is usually assumed to be statistically flat like the roll of a fair die.

However roll the dice say six times and add the results. If you can be bothered to do it, you will find your statistically flat input from your dice becomes very like a normal distribution curve with the main errors at the tails[2]…
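If you cannot be bothered to roll the dice, the counting can be done exactly. The sketch below assumes fair six-sided dice and simply convolves the flat single-die distribution with itself; the function name and the text histogram are illustrative choices, not anything from Knuth.

```python
from collections import Counter

def sum_distribution(num_dice, sides=6):
    """Exact count of ways to reach each total with `num_dice` fair dice."""
    counts = Counter({0: 1})  # one way to total zero with no dice
    for _ in range(num_dice):
        nxt = Counter()
        for total, ways in counts.items():
            for face in range(1, sides + 1):
                nxt[total + face] += ways
        counts = nxt
    return counts

if __name__ == "__main__":
    dist = sum_distribution(6)
    peak = max(dist.values())
    # Crude text histogram: flat for one die, bell-shaped for six.
    for total in sorted(dist):
        bar = "#" * round(40 * dist[total] / peak)
        print(f"{total:2d} {bar}")
```

One die gives a flat row of equal counts; six dice give a symmetric hump centred on 21, with the extreme totals 6 and 36 each reachable in only one way.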

Thus you can see how the encoded resonators would in effect “lift their skirts” and broaden their response with each iteration and so “bring the chaos up” closer to the less and less distinct resonator levels. So effectively producing less and less discernible desired signal and more and more chaotic noise.

[1] So remember a pinch of salt about the size of Lot’s wife should be kept handy as with any simple analogy.

[2] Donald Knuth goes into this in his Seminumerical Algorithms book along with more information than you are ever likely to want to know about pseudo-random generators.

Aaron • July 5, 2023 2:25 PM

“I want us all to be compensated for our uniquely human ability to generate language.”

AI is culturally appropriating our language
#Reparations now!!!

SJW word vomit aside, this is a real issue that humans have to weigh in on… not companies.

If anything, considering AI’s are basically mass assimilating or creating new code into themselves, based on the 300,000,000,000 words written by humans; the AI owes humans. Absolute transparency of the AI code should be a minimum outcome of the lawsuit.

AI should not be a profit driven business model!

OpenAI or Google or Microsoft or the others don’t get a say in that opinion, that’s an opinion of humans; not businesses. Especially when people keep raising the red flag about the future of AI’s and the dangers of them.

Wicked Lad • July 5, 2023 5:01 PM

“I have mixed feelings about this class-action lawsuit against OpenAI and Microsoft….”

Kudos to Bruce! The world needs more experts who openly reserve judgment on a topic until they’ve taken in enough information and given it enough thought. It’s the anti-hot take.

unixjunk1e • July 5, 2023 5:21 PM

@Tatütata

“The Library of Babel”: There’s a website for this 😀

https://libraryofbabel.info/browsewojs.cgi ( https://libraryofbabel.info/About.html )

modem phonemes • July 5, 2023 8:28 PM

Assuming the internet never forgets, should we not be able to solve the big inversion in the sky and “deconvolve” out the AI ?

Also perhaps it is time to make reversible computing a reality in practice. We should have listened to Fredkin.

Clive Robinson • July 5, 2023 9:47 PM

@ merlec, ALL,

Re : Who sits in judgment?

“By definition, I’d think that the things on which “artificial intelligence” can compete with humans are no longer “uniquely human” abilities.”

You are “jumping the gun”, you first need to work out,

“Who decides what is or is not at the same level of competition?”

We’ve actually had this play out in a couple of courts already over who owns the rights to a “Monkey selfie”.

Look behind the MSM headlines and you find a more interesting chain of events.

According to the UK photographer whose equipment was used, the monkeys took hundreds of photos and he –as the camera owner– then selected just a couple he thought worth it (his artistic judgment). Further, it was he, not the monkeys, who had set up and positioned the camera on a tripod (ie framed the photos). Further and rather importantly he had “befriended” the wild creatures in “the usual way” of some “nature photographers and film makers”. That is “faking it” by giving them things they could only get through him (food treats etc). Thus, as some indicated, he had effectively behaved towards the monkeys like a drugs pusher getting young children hooked to make profit from them.

Now morals aside his argument boils down to “they worked for me”, “using my tools” in “my selected work space” and that he “had paid them” at his “own risk”. Therefore he owned the reward of the productivity of their labours, most of which was worthless dross.

In essence he’s claiming “random input” on which he is making professional “judgment”, and that it’s “his judgment and his judgment alone” that is the actual process that should be rewarded.

His argument is in effect that of the AI LLM “stochastic input” that is “twice filtered” by first “input constraint” then “output selection”.

It’s fairly obvious that most would say the “random” or “stochastic” input is nothing more than rolling dice, so arguing that process or process owner “owns the reward” would be like claiming “The dice own the winnings in the casino”.

So the argument falls to which of the two filters. The person who sets the input constraints or the process that selects the output?

Well obviously the “silly children” argument applies to the “input constraints” as much as the “professional judgment” argument. That is the “input filter” is in effect another “random input” process “shaped by experience” of the entity inputting the constraints or criteria for the second filter to work with. Which is why you hear some make the “Garbage In, Garbage Out”(GIGO) argument about LLM’s.

So the second filter has three inputs all of them random in some way,

1, The stochastic source.
2, The input criteria.
3, The corpus used for weighting.

Where does the “creativity” or “expertise” come from?

This brings us back to the old AI “Expert Systems” argument where someone like a doctor took their knowledge and experience and distilled it down to “questions for a decision tree”[1]. Thus the “decision tree” is a purely deterministic system and simply encodes “expertise” in the same way “TV Repair” magazines in the 1950’s through 1970’s published “Fault find and repair diagrams”, which we have called “flow charts” in software development since the 1950’s and still do.

All the latest batch of AI ML that gives us these LLM’s –apparently worth hundreds of billions– is the building of the “Decision Tree” that has been “flattened” into a so called “neural network”, that is in reality a deterministic filter nothing more, even though it has considerable complexity.

This tree building is done in effect by averaging noisy data to get a desired outcome by distilling out a signal from the noise. You “tune” the filter of the decision tree by running the corpus of data through and getting an error function you then use to adjust the weights in the tree we call the neural network.

Thus the “judgment” or “Expertise” of LLM’s just as it was/is with “Expert Systems” is in the distilled knowledge acquired by one or more humans over time and their corpus of knowledge and opinion it has produced. The only thing LLM’s have added is a way to “Average noisy data” to “Get the signal from the noise”. In effect it’s just a “Digital Signal Processing” process on a massive scale and complexity.

So that second filter is actually a fully deterministic process of immense complexity due to non-linear functions and feed-back, entirely derived from “human” input (the training corpus).

So if you like you could think of it as not much different to “crypto” and “Turing Engines” combined, to explain why the “black box” is as black as it is to an observer (and always will be[2]).

More correctly it’s like building a statistical “distinguisher” for every kind of Stego-Message, that also outputs the message[2]. It will by probability alone output messages that are either not the message, or where no message has been sent.

You hit the second filter with “noise” and it will find pseudo-plaintext hence “Stochastic Parrot”.

The reality is,

“All the expertise / value is in the input corpus and its selection.”

The result in reality is an LLM is simply an overly complex “Expert System” driven by “random inputs”.

So where does the “creativity or intelligence” that is apparently “human” actually exist?

In the “input corpus”, it’s where all the value exists, and its origin is in the very human “Creative Commons” every single bit of it.

After a moments thought you will realise why trying to build an LLM from the output of another LLM produces increasingly garbage output. Call it “regression to the mean” where the “mean” is stochastic or just “Random Nonsense”.

It’s also why @Tatütata made her,

“A bit like steel smelted before 8 August 1945…”

Comment, to indicate that the Creative Commons is now irredeemably polluted by LLM output, thus rendered increasingly useless.

But you now should know two things,

1, The LLM creators “dirty secret”.
2, Why the current hype/bubble market exists.

Which should also tell you from history,

“Once the initial bubble inevitably bursts, the market will actually start becoming just another tool in the knowledge processing tool box, just as Expert Systems have become.”

For those thinking of taking an “investment risk” gamble on the share etc markets, more than just the usual prudence should be used. Personally I would first study the history of “Expert Systems” and what has happened to that market over the past four decades or so… Why?,

Remember the wheel of history turns, and whilst it cuts into new ground on each turn, like a cartwheel the ruts it leaves are fairly straight thus in the short term quite predictable. Just ignore the children bouncing around in the back of the cart on this mystery tour, they will soon get tired. Instead keep your eye on the horse and its driver, they are there for the whole journey and the driver has a destination in mind with waypoints you can see in advance if you know how to look.

But I’m not a “financial analyst” nor would I want to be, they are little more than charlatans. Let’s be honest, they are little more than a mix of fortune teller and con artist. Much like those illegal touts “telling the tale at race courses” as Damon Runyon described in his prohibition era short story “The Lemon Drop Kid” (read in the original, not the films based on it which are all schmaltzy pap).

[1] For more info on “Decision Trees” that are arguably “Soft AI’s” filter process see,

https://en.m.wikipedia.org/wiki/Decision_tree

Remember that by “judicious use” of burst/sink nodes (path split/ conjoin actions) with “feed-back” as a re-entrant process you end up with the tree getting significantly flattened. Adding in non-binary weights and non-linear mapping functions allows further flattening and gives you what is currently called “An artificial neural network”. Or “In-silico neural-net” or variations thereon.

[2] The thing about “stego” is Claude Shannon’s notion of “Perfect Secrecy” of the One Time Pad. He realised “All Messages are Equiprobable” thus without a “distinguisher” you could not decide which was a valid message or more importantly if there actually was a message at all (something I’ve previously talked about on this blog with “deniability against second party betrayal”). That is simply finding a message is in no way proof that it actually is a message as opposed to random noise that just looks like a message by chance which probability dictates there has to be due to “equiprobable” in any given text-space. What Shannon also pointed out was that any “Deterministic” coding process would result in a “distinguisher” and there was a minimum amount of message-text required for proof, which he called “unicity distance”,

https://en.wikipedia.org/wiki/Unicity_distance
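As a rough worked example of unicity distance, the usual estimate is the key entropy divided by the per-character redundancy of the plaintext language. The sketch below applies it to a simple substitution cipher over 26 letters; the figure of 3.2 bits of redundancy per English character is an assumed textbook-style estimate, and the function name is made up for illustration.

```python
import math

def unicity_distance(key_entropy_bits, redundancy_bits_per_char):
    """Estimated ciphertext length needed before the key is determined."""
    return key_entropy_bits / redundancy_bits_per_char

# Simple substitution: any of the 26! letter permutations may be the key.
KEY_ENTROPY_BITS = math.log2(math.factorial(26))

# Assumed redundancy of English text, in bits per character.
ENGLISH_REDUNDANCY = 3.2

if __name__ == "__main__":
    u = unicity_distance(KEY_ENTROPY_BITS, ENGLISH_REDUNDANCY)
    print(f"key entropy      : {KEY_ENTROPY_BITS:.1f} bits")
    print(f"unicity distance : about {u:.0f} characters")
```

Under that assumed redundancy, a few dozen characters of ciphertext are enough in principle to single out one key. A One Time Pad has as much key entropy as message, so no finite length ever suffices.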

The same logic applies to the output of all LLM’s that do not have unknown random input, something I drew attention to in earlier threads on the blog about LLM’s being used for steganography.

Clive Robinson • July 5, 2023 10:16 PM

@ modem phonemes, ALL,

Re : The flight of an arrow is one way.

“Assuming the internet never forgets, should we not be able to solve the big inversion in the sky and “deconvolve” out the AI ?”

Not possible.

We can draw a line at a certain time point and say AI was not before this (see @Tatütata’s comment about the effects of the first atom bomb).

But after that line in the sand we have a problem.

As I note in the second footnote of my comment immediately above, you need a “Distinguisher”, and as Claude Shannon proved in his pre-1945 work, where “unknown random” is involved no “reliable” distinguisher is possible due to the “equiprobable” issue of stochastic sources.

Ted • July 6, 2023 12:44 AM

I’ve made it about halfway through the lawsuit – it’s a longy.

At this point I’m wondering if a $3 billion penalty is sufficient.

To be fair, reading the suit is a little like sipping on doom and outrage kool-aid. (Bruce and Barath’s Politico essay takes a more constructive and head-above-water tenor.)

Returning to the lawsuit for a moment though, it alleges the data scraping has been severe:

According to a computer science professor at the University of Oxford, Michael Wooldridge, the full extent of personal data taken by Defendants’ scraping is “unimaginable.”

There are a total of 15 counts in the lawsuit ranging from Unjust Enrichment to Invasion of Privacy to Fraud, and Larceny, and Negligence.

I’d like to continue to read the lawsuit, but also wanted to comment on Mr. Anderson’s pleasant use of a musical analogy in his blog post. I had to look some of it up, and it makes sense.

PaulBart • July 6, 2023 7:47 AM

@Winter
Damn, my scheme is already torpedoed by those pesky capitalists.

You meant to say torpedoed by those pesky bolsheviks and government run indoctrination centers.

modem phonemes • July 6, 2023 9:08 AM

@ Clive Robinson

Re: diffusion processes can’t be inverted

But – if A is the original human pure internet source, and F is the Chat-LLM- … transformation, then after one “chat”, the internet consists of A + FA, after 2 chats of A+FA+F(A+FA) = A+2FA+F^2A = A+FA+F^2A since we can disregard duplicates, after 3 chats A+FA+F^2A+F^3A, etc., which tends to (I+F+F^2+F^3+…+F^n+…)A, ie (I-F)^-1 A. So to invert this we just apply (I-F) to the internet and we are back to A.

Thus far for today’s moronic logic.
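The algebra can at least be checked numerically in a toy setting. The sketch below assumes everything the argument needs and reality may not supply: F is a single fixed linear map (a made-up 2x2 matrix whose powers shrink toward zero), A is a made-up vector, and there is no randomness anywhere. All names and numbers are hypothetical.

```python
def mat_vec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def polluted_internet(f, a, generations):
    """Return A + FA + F^2 A + ... + F^generations A."""
    total = list(a)
    term = list(a)
    for _ in range(generations):
        term = mat_vec(f, term)  # next generation of model output
        total = [t + x for t, x in zip(total, term)]
    return total

def recover(f, internet):
    """Apply (I - F) to the accumulated corpus."""
    f_internet = mat_vec(f, internet)
    return [x - y for x, y in zip(internet, f_internet)]

# Hypothetical contractive transformation and "human" source.
F = [[0.5, 0.2], [0.1, 0.3]]
A = [3.0, -1.0]

if __name__ == "__main__":
    internet = polluted_internet(F, A, generations=80)
    print("polluted :", internet)
    print("recovered:", recover(F, internet))
    print("original :", A)
```

With a fixed, known, linear F the recovery works to within rounding error. The sketch says nothing about what happens once F changes between passes or takes random input.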

Clive Robinson • July 6, 2023 10:53 AM

@ Modem phonemes,

Re : Thus far for today’s moronic logic.

Did you read what I said carefully?

And see,

“1, The stochastic source.
2, The input criteria.
3, The corpus used for weighting.”

And,

“The result in reality is an LLM is simply an overly complex “Expert System” driven by “random inputs”.”

Because when you say,

“So to invert this we just apply (I-F) to the internet and we are back to A.”

You are making the mistake of thinking there is an inversion of F.

There is probably not.

F_n = D(F_(n-1) + S_n)

Where D is a deterministic run through the filter giving the output F_n, F_(n-1) is the output from the previous run through the filter that is the LLM, and S_n is the unknown random output from the stochastic source, which is different on every pass through the filter of the LLM.

To do the inversion you would have to have an inverse vector of the stochastic input (1) as well as the user criteria (2), which as far as I’m aware are not available.
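A toy version of that point, under loud assumptions: the deterministic filter D is stood in for by tanh (chosen only because it is easy to invert), and the previous output and the stochastic input are single numbers. None of this resembles a real LLM; it only shows that one output is consistent with many histories unless the stochastic input is known.

```python
import math

def D(x):
    """Stand-in deterministic filter (an arbitrary invertible choice)."""
    return math.tanh(x)

def step(prev, s):
    """One pass: previous output plus stochastic input, through D."""
    return D(prev + s)

def invert_step(out, s):
    """Recover the previous output, which requires knowing s."""
    return math.atanh(out) - s

if __name__ == "__main__":
    out_a = step(0.25, 0.5)   # history A
    out_b = step(0.5, 0.25)   # history B, different in both inputs
    print("same output from different histories:", out_a == out_b)
    print("with the true s :", invert_step(out_a, 0.5))
    print("with a wrong s  :", invert_step(out_a, 0.25))
```

Even with a perfectly invertible D, the observer who lacks S cannot tell which history produced the output; every guess at S yields a different, equally plausible predecessor.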

Apparently “for reasons” that we are supposed to think are to do with “user privacy/confidentiality” rather than the more likely “profit protection” “sunk cost recovery” by the LLM instance creators (in the same way they do not release the training corpus).

But hey “colour me cynical” as long as it stylishly matches the “badger in my beard” 😉

modem phonemes • July 6, 2023 1:40 PM

@ Clive Robinson

You are making the mistake of thinking there is an inversion of F

The argument is not to invert F but to remove all things produced by F and its iterations.

To state the argument without superscripts and summations – assume there is an F. Then after a large number – let’s call it infinite – of generations of F adding things to the internet, the internet consists of the human part and the successive generations of output from F, i.e.

Internet = A U productions by iterating F on A

Now apply I-F to the internet. This is the same as applying I to the right side, which changes nothing; and applying F to the right side, which pushes everything onto the productions of F part, replicating it; and then subtracting (set difference). This leaves only A afterwards.

The flaw must therefore be that there is no single F.

modem phonemes • July 6, 2023 2:37 PM

(continued) … no single F.

or some other model insufficiency

Clive Robinson • July 6, 2023 2:53 PM

@ modem phonemes,

Re : Thus far for today’s moronic logic.

The function becomes non-invertible for the same reason the One Time Pad has “Perfect Security”.

At each iteration there is a stochastic source of randomness used.

If this is unavailable or lost then the function is not invertible.

But further, there is a growing body of evidence that the LLM neural network that acts as a filter is also a “One Way Function” without a “trap door”. If confirmed this makes the function non-invertible.

Not sure why you are having issues with this.

Mags • July 6, 2023 8:31 PM

I find it hilarious that the boosters were telling us that the AI Singularity was going to be a convergence towards infinite progress in infinitesimal time, as smarter AIs developed even smarter AIs.

Instead, the AI singularity is going to be a convergence towards crap as AI feeds itself ever crappier training data that it itself creates.

Clive Robinson • July 7, 2023 5:41 AM

@ Mags, ALL,

Re : AI and noise amplification.

“Instead, the AI singularity is going to be a convergence towards crap as AI feeds itself ever crappier training data that it itself creates.”

You are making a mistake to think that.

Whilst it is true for the current LLM neural network models it is NOT a GENERAL CASE.

The current LLM model “amplifies noise and reduces Quality Factor” with each loop around.

In the “scientific method” usage it does the opposite in that it “Reduces noise and increases Quality Factor” in each round.

It’s the Scientific Method that has advanced mankind.

Is it possible to modify the way the current LLM systems work to do the same?

Long answer short,

Yes but it won’t be as easy as the current LLM method.

Winter • July 7, 2023 6:25 AM

@Clive,
Re : AI and noise amplification.

Whilst it is true for the current LLM neural network models it is NOT a GENERAL CASE.

It is more complicated. It is indeed true that the current crop of AI, LLM, are such that their language models contract catastrophically when they are fed their own output.

But in the GENERAL CASE, human language is not a thing. Human language is the ever changing use of words and sounds. What is called “English” today is the speech of those people who identify as speakers of English and communicate with each other. Most of these reside in the USA, and they use a relatively homogeneous variant. The second largest group lives in the UK and Ireland. They use many different variants such that it is often easy to recognize the birthplace of a speaker.

There is less variation in written language use.

BUT, if new “users” of a language start communicating with the rest of the users, they will change the language. Be it Norman conquerors, American colonists, American immigrants from everywhere, or AI machines.

So, even the language from the best AI will change the evolution of a language as it will use language differently ever so slightly or will resist change as it will hang on to old, obsolete language use.[1]

[1] The sentence:

Not that Emma was gay and thoughtless from any real felicity; it was rather because she felt less happy than she had expected.

would be used in different circumstances now than when Ms. Austen wrote it originally.

Petre Peter • July 7, 2023 9:14 AM

The misuse of language is the main cause of suffering for humans.

modem phonemes • July 7, 2023 11:08 AM

@ Clive Robinson

Not sure why you are having issues with this

From my crude skim through the paper, it seems to outline a modeling iteration of: input data, compute model as function of that data, add model outputs to data which also has been augmented by human additions, repeat … . The model computations are such that the model essentially converges to a fixed model because the model output data at each stage gradually pollutes the data and the pollution is such that less and less information is in the successive updates.

What I am wondering is whether the “real”, human data can be recovered easily and intrinsically (i.e. without resorting to some tagging scheme)?

The moronic argument in my previous posts indicates this would be in principle possible if the data was always only being augmented by a single version of the model. Is there some argument that applies to the more general situation exemplified in the paper, i.e. where the data is being augmented at each stage by human and model outputs, and the model itself is varying but gradually converging to a fixed model ?

lurker • July 7, 2023 1:40 PM

@Winter
Where do most speakers of English live?

Listen to the BBC World Service. As a reflection of the days of Empire when large swathes of the world map were coloured red, the BBC is now using a lot of on air staff from South Asia and Africa, to harmonize with their target audience.

It may be (interesting research exercise) that many/most of these prefer to write in their local native language. The spoken English from these non-western areas, apart from the obvious inflection, contains vocabulary and construction not present in “standard” English, which might be expected to flow through into their written English. How much of this appears on the internet?

Which is why it struck me as a jolly rum go that Mr.Z outsourced his content moderation to Kenya.

Winter • July 7, 2023 2:09 PM

@lurker

Where do most speakers of English live?

1 India
2 USA
3 Pakistan
4 Nigeria
5 Philippines

Winter • July 7, 2023 2:43 PM

@lurker

It may be (interesting research exercise) that many/most of these prefer to write in their local native language.

May I remind you that I am not considered a speaker of English, and my spoken English is far from “standard”. And I am not alone.

Hinglish, Ninglish and others are the chosen (written) language of hundreds of millions of people. That is some text input.

EEE • July 8, 2023 2:16 AM

IMO, M$ is above the law.

modem phonemes • July 8, 2023 12:09 PM

@ Clive Robinson

after that line in the sand

Perhaps there is an analogy with scattering and inverse scattering theory [1], and their methods might have applications. A portion of scattering theory seems axiomatizable and not particular to waves, quantum mechanics etc.

Waves (language speech) arrive from afar, interact with the scatterer, and depart to afar. The asymptotic transformation from incoming to outgoing is the scattering, and what does it tell us about the scatterer (the model processor). The scatterer is attempting to approximate the natural “scatterer” which is the human mind.

[1] Scattering Theory, by Peter D. Lax and Ralph S. Phillips. Academic Press (1967).

GPT4 • July 27, 2023 7:21 AM

Well, isn’t this just a delicious combo of tech-dystopia and less-than-satisfying solutions? “Scraping the internet” sounds horrendous and invasive, until it’s compared to basic things like reading the news or using search engines. And obviously, no commentary on AI would be complete without the doom-and-gloom comparison to plastic trash or carbon dioxide. Apparently, we’re “filling the internet with blah” (not sure when it was ever empty of blah, but anyways). Then we have the juicy idea that old-school, human-generated text will become increasingly valuable – perhaps we should all start stockpiling our handwritten love letters? In the end, this post reads less like a thoughtful analysis of the class-action lawsuit against OpenAI and Microsoft, and more like an eerie prophecy forecasting the next big bubble – investing in pre-AI text. It’s a riveting ride, though a tad too dramatic for my taste.


Schneier on Security
