Once again, artificial intelligence (AI) is at the center of controversy, this time among GitHub users. The spotlight is on Microsoft for training an AI tool on their code. Copilot was reportedly trained on billions of lines of open source code gathered from various sites, GitHub among them. In response, open source coders are considering a class action lawsuit against the company.
A website has been set up to spearhead an investigation into the company. Matthew Butterick, a programmer and attorney, is assembling a class action litigation team in the hope of filing a lawsuit over the tool in question, GitHub Copilot. Many have been wary of Microsoft since Copilot's launch, coming as it does after the company acquired the collaborative coding platform in 2018.
GitHub and a whole conflict ahead
The tool was introduced as an extension to Microsoft’s Visual Studio coding environment. It uses prediction algorithms to auto-complete lines of code, powered by an AI model called Codex, which was created and trained by OpenAI on data scraped from code repositories on the open web.
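To illustrate the kind of completion described above (this is a hypothetical sketch, not actual Copilot output): a developer types only a function signature and a docstring, and the assistant predicts a plausible body from patterns learned in public code.

```python
# Hypothetical example of comment-driven auto-completion.
# A developer might write only the first two lines; a tool like
# Copilot would then suggest the function body shown below.

def average(numbers):
    """Return the arithmetic mean of a list of numbers."""
    # Suggested completion: a common idiom seen across public repositories.
    return sum(numbers) / len(numbers)

print(average([2, 4, 6]))  # prints 4.0
```

Because such completions are learned from existing repositories, common idioms like this one are exactly where a suggestion can end up closely resembling someone's published code, which is the crux of the dispute that follows.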
Microsoft stated that the tool was “trained on tens of millions of public repositories” of code, including those on GitHub, and maintains that this is an “instance of transformative fair use.” Unsurprisingly, many open source coders disagree with that characterization.
Butterick put it this way on his website:
“Like Neo plugged into the Matrix, or a cow on a farm, Copilot wants to make us nothing more than producers of a resource to mine.”
The programmer and lawyer expanded on this view, commenting that “even cows get food and shelter out of the deal. Copilot contributes nothing to our individual projects, and nothing to open source more broadly.”
The views of other programmers
More and more programmers have noticed that Copilot appears to reproduce their code in its output. On Twitter, open source users have spoken out, posting documented examples in which Copilot’s suggestions are strikingly similar to code found in their own repositories.
GitHub issued a statement on the matter, saying that training data taken from public repositories “are not intended to be included verbatim in Codex results.” An analysis shared by the platform stated that “the vast majority of the results (>99%) do not match the training data.”
For its part, Microsoft has placed the legal responsibility on the end user, who must ensure that the code Copilot suggests does not violate any intellectual property laws. Butterick regards this as a smokescreen, arguing that in practice GitHub Copilot is a self-serving interface that mines developers’ work without offering anything in return.