Recently, researchers Bob Diachenko and Vinny Troia discovered a “treasure box”: an Elasticsearch server holding 1.2 billion user accounts that had been made public on the dark web, where anyone could browse them.
Where does the data come from?
The researchers said they came across the server while looking through public information on BinaryEdge and Shodan, and that its IP address could be traced back to Google Cloud Services. Overall, the database held more than 4 terabytes of data that was open to public access.
Elasticsearch is a full-text search engine built on the Lucene library. It is used in corporate information websites, media sites, government sites, commercial websites, digital libraries, and search engines.
The details shared by the researchers show that the data was collected from social media platforms including Twitter, Facebook and LinkedIn, as well as GitHub, a hosting service for repositories of Git, an open source distributed version control system.
The data on the server is divided into four data sets. Three are labeled “People Data Labs” and one is labeled “OxyData”; both are data brokers.
Troia says that in the People Data Labs (PDL) data he found a landline number he had been given by AT&T 10 years ago. He never used the number, but the information recorded at the time was retained here.
The server was found to contain nearly 3 billion PDL user records, covering nearly 1.2 billion unique people and 650 million unique email addresses. These figures are consistent with what PDL advertises, and the researchers were also able to cross-check the data against the information returned by the PDL API.
In addition, the researchers compared the database with the public data of the two companies and found that it originated from them at least in part. The researchers described the PDL comparison in a blog post:
The data found on the open Elasticsearch server almost exactly matches the data returned by the People Data Labs API. The only difference is that the data returned by the PDL also contains educational history.
There is no educational information in any data downloaded from the server. Everything else is exactly the same, including an account with multiple email addresses and multiple phone numbers.
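The comparison described in the blog post amounts to a field-by-field check between a record from the server and the API response for the same person. A minimal Python sketch of that idea follows; the field names and values are hypothetical, and it makes no claim about the researchers' actual tooling or about the real PDL response format.

```python
# Marker for "this field is absent from the record".
_MISSING = object()


def differing_fields(server_record, api_record):
    """Return the sorted names of fields that differ or exist on one side only."""
    all_fields = set(server_record) | set(api_record)
    return sorted(
        field
        for field in all_fields
        if server_record.get(field, _MISSING) != api_record.get(field, _MISSING)
    )


# Hypothetical records for illustration only.
server_record = {
    "emails": ["a@example.com", "b@example.com"],
    "phones": ["555-0100", "555-0101"],
}
api_record = {
    "emails": ["a@example.com", "b@example.com"],
    "phones": ["555-0100", "555-0101"],
    "education": ["Example University"],
}

print(differing_fields(server_record, api_record))  # ['education']
```

A result that contains only the education field corresponds to the single difference the researchers reported.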
However, PDL co-founder Sean Thorne denied that the company owned the server, saying its owner may have used an enrichment product from PDL alongside other data enrichment or licensing services.
Separately, 380 million profiles among the 4 terabytes of user data were confirmed to come from OxyData, but that company also said it did not own the server.
So far, the researchers are not sure who exposed the server to the Internet, but the leak affects customers the two companies have in common and puts them at risk of data misuse.
It’s not the first time.
Apart from this incident, Elasticsearch servers have been left publicly accessible on several occasions, putting the personal data of unsuspecting users and businesses at risk:
Earlier this year, the personal information of more than 20 million Russian citizens was exposed on Elasticsearch servers.
In May, the payment card data of millions of Canadians, including CVV codes, was exposed after an Elasticsearch database owned by Freedom Mobile leaked online.
In December, another database containing the personal information of 82 million Americans was exposed online.
This string of data breaches has drawn the attention of a large number of attackers, who see exposed Elasticsearch servers as a possible entry point for their attacks.
Jason Kent, a hacker at Cequence Security, commented, “We’re seeing a new and potentially dangerous data association that’s different from what it used to be.” An attacker with a rich data set can build a highly targeted attack. Such attacks can expose password recovery information, financial data, communication patterns, social structures and more, which makes them a way to target high-level personnel.
The FBI has yet to respond.
The two researchers reported their findings to the FBI, although taking an exposed Elasticsearch server offline typically takes time. The FBI, however, did not give a clear response after receiving the report.
Randy Koch, chief executive of ARM Insight, said the massive data breach was devastating for the companies seen as owning the data, and that it also spilled the information of billions of people around the world.
The sheer volume of personal data contained, coupled with the complexity of identifying data owners, can raise questions about the effectiveness of our current privacy and data disclosure notification laws.
An incident like this can be effectively prevented if the companies that control the data synthesize their user information, because the process of data synthesis mimics real data while eliminating user-identifiable characteristics.
When properly synthesized, the data cannot be reverse engineered by hackers, yet it retains all the statistical value of the original data set, so it can still be used for analysis, marketing, customer segmentation, AI algorithm training, and so on.
Synthesizing data therefore offsets the reputational, privacy and compliance risks for businesses that control data.
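To make the idea of data synthesis concrete, here is a minimal Python sketch under stated assumptions. The toy records, the field names and the per-column sampling approach are all hypothetical; this is not the method of any company named above. It models each non-identifying column independently, so it ignores correlations between columns and offers no formal privacy guarantee, which a production approach would need to address.

```python
import random
import statistics
from collections import Counter

# Hypothetical toy records; every field name here is an assumption.
REAL_RECORDS = [
    {"name": "Alice Example", "email": "alice@example.com", "age": 34, "segment": "premium"},
    {"name": "Bob Example", "email": "bob@example.com", "age": 45, "segment": "basic"},
    {"name": "Carol Example", "email": "carol@example.com", "age": 29, "segment": "basic"},
    {"name": "Dan Example", "email": "dan@example.com", "age": 52, "segment": "premium"},
    {"name": "Eve Example", "email": "eve@example.com", "age": 38, "segment": "basic"},
]


def fit_marginals(records):
    """Summarize each non-identifying column on its own; names and emails are never read."""
    ages = [r["age"] for r in records]
    return {
        "age_mean": statistics.mean(ages),
        "age_stdev": statistics.stdev(ages),
        "age_min": min(ages),
        "age_max": max(ages),
        "segments": Counter(r["segment"] for r in records),
    }


def synthesize(model, n, seed=0):
    """Draw n synthetic records from the fitted per-column summaries."""
    rng = random.Random(seed)
    labels = list(model["segments"])
    weights = [model["segments"][label] for label in labels]
    synthetic = []
    for _ in range(n):
        age = round(rng.gauss(model["age_mean"], model["age_stdev"]))
        # Clamp to the observed range so no implausible ages appear.
        age = min(model["age_max"], max(model["age_min"], age))
        segment = rng.choices(labels, weights=weights, k=1)[0]
        synthetic.append({"age": age, "segment": segment})
    return synthetic


model = fit_marginals(REAL_RECORDS)
for row in synthesize(model, 3):
    print(row)
```

The synthetic rows contain no names or email addresses, while the age distribution and segment mix roughly follow the original records and so remain usable for aggregate analysis.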