Facebook engineers: we have no idea where we keep all your personal data


ctur 2 hours ago | next [–]
In many ways you can think of large, long-living tech companies not unlike old cities like, say, London or Paris. The buildings and roads you see are built on top of older buildings and ruins. The streets are weirdly shaped and intersect at odd angles because they were made hundreds of years before and adapted over time as needs evolved. There are catacombs underneath sidewalks and no one genuinely understands it all nor does a single reference exist that explains it.
Literally everything predates everyone who lives there. Generations and generations of original designers, architects, and laborers have arrived, plied their trade, and moved away. There are people who are experts in certain parts, and who can build a new skyscraper at any given spot, but it is just layering and organic growth.
The emergent complexity of centuries of being lived in and adapted defies easy understanding.
Large tech companies are similar. You just can't understand how "it" all works. If you were to build it from scratch, perhaps you could, because it would be simpler and clearer, but nothing was made with the current state in mind. It evolved and adapted over time.
So reading this, I am not surprised. I think you'd get the same answer about many other aspects of data, code, system history, etc at any other 10+ yr old tech giant.
kixiQu 2 hours ago | parent | next [–]
I work for a large long-living tech company. Where legal compliance and security are concerned, "it is just layering and organic growth" is not something you get to say. If we were speculating about such a thing on a coffee break, maybe we'd get the kind of answer given here, but if a single reference doesn't exist for something that is necessary for compliance, people get paged until it does.
(There may be other things taken as seriously I'm not thinking of, but compliance and security are the two I've seen drag people out of their beds by their ankles)
nicolas_t 1 hour ago | root | parent | next [–]
Counterpoint: in my career, I've seen a lot of security audits that look very thorough and detailed on paper, but when you dig deeper, you can see they only touched the surface and the auditor missed a lot of things.
kixiQu 59 minutes ago | root | parent | next [–]
I mean, sure, security is complicated, lots of judgment calls, but within this mindset, if a team/org/company said "we have no idea where we're using log4j" it was then their job to have someone working around the clock until they figured that out and fixed it. Figuring out how to meet new requirements that weren't around when systems were begun is, IMO, fully just part of the job. For FB engineers to be throwing up their hands in front of a legal system like oh nooooo how could we possibly track things seems like a stalling technique and makes all of tech look bad.
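To make the log4j case concrete, a first pass doesn't need anything exotic. Here's a minimal sketch of the kind of inventory I mean, assuming local checkouts under a made-up REPOS_ROOT; it only catches direct declarations in build manifests, not transitive or shaded dependencies, which is exactly why someone still has to keep digging:

    import os

    REPOS_ROOT = "/srv/checkouts"   # assumption: one repo checkout per directory
    MANIFESTS = ("pom.xml", "build.gradle", "build.gradle.kts", "ivy.xml")

    hits = []
    for dirpath, _dirnames, filenames in os.walk(REPOS_ROOT):
        for name in filenames:
            if name in MANIFESTS:
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as fh:
                    if "log4j" in fh.read().lower():
                        hits.append(path)

    for path in sorted(hits):
        print(path)
    print(f"{len(hits)} manifests declare log4j directly")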
as per my other comment, in most of the sufficiently big or historical cases you have to distinguish between two things:
- reality
- the socially acceptable fiction in the report
the exercise would be to create an output that allows people to pretend we know where all instances of "log4j" were used, or that sufficient depth and resources have been expended on such.
but in a sufficiently large and complex organisation, we (and by we I mean technically able, knowledgeable or sufficiently aware individuals) know that the content of such is likely to be as factually true as reading a historian's report entitled: "enumeration of all points in past century where cutlery has been used".
ACow_Adonis 45 minutes ago | root | parent | prev | next [–]
Counterpoint: the conclusions of legal, compliance and security rest upon social fictions and power structures, but fundamentally at a certain scale, all things act like the original commenter said.
In my experience with the above, people get paged until a "socially plausible fiction or resolution" can be created.
There are several versions of these, one of which is to engage in enough rituals and processes that we all pretend the task has been completed and the requisite outputs and points of reference exist. An archaeologist knows they don't and can't, and that we rest on bedrock of speculation, history, ignorance and myth, but we don't have to worry too much about the fiction we create, because we also know that mechanically disproving it is practically impossible/unlikely.
And we have a host of other social functions to get out of the cases where it could, such as:
- throwing people under the bus/human sacrifice/ finger pointing
- contrition
kixiQu 1 hour ago | root | parent | next [–]
They do have a burden. Following the law is table stakes. Now, are the laws relevant to a business the same across industries and around the globe? No. Does enforcement have its shit together equally everywhere? Obviously not. But making every team in a company classify the XYZ their component stores is absolutely doable and frankly an easy lift when the alternative is not doing business that involves XYZ.
PuppyTailWags 1 hour ago | root | parent | prev | next [–]
I think specifically any industry working in personal data collection of minors should absolutely have this governance burden. If Meta knows enough about their users to know Instagram is giving teenage girls mental illnesses, Meta engineers should absolutely be hoisted out of their beds if they don't have precise, well-defined, secure data stores.
conductr 26 minutes ago | root | parent | next [–]
The problem with your example is the engineers have built exactly what the company wanted. Perhaps the mental illness was an unintended side effect (perhaps known). But either way a decision was made that they want X, and so the engineers built X. What you're talking about is banning X, which is a bit different from a governance burden/compliance issue, as those can generally be followed to a specification and directly measured for deficiency (e.g. how to handle HIPAA data, how long to store records, etc.)
If the issue is about teenagers being too immature for the product X, then typically we ban it or age restrict it using legislation (eg tobacco, alcohol, porn, etc)
I think we as a society need to figure out where social media fits in and how we want to contain it and it's still early days for that. As much as I like the idea of an age restriction, I look around the world and see 6 month olds playing with iPads, 6 year olds on tiktok, etc. as the more 'average' way to parent and wonder if I'm in the minority. My kid is 4 and has never touched a device as entertainment, we taught him how to eat out at restaurants/be in public instead of pacifying him. But that seems like an old fashioned parenting method.
Ok, he's technically used it briefly as a viewfinder for the camera and to facetime his grandparents
mschuster91 1 hour ago | root | parent | prev | next [–]
If more industries carried this governance burden, we'd see a lot fewer cases of companies getting hacked left and right.
For way too many companies, "IT" is just another cost center that doesn't bring in any profits and so is barely kept at life support levels. And when something inevitably happens, it's rarely the C-level that gets to go to the unemployment office, it's more one poor soul who has complained for years that he needs more staff and budget to at least get the most egregious bullshit fixed.
phpisthebest 1 hour ago | root | parent | next [–]
I doubt that. Many of the companies that have been hacked were governed by these regulations, and they spent money governing to the regulations, not governing to security, which is not the same thing.
If I was burdened by regulation and red tape I can assure you my org would be less secure not more so
seti0Cha 52 minutes ago | root | parent | prev | next [–]
Hah, I might have agreed before I worked on a credit card processing system. There were plenty of rules alright, but it's possible to abide by all the rules without actually making a secure system. Some of the rules made it harder to create a secure system because they did not reflect the state of the art, or they made adding additional security too burdensome to implement. In the end, you can't force competence through rules.
adrianh 1 hour ago | parent | prev | next [–]
This is a seductive notion, but it's indeed possible for a tech company to understand how a single piece of data goes through its systems.
It might take a while, and it might involve weeks of code spelunking and dozens of conversations with engineers — but it is indeed possible.
A codebase is not encumbered with the physical constraints of an old city. It's not like trying to figure out the physical state of a certain area of London 30 meters deep — which might be impossible without digging and therefore disrupting other structures in the area. A codebase is simply instructions, which are readable and understandable (with the exception of black-box machine-learning models, but those at least have defined inputs and outputs).
>> It might take a while, and it might involve weeks of code spelunking and dozens of conversations with engineers — but it is indeed possible
If it takes a while it will be out of date by the time it is complete.
A single request might directly impact a dozen separate systems, which in turn call other services. You won't be able to find anyone who knows much past the layer they directly interact with. All of these services are being worked on. Some older ones are in the process of being deprecated in favor of newer ones. Someone is already working on the design of the replacements for the newer ones. Changes are being pushed weekly if not daily. If you send a request and then send it again a minute later, it might go through a completely different set of services based on a feature toggle.
You can't step into the same river twice.
leros 39 minutes ago | root | parent | prev | next [–]
The thing with tech though is that it's easy to add new connections, connecting anything to anything.
You might have two independent datasets that can't be stored together due to privacy or security regulations. The teams who work on those know and understand that. But some other team working on something completely unrelated might join those datasets for some non-nefarious purpose, which then creates a new dataset with those data combined. Now other teams who also don't know about the regulations might build on top of that combined dataset. They might even generate new data fields from the data that's not supposed to be joined. This stuff can propagate through layers of systems, queries, service calls, analytical engines, offline spreadsheets, etc. It's very difficult to keep track while keeping your engineering teams independent and nimble.
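A toy sketch of that failure mode (the tables, columns, and values here are all made up): the join itself looks harmless, but its output is a brand new dataset holding a combination that was never supposed to exist, and nothing tags it as such.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE profiles (user_id INTEGER, birth_year INTEGER)")  # regulated: identity data
    db.execute("CREATE TABLE page_views (user_id INTEGER, url TEXT)")          # "just analytics"
    db.executemany("INSERT INTO profiles VALUES (?, ?)", [(1, 1990), (2, 2008)])
    db.executemany("INSERT INTO page_views VALUES (?, ?)", [(1, "/ads"), (2, "/games")])

    # The query looks innocuous, but it materializes a new table combining
    # identity with behaviour, and downstream teams can now build on it
    # without ever seeing the original restrictions.
    db.execute("""
        CREATE TABLE browsing_by_age AS
        SELECT p.birth_year, v.url
        FROM profiles p JOIN page_views v ON p.user_id = v.user_id
    """)
    print(db.execute("SELECT * FROM browsing_by_age").fetchall())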
markstos 2 hours ago | parent | prev | next [–]
You can hire one or more data privacy people whose whole job it is to track this. They get an inventory of all the systems and talk to the managers of every system, find the answers and document them.
Meta is choosing not to know or pretending not to know. They have the resources to know where your data goes.
larnik 26 minutes ago | root | parent | next [–]
Hiring one person to do this at a company of this scale will give you results sometime in the next decade. BTW, when the report is complete, it will no longer be accurate. Sure, you can scale the process by putting more people on it, but it won't scale linearly. Additionally, you will need buy-in from all the system managers to dedicate time to assist the new data privacy team. It is doable, but it comes at a cost.
klooney 1 hour ago | root | parent | prev | next [–]
I mean, I've worked with those teams for GDPR stuff. They have no context, and no access or ability to trace data flow, so they often write up very partial and incomplete reports because no one else cares particularly deeply about the problem.
jollyllama 1 hour ago | parent | prev | next [–]
To build on your metaphor though, there are cabbies with "the knowledge" in London, and probably similar folks in Paris. The way I have seen this work in tech is, you have to keep the people who know where the bodies are buried around for a long time, and others will use them as a resource to inform their efforts. The alternative, of letting them walk over the years, is that large sections of your stack will have "here be dragons" written over them and become no-go zones where your only option is to route not only engineering but business efforts around them.
rockinghigh 18 minutes ago | parent | prev | next [–]
> I think you'd get the same answer about many other aspects of data, code, system history, etc at any other 10+ yr old tech giant.
Processes to access personal data vary wildly among tech companies. At Apple, when a machine learning engineer wants to store and use data on the server side, they need to go through layers of approval from lawyers. Even when approved, it often comes with serious constraints about what can be done with the data. Meta is a lot more lax.
BiteCode_dev 2 hours ago | parent | prev | next [–]
Maybe, but one can find out. You can follow the chain of calls of any software, and they have access to everything, as well as the whole history.
It's like when, in an interview, Buffett says it's very hard to know where all the money is in the financial system.
Sure, it's a complex system, but it's more of a matter of incentive.
ACow_Adonis 15 minutes ago | root | parent | prev | next [–]
As someone who's arguably professionally employed to do things like tell people where money is in the financial system, no, you can't.
It's the same way you can't enumerate the atoms or molecules in a human body: sure, you can come up with good estimates and summaries that are useful, but not the actual full and truthful state of affairs.
In addition, you are a self-referential participant in the system: so acting in an attempt to observe such likely immediately makes your information at least partially incorrect.
Additionally, if we can't get full transparency or introspection into machine learning models or neural networks (and we can't, because I don't believe we even possess the theoretical knowledge of how to do so), it should also be readily apparent that we can't actually know the true lineage or dependencies of systems that implement them in production.
Not to mention the difficulties as the inputs/outputs of such pass in and out of the notional entity or system being observed.
CaptainTaboo 2 hours ago | parent | prev | next [–]
It's not just the tech giants. I work for a company that is fairly dominant in its particular industry (if you're in the US, you've probably done business with one of its customers). Much of our growth has come through the acquisition of other companies, each of which gathers its own user and customer data. From my perspective in IT operations, I don't have any visibility into how we capture user and customer data, but I know we're big enough to track a user across the various products and websites, yet still siloed enough that internally we refer to the various business units as if they were separate organizations, each holding its own subset of the data. To sum up, this description of Facebook's internals doesn't surprise me at all.
liampulles 1 hour ago | parent | prev | next [–]
To extend this wonderful metaphor, perhaps the least terrible solution is for the "city workers" to constantly log any issues as they come across them in their normal day-to-day work. It is the responsibility of the city then to proactively go and fix those issues as they come up.
dvfjsdhgfv 1 hour ago | parent | prev | next [–]
As far as code goes, it's true for many companies. As for data, it was similar in many large European companies in the pre-GDPR era. Today, one crucial question is always asked: when you work with data, is this personal data? If it is, you need to deal with it in a special way. Personal data is both an asset and a liability.
Most companies found a way to do it. It was a long and often very painful process, involving everything from data entry to backup processes and procedures, but somehow we managed to do it.
After reading some more of the transcript, I think the article does such a bad job of describing what was being asked for that it makes the court seem incompetent and Facebook actually rather reasonable. My original comment is below:
Government contracted construction workers: We Have No Idea Where Your Tax Money Goes
When tasked with answering the simple question "Which specific bricks did my tax money buy this year", the two veteran construction workers looked confused and tried to explain that that's not really how taxes work.
The special master at times seemed in disbelief, as when he questioned the engineers over whether any invoices existed for a particular road building contract. “Someone must have a receipt that says this is who the money came from that bought these bricks”
I'm not sure the analogy fits 100%, but it's the closest I could think of. This reads like the author thinks the Facebook algorithm is a human-readable decision tree that only takes into account the data of a single user at a time.
As usual, the problem is not data "collection" or "retention" or "privacy", but "creation". Regulation will always remain woefully inadequate to control such organizations, and the only solution is to adopt systems that don't spew user data everywhere in the first place.
jtbayly 2 hours ago | parent | prev | next [–]
That’s an absurd analogy that doesn’t fit at all.
FB uses the data it can’t find or tell you anything about to successfully sell ad space to third parties targeted back at you.
Let’s make it concrete.
We know FB keeps track of which webpages you visit. We also know they use that data as a way to help target ads. That data isn’t in the data export they give you, as far as I can tell from a brief search.
Is there data coming in about which Instagram profiles you follow and what pics you clicked and liked? Is there other data they keep track of? Undoubtedly.
I’m not surprised nobody knows the answer to what all it might be. But pretending that data is fungible like money is just misdirection.
handity 1 hour ago | root | parent | next [–]
Agreed, Facebook definitely has the ability to provide timestamped interaction events as part of "Download my Data". Maybe legislation will force them to provide that information at some point, but that won't really have any effect on user privacy.
The hearing seems to want to get more to the heart of the issue, but is inept to do so. The stored data isn't the concern, it's the predictive capability of models built on it. The decision of a predictive model isn't any more traceable to a single piece of data than a brick in a road is traceable to a given taxpayer, which is what I was trying to get at with the analogy.
After reading some of the transcript, the special masters seem very on the ball, and it's more the spin of the article I take issue with.
nathanaldensr 1 hour ago | root | parent | prev | next [–]
Exactly. If Facebook can target ads at specific users to begin with, then how can they possibly claim they don't know where the data is or how to find it? Then how do they know the users are being targeted "correctly" or at all? It's a ridiculous argument that falls apart under any scrutiny. Their entire business model relies on them knowing how to target specific users using huge graphs of information.
jerf 3 hours ago | prev | next [–]
Kind of serves as a good example of "single central conspiracy" versus "many actors given common cause to act in a certain way" ways of thought. It's easy to imagine Zuckerberg going to work every day and laughing maniacally as he personally shepherds your stolen conversation with friends yesterday about your experiences with laundry detergents into The Facebook Info Vault and then personally uses that information to send you ads about Tide laundry detergent, but what Facebook really is by sheer necessity is a whole bunch of agents operating with their own goals. The end result is a massive machine that turns privacy violations into money, but there isn't necessarily a single place where the bad thing happens. In fact you could conceivably be taken on a tour of the system and agree that every individual component is acceptable, or that the vast bulk of them are OK and there's only a single-digit number out of thousands that are problematic.
Nobody could possibly manage Facebook as a centralized single entity, but it's hard to imagine it any other way from the outside.
jkingsbery 2 hours ago | parent | next [–]
While I appreciate this distinction and it's an interesting way to look at things, I don't think it's inevitable. I've never worked at Facebook, but I work for another large tech company. Every project at our company, particularly the ones that deal with personal data, must produce a threat model [1] before going into production. In another thread, someone claims "Well that's because the question is meaningless," but from a threat-modeling standpoint an engineer needs to understand all aspects of "where" data is for the system under design. Where physically is it stored? What database is it stored in? And knowing that, one must look at what threats exist. Who has access to the data? What other systems have access to the data? How are those users/systems authenticated? Is that access logged?
Now, sometimes systems diverge from the design, and sometimes threat models are incomplete. But the exercise of generating a threat model makes understanding who manages data more manageable.
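For the curious, the output of such an exercise can be as plain as a per-component record answering exactly those questions. This is only a hypothetical shape written down as Python, not my employer's actual format:

    from dataclasses import dataclass, field

    @dataclass
    class DataStoreEntry:
        component: str
        data_classes: list        # e.g. ["email", "device_id"]; empty if no personal data
        storage: str              # the database/cluster where the data physically lives
        readers: list = field(default_factory=list)  # people and systems with access
        auth: str = ""            # how those readers are authenticated
        access_logged: bool = False

    inventory = [
        DataStoreEntry(
            component="friend-recommendations",
            data_classes=["user_id", "contact_graph"],
            storage="graphdb-cluster-eu",
            readers=["ranking-service", "oncall-group"],
            auth="mTLS + SSO",
            access_logged=True,
        ),
    ]

    # The point of the exercise: any component holding personal data without
    # a complete entry here is itself a finding.
    for entry in inventory:
        if entry.data_classes and not entry.access_logged:
            print(f"unlogged access to personal data in {entry.component}")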
lupire 2 hours ago | root | parent | next [–]
> understand all aspects of "where" data is for the system under design.
That's easy! But also useless! The data is everywhere. That's why every "where" was created. Everything is a potential threat vector and needs to be protected by layers of fallible security.
lupire 2 hours ago | parent | prev | next [–]
Not even a conspiracy. Governments, the free market, and the natural world all work the same way. Many semi-independent actors making local optimizations, with varying levels of sophistication in organization and hierarchy. Many emergent properties. IT in the large is like physics, chemistry, biology, psychology, anthropology, and sociology. It's not building one precise bridge over and over again.
personjerry 3 hours ago | prev | next [–]
Well that's because the question is meaningless - What does it mean "all your personal data"? What does "where" mean? Physically? Which tables? Which data is relevant? Friends? Friends of friends? Ad data? Behavioural data? ANY inferences, models built on user datasets that might include that user?
They don't even know how to ask the right questions to get to what they want to know.
Context: I worked at Facebook
PeterisP 2 hours ago | parent | next [–]
Yes. All of that. That is the right question, the very first question as a starting point. A basic requirement from a company is to provide an exhaustive, true, up-to-date list of all of these - and if they can't, they should not be permitted to handle private data.
PeterisP 2 hours ago | root | parent | next [–]
The way I read this, his response saying that a full team would be needed to uncover this is not a flaw in the question - because IMHO that's exactly what was required - but rather that Facebook has not done the work required to meet their legal requirements. Yes, it's quite plausible that Facebook might need a team to do extensive work to produce the analysis and documentation about where private data is flowing - that's not a valid excuse though, Facebook simply needs to make that team and do that work, no matter if they want to or not, until they can properly answer these questions.
They need to have an exhaustive list of how they're using private data, and they need to have a process ensuring that their engineers are not adding new sources of private data or not using existing private data without the company approving and updating that list. Yes, their current processes aren't fit for that - as Meta documents quoted in TFA say "We do not have an adequate level of control and explainability over how our systems use data, and thus we can’t confidently make controlled policy changes or external commitments such as ‘we will not use X data for Y purpose,’" - so these processes must be changed.
alexhill 2 hours ago | parent | prev | next [–]
> They don't even know how to ask the right questions to get to what they want to know.
You're right. If the questions were being asked of someone who was genuinely motivated to answer them, this wouldn't be an obstacle.
The phrase "all your personal data" (or the quote from the article, "every bit of data associated with a given user account") feels vague and imprecise only if you have an internal understanding that for practical purposes, it really is impossible. The Facebook engineers know that and everyone here knows that. It's not literally impossible - an internal team of digital archaeologists with unlimited resources and no deadline might be able to do it for a single user. But practically impossible across the whole userbase.
If you were genuinely interested in providing insight to the hearing, you could explain all that, and then explain how you could go about creating a rough sketch of an answer, and how you'd go about adding some detail to that sketch.
The Facebook engineers have no real motivation to provide insight off their own bat. They can take the question literally, and answer honestly that they don't know, that nobody else knows, and that it's probably impossible TO know. I'm not even sure if this is acting in bad faith. I can easily imagine feeling cagey in this situation, and responding as such. Experiencing an emotion isn't acting in bad faith.
All that said...these questions were asked by a court-appointed subject matter expert. Either the subject matter expert is not the right person for the job, or...and now I realise that of course I haven't read the transcript and there's every chance that what's quoted in the article is not the most insight the court managed to prise out of the engineers. posting anyway in 3 2 1
>Well that's because the question is meaningless
No.
>What does it mean "all your personal data"?
It means all your personal data.
>What does "where" mean?
Where Facebook stores it.
This is only a complicated question if you want to start saying, "Well this isn't really their personal data, we just use it for x..." If it's data related to my account or me as a user, it's my personal data.
Maybe it's difficult to track all that down, but certainly that is a challenge worth tackling in a world where GDPR exists.
seanhunter 3 hours ago | parent | prev | next [–]
Companies which deal with personal data do so in the context of regulatory frameworks (CCPA, GDPR etc) which define specifically what the definition is and what’s covered. They also have to record what they collect and process and for what purpose.
If you’re Facebook you have to know this in order to process that data. You can’t pretend not to understand the question.
lupire 2 hours ago | root | parent | next [–]
The people who write the question don't understand the question. They are "not even wrong". They are telling the businesses to figure it out. Maybe that's desirable, or maybe it's pointless. In broad strokes you can say data is either fully controlled by the users, shared with other user data for product model building, or also used for ad targeting. But it's unclear if being more specific is useful to anyone. You can point to an example like equal housing opportunity enforcement, as it interacts with ML-targeted or attribute-targeted ads. It's equivalent to detecting discrimination anywhere a human makes an opinionated choice. It's a problem no one knows how to solve. It's not a law like "don't shoot or stab people" (which is also complex in some ways).
zo1 2 hours ago | parent | prev | next [–]
I just started reading the transcript. They asked a lot of questions trying to figure out details about various products.
It's 850 pages long, and in the few spots I jumped to, the guy kept answering with "I don't know", "I'm not familiar with this", etc. Even in the parts where he says "I'm familiar with this service", he goes on to answer "I don't know more than what's written there". The guy couldn't even answer whether certain databases were accessible by third parties (not implying they were).
This must have been painful to be a part of. 100 variations of "I can neither confirm nor deny".
ridgered4 1 hour ago | prev | next [–]
I recall when Windows 10 came out, some company demanded to know all of the things it sent back to Microsoft over the wire. Microsoft had some guy basically run Wireshark and a bunch of network scans while using it and put that in a report. It blew my mind because it implied nobody at Microsoft actually knew what it was collecting and when, any more than your average security researcher, possibly because there are so many sub-entities in the company hoovering it up for different purposes.
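That kind of black-box audit is roughly this (a sketch only, assuming scapy is installed and a capture.pcap was recorded while using the machine):

    from collections import Counter
    from scapy.all import rdpcap, IP

    destinations = Counter()
    for pkt in rdpcap("capture.pcap"):   # capture recorded while using the system
        if pkt.haslayer(IP):
            destinations[pkt[IP].dst] += 1

    for host, count in destinations.most_common(20):
        print(f"{count:6d} packets -> {host}")

It tells you where the traffic goes, but says nothing about why, which is the part only the vendor can answer.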
snowwrestler 2 hours ago | prev | next [–]
There is a difference between what is known and what is knowable.
For example if you asked me now if I know where all the receipts are for my tax-deductible donations, I would say no, I don’t know. Why? There is no reason for me to know right now.
But if the IRS told me I was being audited and must produce those records or go to jail, I would find them. And I have confidence that I could, because I know that in general, those are the sorts of records I keep (somewhere).
Most Facebook employees do not need to know “where personal data is” to operate their business, so they don’t know that. But at the same time, Facebook could not operate if the data was “nowhere” (destroyed), or if it was randomly distributed with no queryable structure.
So the question is not really what they know now, it is: what incentive do they have to go find out.
kornhole 1 hour ago | prev | next [–]
Because I suspected this mess, I took a less expedient path to deleting my account. I found a picture of somebody with my same name and made it my picture. I unfriended everybody I actually knew and only kept randos. I liked a bunch of random stuff. I did not actually delete my account until a year later to let the chaotic data propagate into all the subsystems, backups, and partner networks.
citizenpaul 1 hour ago | prev | next [–]
This is misdirection plain and simple. In the best most convincing way. Get tunnel vision (sorry) nerds to make an honest display of their ignorance of the bigger picture at Meta. Aww shucks we don't know where that data comes from....
Of course senior engineers don't know exactly where the data comes from. Does your mechanic know where each tire comes from, or care? Does your fav restaurant's chef know each field their produce came from? The suppliers know that info.
I know beyond any doubt that there are most definitely some people who know at least roughly where most data about most subjects is at Meta. I'm sure the data is vastly beyond what a human can process. However, some people there know where to go looking. The truth is in the piles of money they generate by selling that data.
Awww gee wizz your honor. We have no idea how we do it. People just keep throwing piles of money at us so we keep taking it. I don't know why they do it. We definitely don't have mountains of highly specialized data on our users, or tell advertisers we can get them exactly what they want and then provide that info to them. Somehow, though, when the court asks and there are no piles of money attached, we just can't make that magic happen...
LinuxBender 2 hours ago | prev | next [–]
This article reminds me of past audit experiences. It is important to know who to send into the conference room to talk to the auditor. The wrong people can answer too many questions or volunteer the wrong information, leading to more questions and going deeper down the rabbit holes. I suppose the same could apply in this case. It's just a gut feeling.
mathgladiator 2 hours ago | prev | next [–]
I believe we need a privacy centered view of technology. Even a traditional database creates a spiderweb of 'where is your personal data spread across so many tables', and I hope there is a reckoning against this approach.
I have considered pivoting my service into a privacy respecting data store ( https://www.adama-platform.com/ ), but I've yet to meet anyone that actually cares about user privacy to rethink their data layer.
fsociety 2 hours ago | parent | next [–]
Actually FB has some wicked privacy tech around their central data store, and they're building more. Nuance gets lost in these articles; for example, even with proper documentation there are simply too many things for one person to know where data flows in that company. And data flowing through RPC calls can make it hard to know what will fail downstream if you remove a certain piece of data.
Delphiza 1 hour ago | prev | next [–]
Many here on HN are saying that such a record of what tech companies do with data is a) impossible because complex systems have evolved over time and the (historical) flows of data are unknown and b) unnecessary because the need to know what happens to data is not relevant to the user or the questioner. The old models of processing data at scale to extract value, as FAANG has been doing, are eventually going to come to an end. Maybe not where there is a power imbalance, such as between Facebook and their users, but at least where users of services have more clout and are paying for the product. I see this in B2B IoT solutions where big customers are very picky about how telemetry is collected by product vendors and, if not pushing back hard, are at least choosing not to use services that are not clear on how data is processed and handled.
The amount of data that we all see, every day, that is grossly mishandled may signal the end of the ML and AI goldrush. You can only build models, and run data through models, with the consent of the owner of that data. Large producers of data (think vehicle fleet operators) are beginning to take ownership of their data, and are only _licensing_ it to processors for very specific purposes. In the example of vehicle fleet operators, they may only want route planning, and not have their data used to sell them tyres based on mileage. Also, while governments may be busy with other stuff currently, at some point they may decide to turn on the regulatory screws.
> The systemic fogginess of Facebook’s data storage made answering even the most basic question futile. At another point, the special master asked how one could find out which systems actually contain user data that was created through machine inference.
“I don’t know,” answered Zarashaw. “It’s a rather difficult conundrum.”
This is not a basic question because of how ML works. Can we say a system contains my data if my data was used as a training input to a neural network at some point in the past?
In that case, can I sue anyone using Stable Diffusion for stealing my data because the billions of images in its training set included something I created?
dannyw 2 hours ago | parent | next [–]
> In that case, can I sue anyone using Stable Diffusion for stealing my data because the billions of images in its training set included something I created?
No, because of the landmark Authors Guild, Inc. v. Google, Inc. case which found that ML training can be considered fair use in some circumstances.
tru3_power 2 hours ago | prev | next [–]
There seems to be a pretty concerted effort to paint Meta as worse than the other big ad-tech companies. Maybe it's just the contrarian in me, but I've noticed a big uptick in these stories over the past year.
y42 10 minutes ago | prev | next [–]
I assume it's the same in every company after a particular amount of time. They start clean and then everything grows organically, because you spend more money on innovation than on cleanup and documentation. It's not even a thing of competition or capitalism, but of limited resources.
0xbadcafebee 1 hour ago | prev | next [–]
"The special master at times seemed in disbelief, as when he questioned the engineers over whether any documentation existed for a particular Facebook subsystem. “Someone must have a diagram that says this is where this data is stored,” he said, according to the transcript."
Oh you sweet summer child.
mrweasel 3 hours ago | prev | next [–]
Besides the legal issues, isn't there also a technical problem? Facebook is potentially storing the same information multiple times, or missing opportunities, because an engineer didn't know that some data was already available.
I guess storage is cheap enough, but even for Facebook, being able to save say 10% on storage must mean something.
canoebuilder 1 hour ago | prev | next [–]
What are we talking about here? Data associated with specific users?
Well in order for your grandma to log into Facebook her user account must have a primary key associated with it so she sees her info when she logs in and not someone else’s.
We are talking about computers and databases. When did using a computer to search a database become a difficult, nigh impossible thing to do?
Even if design documents and flow charts or whatever don’t exist could they not fairly straightforwardly be reverse engineered by taking a sample of users and searching all databases for associated information?
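Something like this naive pass is what I have in mind (sqlite3 and the file name are just stand-ins for whatever data stores they actually run; it also misses derived and model-embedded data, which is admittedly the harder part):

    import sqlite3

    SAMPLE_USER_ID = 12345
    db = sqlite3.connect("warehouse.db")   # placeholder file name

    tables = [row[0] for row in db.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]

    for table in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, default, pk)
        columns = [row[1] for row in db.execute(f"PRAGMA table_info({table})")]
        for col in columns:
            if "user" in col.lower() and "id" in col.lower():
                count = db.execute(
                    f"SELECT COUNT(*) FROM {table} WHERE {col} = ?",
                    (SAMPLE_USER_ID,)).fetchone()[0]
                if count:
                    print(f"{table}.{col}: {count} rows mention user {SAMPLE_USER_ID}")

A production data platform would go through its catalog or metadata service rather than PRAGMA calls, but the principle is the same.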
This seems like a transparent ploy on the part of Facebook to avoid regulation by casting the perfectly doable, searching a database, into some sort of incomprehensible impossible task. The credulous author of the article and many comments here seem to be strangely buying into it.
FB investors don’t seem to have lost faith in the company’s ability to search databases. When it comes to making the company money from those searches, no problem at all. But regarding lawsuits or potential regulation, we can only throw our hands into the air and simply wonder and gasp at this incomprehensible mess we have made.
jtbayly 3 hours ago | prev | next [–]
> The special master at times seemed in disbelief, as when he questioned the engineers over whether any documentation existed for a particular Facebook subsystem. “Someone must have a diagram that says this is where this data is stored,” he said, according to the transcript. Zarashaw responded: “We have a somewhat strange engineering culture compared to most where we don’t generate a lot of artifacts during the engineering process. Effectively the code is its own design document often.” He quickly added, “For what it’s worth, this is terrifying to me when I first joined as well.”
The cynic in me wants to say that not documenting certain decisions is a common method of hiding immoral decisions.
roland35 2 hours ago | parent | next [–]
Trust me, at (most) big tech companies there isn't anything that nefarious going on at the engineer level. In fact everyone basically agrees that improving documentation is important, but it constantly gets swept aside in the constant churn of new features, fires to put out, and frequent meetings. Information gets spread out across various docs, wikis, readmes, etc.
What ends up happening is each part of the system has a "point of contact" who is the go to for that piece. Lots of decision history is in the heads of team elders. Hopefully they are still around!
laweijfmvo 1 hour ago | parent | prev | next [–]
Imagine that someone asks you to modify a service; so you study the code, make the change, test it, and push it. Now imagine they ask you to update all the docs/wikis to reflect the recent changes, or more realistically you just create another one and store it wherever you feel is appropriate.
In my experience there's a ~10% chance an up-to-date piece of easily discoverable documentation exists for any service or component, and there's a ~90% chance at least one out-of-date document exists. The latter seems much more harmful, like when you realize some PM has been referencing an out-of-date doc for the last 9 months.
988747 2 hours ago | root | parent | next [–]
I've been a programmer for 15 years and I've never seen it done for "job security". I think this is mostly about low tolerance for boredom of most programmers: writing code is exciting, documenting it is boring, so you do not do it unless forced to.
adamsmith143 1 hour ago | parent | prev | next [–]
So FAANG is held as the gold standard in the industry and this is their best practice?
>We have a somewhat strange engineering culture compared to most where we don’t generate a lot of artifacts during the engineering process. Effectively the code is its own design document often.”
