The Data Daily

The Role of Context in Data

Okay, you've just finished filling up the data lake with your data hose, the data warehouse is filled to the brim with data bricks (if not Databricks), your data scientists are ready in their lab coats to pull the big red switch that will call down the lightning and turn all your data into some living, breathing Datastein's Monster (having, of course, perfected their synchronized renditions of "It's Alive! It's ALIVE!!!"), and yet something still seems to be missing.

Maybe it's the fact that the profit margins from your quarterly reports seem to be out of sync with the sales figures because they use different reporting years. Or that you have eighty-two different labels for personal names, some with the family name first, others with the given name first. Or that each of your databases has a different key for the same book or widget that you sell. Or that you have twelve different sets of codes for states or regions, seven different ways of representing key solvents, or numeric codes for classifications with no underlying metadata.

The moment that you start dealing with heterogeneous data, you will be dealing with heterogeneous data representations. Most integration efforts ultimately end up facing this conundrum, and the more fields or features that you are dealing with, the more complex this operation is. Typically, this complexity is due to several factors:

Dimensionality. Every numeric value is either an integer, a real number, or a complex number (with the possible exception of zero, which can be any of these), but such values also have a dimension: numbers of things, temperatures, physical or temporal lengths, and so forth. This metadata is occasionally captured in labels (or, more rarely, in data dictionaries), but without it, bringing together multiple ontologies or schemas can be awkward (if not necessarily that complicated).

Temporality. A related facet comes with dates and times, which can have upwards of hundreds of representations depending upon the region, the calendar reckoning, and time zones, not to mention situations where exact dates are unknown or known only approximately. Additionally, there are subtleties: if you have dates without times mixed with dates plus times, how do you determine when in a given day an event occurred?

Provenance. Being able to state where a particular piece of information comes from has historically been difficult, and it often gets lost entirely, especially with data lakes and warehouses. Without provenance, information isn't auditable.

Classifications. A classification is the association of an entity with a particular class or type. There is a tendency to assert that there is a dominant class that most closely templatizes that entity (so-called "is a" relationships), but in practice, most entities are actually identified by several classifications, some acting as roles, some acting as qualifiers.

Externalities. Externalities are relationships that are imposed on entities. For instance, a sales transaction involves a sale of a good or service between a buyer and a seller for a specific amount. Not surprisingly, externality records actually make up the bulk of all records in relational databases.

Authority Identifier. An authority identifier is a key that represents an organizational entity, issued publicly by that organization. Such identifiers (when qualified with the issuing authority) are usually long-term stable. For instance, a vehicle may have an internal production number, but within most countries vehicles also have Vehicle Identification Numbers, as issued by their respective government.

Schemata. Schemas identify, for a given relationship, the characteristics of that relationship and, in the aggregate, the structure of that data. They go by many names - schemas, data models, ontologies - each of which actually has some differences, but overall they exist to identify the relationships and rules that data structures follow.
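Several of the facets above lend themselves to concrete code, temporality in particular. Here is a minimal Python sketch of normalizing heterogeneous date representations; the format list and the midnight-UTC default for time-less dates are illustrative assumptions, not a complete solution:

```python
from datetime import datetime, timezone

# A few of the many date representations that show up in mixed data
# (this list is illustrative, not exhaustive).
FORMATS = [
    "%Y-%m-%d",            # ISO 8601 date
    "%m/%d/%Y",            # US style
    "%d %B %Y",            # e.g. "4 July 2020"
    "%Y-%m-%dT%H:%M:%S",   # ISO date + time
]

def normalize(raw: str) -> datetime:
    """Try each known format; a date with no time defaults to midnight UTC,
    which is itself a modeling decision that should be recorded as metadata."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date representation: {raw!r}")

print(normalize("2020-07-04"))  # 2020-07-04 00:00:00+00:00
print(normalize("07/04/2020"))  # same instant, different source form
```

Note that the "when in the day" ambiguity the text raises is resolved here only by convention (midnight UTC); a fuller model would carry the precision of the source value as metadata alongside the normalized instant.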

This cloud of metadata that surrounds data is itself data - metadata can be (and usually is) both. The distinction with metadata is that in general it provides data about the data, information that identifies the origin, role and structure of that data. Collectively such metadata is known as context.

Context is a powerful concept, but it's also a somewhat fuzzy one. Its origins lie in the Latin texere (to weave) and con- (together or with), though its earliest modern usage came in the 15th century (at the dawn of the printing era), where it described the contents of the various printing folios that were then woven together to create a book - one had the given folio that contained specific information, then the "context" of the other folios that both preceded and followed it. This very quickly evolved into the more abstract notion of that which is currently known, that which was previously known (the past), and that which remains to be known (the future).

As such, context can be seen as being the cumulative answers to the meta-questions what, when, where, who, how and why about a specific entity or event. What kind of thing is that duck? Where was it found? Who found it? When? How was it rescued? Why did you rescue it? If it looks and acts like a duck, is it a duck?

Not surprisingly, context as a concept has been seminal in philosophy, ethics, linguistics, law, psychology, and increasingly in computer science, where context identifies the focus of specific actions. In natural language processing, context consists of the parts of a lexicon that are relevant to a particular word, phrase or similar textual container.

In the realm of semantics, context can be identified for a given entity in a visual form. If that entity is a node in a graph, then the context of that node consists of all of the relevant nodes and edges that are connected either directly or indirectly to that initial node.

However, there's also a caveat to this. Contexts are bounded - even if there are connections between two nodes, the farther apart the two nodes are (in terms of the number of edges or relationships) the less relevance exists between the two, with the further caveat that relationships that form transitive closures tend to have higher relevance than those that don't.

So what exactly does that mean? An example may help here. Back in the 1990s there was a game that gained some traction for a while called Six Degrees of Kevin Bacon. The idea was that it was possible to connect any actor, producer or other Hollywood creative to the actor Kevin Bacon through movies featuring actors that were also in movies in which Bacon appeared. The same idea can be applied to anyone relating to Kevin Bacon (or anyone else, for that matter) anywhere in the world. (It was this idea, by the way, that reportedly inspired the creation of LinkedIn.)

However, it's also worth noting that the further away (and the more indirect the relationships), the less relevance that path has. The strength of that connection becomes attenuated over multiple hops on the graph. In that respect, you can think about context as being the collection of all paths from a particular entity to other entities where the path is above a given relevance value.
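As a sketch of this idea, the following Python assigns each reachable node a relevance of decay^hops from a starting node and keeps only those above a threshold; the toy graph, decay rate, and threshold are invented for illustration:

```python
from collections import deque

# Toy graph of entities; edges are relationships.
GRAPH = {
    "KevinBacon": ["ActorA"],
    "ActorA": ["KevinBacon", "ActorB"],
    "ActorB": ["ActorA", "ActorC"],
    "ActorC": ["ActorB"],
}

def operational_context(start, decay=0.5, threshold=0.1):
    """Return every node whose attenuated relevance (decay ** hops)
    stays above the threshold -- the bounded context around `start`."""
    relevance = {start: 1.0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in GRAPH.get(node, []):
            score = relevance[node] * decay
            if neighbor not in relevance and score >= threshold:
                relevance[neighbor] = score
                queue.append(neighbor)
    return relevance

print(operational_context("KevinBacon"))
# {'KevinBacon': 1.0, 'ActorA': 0.5, 'ActorB': 0.25, 'ActorC': 0.125}
```

Raising the threshold shrinks the context: with threshold=0.3, only KevinBacon and ActorA survive, which is exactly the "paths above a given relevance value" notion described above.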

A different metaphor may be of value here: You're in the woods with a flashlight on a foggy evening. You can see the trees immediately around you, but those that are farther away are more obscured, either because the fog attenuates the light or because the shadows of the trees obscure what lies behind them. This area that you can see is your context - it's where you are in the graph, and it contains that which is most immediately relevant to you. Technically speaking, the whole of the graph is the context, but your flashlight-illuminated area can be thought of as an operational context - and it is this concept, operational context, that is so important in the realm of search and semantics.

Before exploring this further, I want to look more closely at the second caveat I mentioned earlier. Sometimes you have relationships that repeat. For instance, in a family tree, the has ancestor relationship can be applied over and over again: A has ancestor B, B has ancestor C, C has ancestor D. These kinds of relationships are called transitive - and because they are, "A has ancestor C" and "A has ancestor D" are both also valid. Consequently, such relationships usually tend to have a lot of shared context.
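The expansion of such a transitive relationship can be sketched in a few lines; the ancestor facts and the brute-force closure algorithm below are illustrative, not any particular reasoner's implementation:

```python
def transitive_closure(pairs):
    """Keep joining (a, b) and (b, c) into (a, c) until no new pairs appear."""
    closure = set(pairs)
    while True:
        new = {(a, d) for a, b in closure for c, d in closure if b == c}
        if new <= closure:
            return closure
        closure |= new

# A hasAncestor B, B hasAncestor C, C hasAncestor D
facts = {("A", "B"), ("B", "C"), ("C", "D")}
print(sorted(transitive_closure(facts)))
# [('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D'), ('C', 'D')]
```

Three asserted facts yield three more inferred ones - the shared context that the paragraph above describes.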

Ontologists tend to look for transitive closure relationships because, when they are found, there is usually a lot of commonality about them. They form closures. This is where you get into set operations. For instance, a fiction series is a collection of books, usually with the same central protagonists and a broad story arc. A book in turn has a story arc as well, and the book is divided up into chapters, each of which has a specific arc too.

While a chapter is not a book and a book is not a series, the story arc for a chapter is contained in the story arc for a book, and the story arc for a book is contained in the story arc for a series. Therefore, the story arc for a chapter is contained in the story arc for a series, and as a consequence the set of characters in that chapter will be a subset of the characters in the series. Note that transitive closures are usually not symmetrical. Not all characters that are in the series will be in a given chapter.

However, it is far more likely that some relationship will exist between any two characters within the series than between two arbitrary characters picked from two different series. The transitive closure in effect provides a narrowed context for search. For instance, suppose a character references her sister in a given chapter without specifying her by name, but does give her mother's name, and in another arc her sister also gives her mother's name. Since the female child of your mother who is not you is your sister, the relationship can be inferred.
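That inference can be sketched as a simple rule over asserted triples; the names, predicates, and rule encoding here are hypothetical illustrations rather than any particular reasoner's API:

```python
# Asserted triples: (subject, predicate, object)
triples = {
    ("Alice", "hasMother", "Carol"),
    ("Beth", "hasMother", "Carol"),
    ("Alice", "hasGender", "female"),
    ("Beth", "hasGender", "female"),
}

def infer_sisters(facts):
    """The rule from the text: the female child of your mother
    who is not you is your sister."""
    mothers = {(s, o) for s, p, o in facts if p == "hasMother"}
    female = {s for s, p, o in facts if p == "hasGender" and o == "female"}
    inferred = set()
    for a, m1 in mothers:
        for b, m2 in mothers:
            if a != b and m1 == m2 and b in female:
                inferred.add((a, "hasSister", b))
    return inferred

print(sorted(infer_sisters(triples)))
# [('Alice', 'hasSister', 'Beth'), ('Beth', 'hasSister', 'Alice')]
```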

Where operational context makes a difference here is that it significantly reduces the number of characters who would need to be searched from potentially billions to likely no more than a few dozen. There's an old joke about a man searching around a streetlamp for his dropped keys late at night.

A good Samaritan comes up offering to help, and asks "Where'd you drop your keys?"

The man points down the road a ways and says "Somewhere over there."

"So why are you looking here?"

"I can see the road here."

If asked what an ontologist does, some ontologists will tell you that they are in essence looking for shapes in meaning. Not surprisingly, this doesn't make a lot of sense to most people, but it's not inaccurate. Taxonomists are people who classify things - they take a given term (for a fairly loose definition of term) and place it within an already established rubric or heading. They do so primarily by identifying the facets that most clearly define a given entity, then looking where those facets best intersect. Put another way, they do so by identifying the operational context of a given entity.

Ontologists, on the other hand, try to figure out the best way to create those operational contexts in the first place, how to optimize them, how to transform them, and how to utilize them for optimal effect. In some respects, this requires that they look at concepts less as meaning and more as molecules, in which it is the shape of the concepts, and not the specific properties of those objects, that is important.

For instance, consider a particular family, consisting in this case of a mother (Jane Doe), a father (John Doe), twin sisters (Emily and Erin) and a brother (Michael). Now, let's say that you were looking at how to model that family optimally. The first approach might be to create basic relationships:

Where things become more complex is when you realize that there is no way that you can incorporate a number of edge cases, such as Michael being the son of John and Lisa from a previous marriage, and other metadata (such as dates of marriage) also become more complex. An ontologist would look at this and may propose an alternative model, one where you have primary classes:

Ironically, you can also combine the two graphs, which, while visually a bit complex, gives you some benefits:

The principal benefit? You have transitive closure with the pair (has Birth Child | has Adopted Child), which you don't have with has Child. This doesn't have much immediate application, but if this is being used for a genealogy application, that transitive closure can be used to narrow down potential ancestors (or descendants) while not picking up half-relationships (Emily would have a grandmother through Jane, for instance, but not through Lisa).

Meanwhile, the introduction of two new classes - Marriage and Gender - make it possible to attach metadata to these relationships. In essence, we're identifying the shapes of relevant entities.

The notation was first introduced to me by Dave McComb of Semantic Arts, and combines SPARQL notation with an indicator giving the cardinality of the relationship:

Symbol   Meaning
*        Zero or more
+        One or more
?        Optional (zero or one)
.        Required (one and only one)
{n,m}    From n to m items
{n,}     At least n items
{,m}     At most m items

The ?item terms indicate instances of a given class, and for most ontologies represent the dominant classification (i.e., something analogous to the object of an rdf:type statement). Thus, the first line above would be read as "Items of class marriage have a property marriage:hasSpouse which has one or more values." Why not specifically limit this to two spouses? This ultimately becomes a legal question, and the question emerges of whether there will be counterexamples that require a change in cardinality. In some Muslim countries, for instance, polygamy is considered legal, and to the extent that the model can anticipate the exceptions, it is usually better to err on the side of the less restrictive definition unless it can be shown that the exceptions can be handled otherwise.
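As an illustration of how such cardinality indicators might be enforced, here is a hedged Python sketch that validates instance data against a toy schema using the symbols from the table; the schema, property names, and instance shape are all hypothetical:

```python
# Cardinality symbols from the table, as (min, max) pairs; None = unbounded.
CARDINALITY = {
    "*": (0, None),
    "+": (1, None),
    "?": (0, 1),
    ".": (1, 1),
}

# Hypothetical schema: a marriage has one or more spouses, an optional end date.
SCHEMA = {
    "marriage:hasSpouse": "+",
    "marriage:endDate": "?",
}

def validate(instance: dict) -> list:
    """Return a list of cardinality violations for one instance."""
    errors = []
    for prop, symbol in SCHEMA.items():
        lo, hi = CARDINALITY[symbol]
        n = len(instance.get(prop, []))
        if n < lo or (hi is not None and n > hi):
            errors.append(f"{prop}: expected {symbol}, found {n} value(s)")
    return errors

print(validate({"marriage:hasSpouse": ["Jane Doe", "John Doe"]}))  # []
print(validate({}))
# ['marriage:hasSpouse: expected +, found 0 value(s)']
```

Because hasSpouse is declared + rather than {2,2}, the two-spouse and many-spouse cases both validate, which is exactly the "err on the less restrictive definition" choice discussed above.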

Modeling is the process of identifying the minimum number of relationships a given set of entities needs to adequately describe the business requirements for that model. Entities are generally temporally bound: they come into existence, persist for a certain period of time, then go out of existence, and during that existence they are instantiated. Because of that, they can take some form of unique identifier.

Categorizations represent one or more potential states that a given entity can be in. They may be assigned subjectively (such as the genre of a movie or show) or represent a particular set of constraints for access (such as the movie content ratings that identify a minimum age to see such a movie). While these are often conflated with controlled vocabularies, most controlled vocabularies actually specify entities, brands, or organizational programs.

Indeed, the bridge between categorization in semantics and categorization in machine learning really comes down to the degree to which one can decompose the category into orthogonal linear unit vectors. Putting that into English: what we call genre can be broken down into a set of scales. Was it completely serious, completely comedic, or somewhere in between? Was it completely realistic, highly fantastic, or somewhere in between? By getting a sample of users to identify genres based upon these scales (and possibly cross-correlating with movie ratings to test them) it becomes possible to determine abstract genres (or other controlled vocabularies, such as user sentiment) and then calculate genre from numeric preferences. You can get into a full machine learning approach by also having your test audience provide a weighting factor to determine the importance of a given scale in calculating genre (or many, many other types of variables).
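A bare-bones sketch of calculating genre from such numeric scales follows; the genre prototypes and scale values are invented for illustration, where a real system would derive them from user samples as described:

```python
# Hypothetical genre prototypes on two orthogonal scales:
# (seriousness, realism), each in [0, 1].
GENRES = {
    "drama":   (0.9, 0.9),
    "comedy":  (0.1, 0.7),
    "fantasy": (0.6, 0.1),
}

def nearest_genre(seriousness, realism):
    """Pick the prototype with the smallest Euclidean distance --
    a minimal stand-in for the user-derived scales in the text."""
    def dist(proto):
        s, r = proto
        return ((s - seriousness) ** 2 + (r - realism) ** 2) ** 0.5
    return min(GENRES, key=lambda g: dist(GENRES[g]))

print(nearest_genre(0.85, 0.95))  # drama
print(nearest_genre(0.5, 0.15))   # fantasy
```

The weighting factor mentioned above would enter as per-scale multipliers inside dist(), turning the distance into a learned, rather than fixed, measure.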

Having traveled in a seemingly tangential direction for a bit, it's worth bringing the discussion back to the central thesis of this article. One of the most important things to understand about context is that it changes. Assignment of categorization changes as the underlying data in the system changes. The next generation of data "storage" - likely on graphs - will in fact be more like Conway's Game of Life than a traditional database, in part because we are coming to realize that state changes in one entity almost invariably cause reactions in other entities as they adjust, and that it is easier to conceptualize "updates" as the creation of new conceptual graphs (even though in practice such changes are still comparatively sparse).

This is one of the reasons that working with such large-scale dynamic graphs requires a different way of thinking about how we build data systems. For instance, in traditional databases, creating stored procedures was frowned upon because of the potential for system cascades. In semantic databases, on the other hand, SPARQL Update involves a process of identifying the node and relationship patterns to be changed, using these to calculate the triples that need to be removed and added, then making the change. Unlike a traditional database, where the schema is considered immutable (you can add or drop columns or tables, for instance, but these are done under a separate transaction from data changes), with semantic systems you can in fact change properties and relationships dynamically.

As an example, in a semantic update you could change a column that gave a length in inches to one that gave a length in centimeters, while at the same time changing the property name and corresponding dimension information for that property. This is novel in several respects. It turns out that when modeling temporal relationships (such as the marriages above) it is useful to have a "now" graph that contains simplified relationships specifying the state of a marriage as it exists at the moment:

This query checks first to see whether the marriage between person1 and the old person2 is a triple, and bails out if it doesn't (i.e., the link has already been removed). It then identifies marriages that have ended and new marriages that have begun (if any), checking further to ensure that it's not processing existing marriages. Once this is done, the current state of affairs can be represented with the inferred triples.
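The original query isn't reproduced here, but the logic it describes can be simulated in a few lines of Python over a set-of-triples "now" graph; the predicate names and the update function are assumptions for illustration, not SPARQL itself:

```python
# A tiny "now" graph as a set of (subject, predicate, object) tuples.
now_graph = {("person1", "marriedTo", "person2")}

def apply_marriage_update(graph, ended, begun):
    """Simulation of the update described in the text: drop 'marriedTo'
    links for ended marriages, add links for newly begun ones, and leave
    already-current links untouched."""
    for a, b in ended:
        graph.discard((a, "marriedTo", b))
        graph.discard((b, "marriedTo", a))
    for a, b in begun:
        graph.add((a, "marriedTo", b))
    return graph

apply_marriage_update(now_graph,
                      ended=[("person1", "person2")],
                      begun=[("person1", "person3")])
print(sorted(now_graph))  # [('person1', 'marriedTo', 'person3')]
```

The key point survives the simplification: the update never rewrites whole records, it retracts and asserts only the triples that the new context invalidates or implies.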

These are transient assertions - they are true today, but may not have been true yesterday or may not be true tomorrow. A good knowledge graph is filled with these kinds of assertions, which are in fact easier to query. What's so significant about these is that they show how context comes into play. If you insert the simple assertion:

into the triple store, and invoke the rule above as part of the assertion process (or perhaps in timed fashion if such updates are applied in a batch mode), then it is the context that drives the changes, using logical inferences to determine what that context is.

This is a sea-change in another regard. With complex records, especially when you have hierarchical structures, it is rare that the entire record changes. More likely is that there are a few key triggers - a change of status as above - that change, with the bulk of the record remaining the same. This notion that you can in fact make just attribute-level changes is an important one, as it reduces the complexity of information and the potential for either corruption or miskeying. In short, you access the smallest amount of information necessary to maintain contextual awareness, change that at the edge, then update it in the store.

This is something of a gedanken-piece for me. Context is an important concept but is also notoriously difficult to articulate. At least from a data perspective, context is also something that makes more sense when talking about graph and inferential systems, because in both cases (there's a lot of overlap) the context consists of the metadata associated with that data, which in turn has its own metadata. In theory one can go arbitrarily deep, though it has been my experience that operational context, where there's an inverse correlation between relevance and graph depth (the farther away two nodes in a graph are, the less likely that one node has significance to the other), plays a major part in deciding how to model contextually.

I hope to come back to the concept of context again, as I think it is one of the most important notions in both semantics and machine learning. I also want to explore operational context at a deeper level of formalism, as I suspect that it gets into Kurt Gödel territory, touching, in a code-oriented way, on the idea that incompleteness and context are complementary principles.