Jeff Jonas
Using Transparency As A Mask
Wed, 08/04/2010 - 22:42
[This is a re-post. I posted this on the Concurring Opinions blog August 2nd, 2010.]
As mankind deploys increasing numbers of sensors, and makes more sense of this data, more of our secrets are revealed. In a world of greater transparency, will you be able to be you? Or will you feel obligated to mask who you are, drawn to the safety of the center of the bell curve?
Will a more transparent society make you average?
Imagine for a moment that video feeds from street surveillance cameras are the blue puzzle pieces, your path through life lit up by your cell phone location the green puzzle pieces, and your Facebook social network the yellow puzzle pieces. Flickr supplies the brown puzzle pieces and Twitter the orange. And maybe one day the energy-consuming devices in your home will be spewing out the magenta puzzle pieces. As an increasing volume and range of data converges, a colorful, highly revealing picture of our lives will unfold, with or without our knowledge or permission. Traditional physical sensors like credit card and license plate readers are one thing. The human as the sensor, thanks to Web 2.0, is altogether a different thing.
Unlike two decades ago, humans are now creating huge volumes of extraordinarily useful data as they self-annotate their relationships and yours, their photographs and yours, their thoughts and their thoughts about you … and more.
With more data comes better understanding and prediction. The convergence of data might reveal your “discreet” rendezvous or the fact you are no longer on speaking terms with your best friend. No longer secret is your visit to the porn store and the subsequent change in your home’s late night energy profile, another telling story about who you are … again out of the bag, and little you can do about it. Pity … you thought that all of this information was secret.
How will mankind respond? Will people feel forced to modify their behavior towards normal only because they fear others may discover their intimate personal affairs? This is what Julie Cohen and Neil Richards have worried about – the “chilling effect.”
Or, more optimistically, will the world become more tolerant of diversity? Will we be willing to be ourselves in a more transparent society?
Personally, I shiver at the thought of being on the hump … the hump of the bell curve. I hope for a highly tolerant society in the future. A place where it is widely known I am four or five standard deviations off center, and despite such deviance, my personal and professional relationships carry on, unaffected.
And oh, by the way, more goodness … diversity is good for resilience.
Miscellaneous: About the title of this post: I just thought it was a funny expression. Other funny expressions I enjoy include:
1) Kill all extremists.
2) When you can fake sincerity, you have it made.
RELATED LINKS:
David Brin’s: The Transparent Society
RELATED POSTS:
Six Ticks till Midnight: One Plausible Journey from Here to a Total Surveillance Society
Santa’s Surveillance Operations Center Enjoys Big Gains in 2008(*)
Ubiquitous Sensors? You Have Seen Nothing Yet
USC School of Cinematic Arts, “Imagine the World in 2050”
Van Halen, Risk Management and Breaking the Law (Allegedly)
Transparency, Privacy and Responsibility
P300 “Brain Fingerprinting”: A Very Freaky Future Indeed
The Truth is Out There … Way Out There … and Some Times it Should Be Left There
Responsible Innovation: Designing for Human Rights
Podcast: The Future of Privacy
Your Movements Speak for Themselves: Space-Time Travel Data is Analytic Super-Food!
Puzzling: How Observations Are Accumulated Into Context
Hell with Rules
Fri, 07/16/2010 - 06:15
Not long ago I found myself at a major financial institution talking about one of their fraud detection systems. Over the course of the conversation I stumbled onto the fact they have over 10,000 rules in place to detect fraud ... and oh so proud they were.
On the surface that might sound “powerful and amazing.” Nonetheless, that struck me funny it did. 10,000 rules … WOW! That must be brittle, expensive, and one giant liability I thought to myself. Such a detection system would catch exactly 10,000 things, nothing more, nothing less. Every new discovery would lead to new rules. Over time as the rule library further bloats it would get harder to manage and probably get slower and slower. And by the way, how many people actually understand all those rules and their interrelationship? Then as those people move on, how hard is it to get new people trained up on all those rules? Will they still be bragging about their extensive rule library when they have 20,000 rules?
Imagine telling your kid one day to quit throwing rocks at cars. Only to realize the next day you have to tell them to quit throwing rocks at SUV’s. Then the coming days, you realize you must also tell your kid not to throw rocks at trucks, fire engines, and ambulances. Ummm … 4,172 rules later you must come up with new rules like “don’t throw cans of Dr. Pepper at trolley cars.”
How about: “Don’t throw things at other people’s stuff.”
As parents quickly discover, teaching a principle like this is a much better course of action. While certainly not a perfect principle, at least it would roll up hundreds of explicit rules and catch countless conditions you never thought of. And yes, maybe this simple rule needs to be extended e.g., “unless they are bad people doing bad things and they need to be stopped.” That way if someone is coming at them on a skateboard with a knife they know it is okay to throw a chair at them.
Now back to the real world and a real example from my past. Circa 1993 we were building the first NORA (Non-Obvious Relationship Awareness) system for a casino. In this system the first relevance rule was basically: “Tell me when the bad guy is the good guy.” This one rule was created to detect and alert for such things as: the slot club loyalty card member is banned from gaming (on the Nevada Gaming Control Board’s Excluded Persons List) or the job applicant is a known gaming felon.
The second relevance rule was: “Tell me when the bad guy knows the good guy.”
With just these two rules, the system started kicking out all kinds of valuable, unanticipated insight including one of my favorites: An alert surveillance room operator noticed a dude cheating on a roulette table … making bets after the ball fell (called “past posting”). Dealers are supposed to watch for this. But somehow today this dealer kept missing this obvious scam. Casino security detains the cheater. The dealer says “I can’t believe this happened to me, I am so embarrassed, you surveillance folks are sure doing a good job, it won’t happen again.” During the arrest processing, the cheating player provided a different last name and address than used by the dealer. Fortunately, the cheater provided his real home phone number which happened to be the same number that the dealer had used on her original employment application.
The dealer, pretending up to this point not to know the player, rolled over in an instant and confessed when NORA popped off a real-time alert: “The cheater is related to the dealer.”
Behind the scenes this was data finds data followed by relevance finds the user. Relevance, in this case, was based on the principle: alert when the bad guy knows the good guy.
Had we deployed a traditional rules-based alert system, there was some chance the specific rule – if the employee’s job application phone number matches an arrest record – might have been missed. But because NORA was engineered around principles we caught this colluding roulette dealer. Notably, we would have also detected this had they been connected via an emergency contact phone number. Or maybe the player’s loyalty club card’s original address provided when they signed up (and since changed) was the same address used on the employee’s original job application (but not present on her current payroll record).
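To make the contrast concrete, here is a minimal sketch of the second relevance rule written as a principle rather than as a list of field-specific rules. It is not how NORA was implemented. The record layout, the `related_pairs` function, and every value are invented for illustration, and matching is reduced to exact equality on typed features.

```python
from collections import defaultdict

# Hypothetical records. A feature is a (kind, value) pair, so a phone number
# counts as a phone number no matter which form or field it came from.
RECORDS = [
    {"id": "employee-1", "role": "good",
     "features": {("phone", "555-0100"), ("address", "12 Elm St")}},
    {"id": "arrest-7", "role": "bad",
     "features": {("phone", "555-0100"), ("address", "99 Oak Ave")}},
    {"id": "club-3", "role": "good",
     "features": {("address", "99 Oak Ave")}},
]

def related_pairs(records):
    """Principle: alert when a 'bad' record shares any feature with a 'good' one."""
    by_feature = defaultdict(list)
    for record in records:
        for feature in record["features"]:
            by_feature[feature].append(record)

    alerts = set()
    for (kind, _value), sharing in by_feature.items():
        bad = [r["id"] for r in sharing if r["role"] == "bad"]
        good = [r["id"] for r in sharing if r["role"] == "good"]
        alerts.update((b, g, kind) for b in bad for g in good)
    return sorted(alerts)

for bad_id, good_id, kind in related_pairs(RECORDS):
    print(f"ALERT: {bad_id} is related to {good_id} via shared {kind}")
```

Nothing in the function names a job application, an emergency contact, or a loyalty card. Any source that contributes a phone or address feature is covered by the same principle, including combinations nobody thought to write a rule for.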
Data triage systems, especially those that must detect ever-changing crafty adversaries, should be principle-based where possible; otherwise, you won’t be just one step behind. You will be two or more steps behind!
Principle-based decisioning systems may surprise you … in a good way.
MISC NOTES
1. Maybe some classes of systems need a zillion rules, like the space shuttle program, for instance. But, that is out of my field so I don’t know.
2. The notion that “principles outperform rules” probably applies to most, if not all, of the decisioning processes. For example, I would prefer to see feature extraction, entity resolution, relevance detection, filtering, and insight publishing algorithms leverage principles over rules wherever possible.
3. Just to be fair, many systems will still have to have some very specific rules – like any transaction over $10,000 must be reported to FINCEN, it’s a law. This being not much different than telling your child they have to be home by 9pm on school nights, period.
4. And if you get to 10,000 principles, you might want to focus on more abstraction.
OTHER RELATED POSTS:
You Won’t Have to Ask -- Data Will Find Data and Relevance Will Find the User
When Federated Search Bites
Sat, 07/10/2010 - 08:51
I am probably stepping on some folks’ toes. My apologies.
First, let me explain what I mean by federated search. Federated search: conducting a search against “n” source systems via a broadcast mechanism without the benefit or guidance of an index. This is somewhat like roaming the three buildings of the Library of Congress looking for a book title … without benefit of a card catalog.
I am speaking specifically about environments where the systems in the federation are heterogeneous, are physically dispersed, were not engineered for federation a priori, and are not managed by a common command and control system.
By way of example, an airline might have a payroll system containing employees, a reservation system containing flight reservations and a watch list database containing people that are not permitted to fly. If this airline implemented federated search the data in these three systems would remain in these three systems. Searches (whether invoked by users or machines) are then broadcast to each source system. Note: Source systems receive queries for information they may or may not have, and as we shall see, receive queries for data they may have but have no means to locate in any efficient manner.
Federated search works fine if the goal is simply a reference system used to answer periodic inquiry. Such systems could be described as forensic in nature – when there is something of interest, one can look for it. Think of such federated search environments as systems where “the data only speaks when spoken to.” If this is what an organization needs, and there are a small number of queries and a finite number of source systems, federated search is a fine option.
Most organizations are not living in a world where “after-the-fact forensic discovery delivered only when asked” is acceptable.
Most organizations have some obligation to make sense of what they know. For example, the airline should know if the person added to the watch list is already an employee or already has a flight reservation. Ideally, the moment such facts become knowable, someone or some system should be notified. Think of this as “the data speaks to itself.” I call this data finds data.
This notion of data finds data implies the “data is the query.” As each new piece of data enters the organization, the organization has just learned something. And it is at this exact moment in time that one (a smart system) must ask: Now that I know this, how does this relate to what I already know? Does this matter, and if so … to whom?
Whether the data is the query (generated by systems likely at high volumes) or the user invokes a query (by comparison likely lower volumes), there is no difference. In both cases, this is simply a need for “discoverability” – the ability to discover if the enterprise has any related information.
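A small sketch of “the data is the query” may help here. The `Context` class and its `observe` method are hypothetical names, the identifiers are invented, and “related” is reduced to an exact shared feature. The point is only that each arriving record is first used to ask what is already known, and is then added to what is known.

```python
from collections import defaultdict

class Context:
    """Hypothetical in-memory stand-in for 'what the organization already knows'."""

    def __init__(self):
        self._seen = defaultdict(set)  # feature -> ids of records holding it

    def observe(self, record_id, features):
        """Use the arriving record as the query, then add it to what is known."""
        related = set()
        for feature in features:
            related |= self._seen[feature]
            self._seen[feature].add(record_id)
        related.discard(record_id)
        return sorted(related)

context = Context()
first = context.observe("payroll:emp-42", {("passport", "X1234567")})
second = context.observe(
    "watchlist:w-7", {("passport", "X1234567"), ("phone", "555-0100")}
)
print(first)   # nothing was known when the employee record arrived
print(second)  # the watch list entry finds the employee the moment it arrives
```

A user-invoked query could call the same lookup without storing anything, which is why the two cases need the same discoverability.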
If discoverability across a federation of disparate systems is the goal, federated search does not scale, in any practical way, for any amount of money. Period. It is so essential that folks understand this before they run off wasting millions of dollars on fairytale stories backed up by a few math guys with a new vision who have never done it before.
I will spare you the gory details of that day in 1996 when I came to witness such a federated search system. Multi-million dollar, very smart, middleware developed over a number of years was sitting atop a reported 2,000 data stores and 50B rows of data. Watching this large federated search system really drove home a series of epiphanies about the problems of federated search. Fortunately, the purpose of this particular system was a reference/forensic system that only had to respond to a relatively low volume of queries, primarily generated by users. And getting an incomplete answer from time-to-time would not be the end of the world.
To explain why federated search bites I will lay out three basic goals, three notional source systems, and four nasty problems (let’s call them challenges). Mind you, the greater the number of source systems, and the greater the transactional volumes, the more impossible it becomes to discover similar data across dissimilar systems (data finds data).
GOALS
Goal 1: Because the data must find the data, for every record added or updated in the federation one must determine if this information is related to any other records in the federation. Such discoverability must be able to keep up with transactional volumes and therefore must be near-real-time. [Note: To keep this really simple let us say related only means: shares an exact passport number, address, or phone number.]
Goal 2: Users should be able to pose queries themselves. Although, as it turns out, this goal does not matter because the discoverability properties needed to deliver on Goal 1 can just as easily be applied to this goal.
Goal 3: The federated search system must be scalable across hundreds or more disparate source systems. As such, new source systems must be able to be added to the federation without adverse consequence to existing source systems in the federation; otherwise, the greater the number of systems, the more unmanageable the environment.
SYSTEMS
Using the airline example, let’s say the three notional systems look like this:
System 1: A commercial-off-the-shelf payroll system (20K employees, <16 CPU’s, 200 transactions a day (subject to data finds data), system running at 90% utilization).
System 2: An airline reservation system (100M reservations, <265 CPU’s, 2,000 transactions a second, system running at 97% utilization).
System 3: A watch list database (subjects of interest) running on a commercial-off-the-shelf SQL database (1M records, <8 CPU’s, 1,000 changes a day, system running at 80% utilization).
CHALLENGES
Challenge 1: How will a new watch listing record containing a passport number (in System 3) efficiently locate related reservation records (in System 2) which share the same passport number? Here is the problem: An airline reservation system is typically designed to search on things like reservation number or flight number and date of departure, not passport number. Source systems are optimized for their purpose – maintaining only the necessary indexes. And, if by chance passport number is an indexed and searchable field in the airline system, are the addresses and phone numbers indexed as well? And what about the key values in unstructured comment fields? Due to this issue, federated search can produce incomplete results because a source system may contain related records but cannot find them. Note: It is not practical to re-engineer every source system to maintain all conceivable indexes.
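Challenge 1 can be illustrated with a toy source system. Everything here is hypothetical: the `ReservationSystem` class, its fields, and the three rows are invented, and a real reservation system looks nothing like a Python list. The sketch only shows a system that is fast on the key it was built for, slow on anything else, and blind to values buried in free text.

```python
class ReservationSystem:
    """Hypothetical source system, indexed only for its own purpose."""

    def __init__(self, rows):
        self.rows = rows
        # The only index the system maintains: reservation number.
        self.by_reservation_no = {row["reservation_no"]: row for row in rows}

    def find_by_reservation_no(self, reservation_no):
        """Served by the index: one lookup regardless of table size."""
        return self.by_reservation_no.get(reservation_no)

    def find_by_passport(self, passport):
        """No index on passport, so every row must be examined."""
        examined = 0
        matches = []
        for row in self.rows:
            examined += 1
            if row.get("passport") == passport:
                matches.append(row["reservation_no"])
        return matches, examined

rows = [
    {"reservation_no": "R1", "passport": "X1234567", "comment": ""},
    {"reservation_no": "R2", "passport": None, "comment": "pax passport X1234567"},
    {"reservation_no": "R3", "passport": "Z7654321", "comment": ""},
]
system = ReservationSystem(rows)
matches, examined = system.find_by_passport("X1234567")
print(matches, examined)
```

The passport search touches all three rows and still returns only R1. R2 holds the same passport number in its comment field, so the result is incomplete in exactly the way described above.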
Challenge 2: How will the payroll system (System 1) keep up with the flood of queries generated by the reservation system (System 2)? Here is the problem: The payroll system does not have the compute resources to sustain thousands of queries a second; it was not designed for that. Now maybe you are thinking: why would you do that? Well, data finds data is used to construct context (determine what one knows) in order to determine the right course of action. In this oversimplified example, maybe the airline likes to know when current or former employees make reservations so the right offers are made. Maybe terminated employees are not provided the same kind of offers as other former employees. Note: It is not practical to re-host the hardware of every source system such that it will be able to sustain the cumulative transactional volume of the federation.
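The mismatch in Challenge 2 is easy to put numbers on using the notional figures given for Systems 1 and 2. The one assumption added here is that every reservation transaction is broadcast to payroll as exactly one query. Queries and payroll transactions are not the same kind of work, so treat the ratio as a rough sense of scale rather than a benchmark.

```python
SECONDS_PER_DAY = 24 * 60 * 60

reservation_tx_per_second = 2_000   # System 2, from the notional systems above
payroll_tx_per_day = 200            # System 1, from the notional systems above

# Assumption: every reservation transaction triggers one payroll query.
payroll_queries_per_day = reservation_tx_per_second * SECONDS_PER_DAY
load_multiple = payroll_queries_per_day / payroll_tx_per_day

print(f"{payroll_queries_per_day:,} queries/day against payroll")
print(f"{load_multiple:,.0f}x its native daily transaction count")
```

That is 172,800,000 inbound queries a day against a system that normally handles 200 transactions a day and is already running at 90% utilization.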
Challenge 3: New information can be located during the federated search that warrants a re-query of the source systems. This is recursive. Imagine if the query is for a passport number that only exists in the watch listing database. But what if the watch listing database contains a matching record which reveals a new phone number? This newly discovered information, ideally, must be used to re-query the federated systems. For example, maybe there is a record in the reservation system with the same phone number and maybe this reservation contains a new address! Here is the problem: With each new feature discovered one must consider re-querying the source systems (again). Note: The hardware at each source system would not only have to support the transactional volume of the federation – but the recursive queries on top of that.
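The recursion in Challenge 3 can be sketched as a worklist. The `federated_discover` function and the three tiny in-memory “systems” are invented for illustration. Each newly discovered feature is broadcast to every system, and the function counts how many source-system queries one incoming passport number ends up causing.

```python
def federated_discover(systems, seed_feature):
    """Broadcast a feature to every system, re-querying on each new feature found."""
    seen_features = {seed_feature}
    pending = [seed_feature]
    found = set()
    queries = 0
    while pending:
        feature = pending.pop()
        for system_name, rows in systems.items():
            queries += 1  # one broadcast query per feature per system
            for row_id, features in rows.items():
                if feature in features:
                    found.add((system_name, row_id))
                    for other in features:
                        if other not in seen_features:
                            seen_features.add(other)
                            pending.append(other)
    return sorted(found), queries

# Hypothetical data following the passport -> phone -> address chain above.
systems = {
    "watchlist": {"w1": {("passport", "X1234567"), ("phone", "555-0100")}},
    "reservations": {"r9": {("phone", "555-0100"), ("address", "12 Elm St")}},
    "payroll": {"e4": {("address", "12 Elm St")}},
}
found, queries = federated_discover(systems, ("passport", "X1234567"))
print(found)
print(queries)
```

With three systems and a chain of three features, one incoming record costs nine source-system queries. The count grows with both the number of systems and the number of features uncovered along the way.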
Challenge 4: Can you be sure all systems, across all the time zones, are all on-line, all at the same time? What if the fourth system added to the federation is a small desktop application running a Microsoft Access database – will this system be left on-line at night and have a high-availability failover system standing by? The issue is: Heterogeneous systems have non-uniform availability.
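A back-of-the-envelope calculation shows why Challenge 4 gets worse as the federation grows. The figures are hypothetical, and the sketch assumes outages are independent, which real systems will not strictly obey. Under that assumption, the chance that every system is on-line at once is the product of the individual availabilities.

```python
import math

def chance_all_online(availabilities):
    """Probability every system is up at once, assuming independent outages."""
    return math.prod(availabilities)

# Hypothetical figures: 100 source systems, each available 99% of the time.
print(round(chance_all_online([0.99] * 100), 3))

# Adding one desktop database that is only on during office hours (say 1/3 of
# the time) cuts the chance of a complete answer to a third of that.
print(round(chance_all_online([0.99] * 100 + [1 / 3]), 3))
```

Even with every system at 99%, a broadcast across 100 of them reaches the full federation only about 37% of the time.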
[Theatrical pause]
Just how sure am I that federated search cannot handle discoverability at scale? How about this: First person to describe a scalable federated search system that delivers on the goals and overcomes these technical challenges … in a practical way e.g., without having to re-host source system hardware … I’ll write you a personal check for $25,000 (see small print below).
So, if federated search is not the ideal approach for discoverability at scale, then what is?
Discovery at scale is best solved with some form of central directories or indexes. That is how Google does it (queries hit the Google indexes which return pointers). That is how the DNS works (queries hit a hierarchical set of directories which return pointers). And this is how people locate books at the library (the card catalog is used to reveal pointers to books).
Once a directory reveals a pointer, you can go fetch it. Federated fetch does scale. Yes, the source system will have to be on-line, in the same way the floor at the library must be open. Yes, the user will have to have access privileges. And yes, there are other challenges like the need to keep the directory current and semantically reconciled (to overcome the recursive issues described in Challenge 3). But, at least these are all tractable problems!
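Here is a minimal sketch of the directory-plus-fetch pattern, in the spirit of the card catalog. The `Directory` class, the `fetch` function, and the sample records are all hypothetical, and the directory indexes only one exact feature. The sketch makes no claim about how any real directory service is built.

```python
from collections import defaultdict

class Directory:
    """Hypothetical central index: maps a feature to pointers, not to the data."""

    def __init__(self):
        self._pointers = defaultdict(set)

    def publish(self, system, key, features):
        """Record that (system, key) holds each of these features."""
        for feature in features:
            self._pointers[feature].add((system, key))

    def lookup(self, feature):
        """Return pointers only; no source system is touched."""
        return sorted(self._pointers.get(feature, ()))

def fetch(sources, pointer):
    """Federated fetch: go to the one system that holds the record."""
    system, key = pointer
    return sources[system][key]

# Invented source systems and records.
sources = {
    "payroll": {"emp-42": {"name": "A. Example", "passport": "X1234567"}},
    "watchlist": {"w-7": {"name": "A. Exampel", "passport": "X1234567"}},
}
directory = Directory()
for system, records in sources.items():
    for key, record in records.items():
        directory.publish(system, key, {("passport", record["passport"])})

pointers = directory.lookup(("passport", "X1234567"))
print(pointers)
print([fetch(sources, pointer)["name"] for pointer in pointers])
```

Discovery costs one directory lookup no matter how many source systems exist. Only the systems actually named by a pointer are asked for anything, and they are asked by their own native key.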
Truthfully, I would love to be proven wrong here for a variety of reasons e.g., the privacy ramifications of having large centralized database directories. Although, on the brighter side, the directory approach to discoverability results in fewer copies of the data floating around. And another plus may be that data governance (accountability, oversight, immutable audit logs, etc.) is going to be vastly easier to manage with a smaller number of central directories.
[Small Print: Offer good for two years from the date of this posting. If you have a solution in mind no need to physically prove it, just explain it on paper in plain English such that the average propeller-head can read it and go “oh yeah, that would work.” But, don’t spend too much time on this as it’s obviously not a fair challenge. I’m just trying to make a point as it seems a number of organizations, each desperate to quickly solve large scale discoverability, are being sold on the notion of federated search. An absolute waste of money.]
RELATED POSTS
Federated Discovery vs. Persistent Context – Enterprise Intelligence Requires the Later
To Know Semantic Reconciliation is to Love Semantic Reconciliation
It’s All About the Librarian! New Paradigms in Enterprise Discovery and Awareness
Discoverability: The First Information Sharing Principle
Smart Sensemaking Systems, First and Foremost, Must be Expert Counting Systems
Mon, 05/31/2010 - 01:42
I wrote an article with the above title. This article has since been published in the proceedings of the International Risk Assessment and Horizon Scanning Symposium 2010 (IRAHSS) in Singapore.
[Opening Excerpt]
Man continues to chase the notion that systems should be capable of digesting daunting volumes of data and making sufficient sense of this data such that novel, specific, and accurate insight can be derived without direct human involvement. While there are many major breakthroughs in computation and storage, advances in sensemaking systems have not enjoyed the same significant gains.
This article suggests that the single most fundamental capability required to make a sensemaking system is the system’s ability to recognize when multiple references to the same entity (often from different source systems) are in fact the same entity. For example, it is essential to understand the difference between three transactions carried out by three people versus one person who carried out all three transactions. Without the ability to determine when entities are the same, it quickly becomes clear that sensemaking is all but impossible.
Full article here.
I find most organizations have underestimated this principle: If a system cannot count, it cannot predict. While I covered this point in some detail in a previous post, this new article is more complete and has a section entitled Expert Counting Systems: Essential Ingredients For Sensemaking which covers such issues as:
- Expert counting engines should not rely on training data.
- Counted entities should accumulate features.
- Entities believed to be the same should be asserted as same.
- Expert counting benefits from favoring the false negatives.
- New observations should reverse earlier assertions.
- Full attribution/pedigree of each observation should be maintained.
- It should be fast in order to digest the historical data.
- It should be real time so that counting assertions can be made as the transaction is happening, in time to do something about it.
Anyway, long story short, expert counting is non-trivial, especially at scale, and lots more must be done in this area.
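The “three transactions by three people versus one person” distinction from the excerpt can be shown with a deliberately naive counter. This is not the method described in the article. The `count_entities` function is a hypothetical sketch that merges records on any exact shared feature and follows those links transitively, which ignores most of the essential ingredients listed above, favoring false negatives and reversing earlier assertions among them.

```python
def count_entities(records):
    """Count distinct entities, treating records that share a feature as one."""
    parent = {record_id: record_id for record_id in records}

    def find(record_id):
        while parent[record_id] != record_id:
            parent[record_id] = parent[parent[record_id]]  # path halving
            record_id = parent[record_id]
        return record_id

    first_holder = {}
    for record_id, features in records.items():
        for feature in features:
            if feature in first_holder:
                # Same feature seen before: assert the two records are the same.
                parent[find(record_id)] = find(first_holder[feature])
            else:
                first_holder[feature] = record_id
    return len({find(record_id) for record_id in records})

# Invented transactions: no shared features vs. a chain of shared features.
three_people = {
    "tx1": {("passport", "A1")},
    "tx2": {("passport", "B2")},
    "tx3": {("passport", "C3")},
}
one_person = {
    "tx1": {("passport", "A1"), ("phone", "555-0100")},
    "tx2": {("phone", "555-0100"), ("address", "12 Elm St")},
    "tx3": {("address", "12 Elm St")},
}
print(count_entities(three_people), count_entities(one_person))
```

Both inputs contain three transactions, yet the counts differ. Any downstream prediction depends on which count the system believes. Note that tx1 and tx3 share nothing directly; they are joined only because tx2 accumulated features linking to both.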
Miscellaneous Note: Over the years I’ve sometimes used the term Semantic Reconciliation (recognizing two things are the same despite having been described differently) to describe counting. And, many have heard me or others using the term Entity Resolution or Identity Resolution. Yes, more words that relate to counting … especially with respect to people or organizations: is this about one person or two? Unfortunately, trying to explain these terms to non-technical people has been a bit of work, so now in an attempt to make the concept more consumable … maybe the term “Expert Counting” is an improvement.
RELATED POSTS
Asserting Context: A Prerequisite for Smart, Sensemaking Systems
To Know Semantic Reconciliation is to Love Semantic Reconciliation
Context: A Must-Have and Thoughts on Getting Some …
Entity Resolution Systems vs. Match Merge/Merge Purge/List De-duplication Systems
Sequence Neutrality in Information Systems
Big Breakthrough in Performance: Tuning Tips for Incremental Learning Systems
On A Smarter Planet … Some Organizations Will Be Smarter-er Than Others





