Google changes the algorithm; nothing new but what about the bias of coders?

Here is the thinking, which has wider implications than a small change at Google….

If you took a complex algorithm and asked a 15-year-old, a 30-year-old and a 65-year-old, male and female, from different countries, using different programming languages and compilers, to cut some code, would you get the same output from the same test datasets across the different implementations of the algorithm? Probably not!
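
A minimal sketch of the point, assuming nothing more than two coders who both "sum a list of numbers" in slightly different ways. The function names and the dataset are hypothetical, chosen only for illustration.

```python
import math

def total_running_sum(values):
    """One coder's version: add each value to a running total."""
    total = 0.0
    for v in values:
        total += v
    return total

def total_correctly_rounded(values):
    """Another coder's version: use math.fsum, which tracks lost precision."""
    return math.fsum(values)

# The same test dataset for both implementations.
data = [0.1] * 10

naive = total_running_sum(data)
careful = total_correctly_rounded(data)

print(naive)             # 0.9999999999999999
print(careful)           # 1.0
print(naive == careful)  # False
```

Both versions are defensible readings of the same algorithm, yet the outputs differ in the last digit. Scale that up to a ranking algorithm with thousands of such choices and the coder's interpretation starts to matter.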

So changing the algorithm is one thing; changing compilers (and who coded them), the language, and the age, sex and experience (life and skills) of the coders is another……but we depend on them.

Yes, there are tools to help ensure maintainability, supportability, scalability, performance and conformity, but we have a massive and increasing reliance on the coders’ ethics and lack of bias in the way they interpret an algorithm……just wondering who else is thinking about this.

Why is this important to digital footprints? Someone you don’t know is taking your data and predicting your future based on their own and others’ biases…..

-------

From the Google blog Ten recent algorithm changes

11/14/11 | 8:30:00 AM

Today we’re continuing our long-standing series of blog posts to share the methodology and process behind our search ranking, evaluation and algorithmic changes. This summer we published a video that gives a glimpse into our overall process, and today we want to give you a flavor of specific algorithm changes by publishing a highlight list of many of the improvements we’ve made over the past couple weeks.

We’ve published hundreds of blog posts about search over the years on this blog, our Official Google Blog, and even on my personal blog. But we’re always looking for ways to give you even deeper insight into the over 500 changes we make to search in a given year. In that spirit, here’s a list of ten improvements from the past couple weeks:

  • Cross-language information retrieval updates: For queries in languages where limited web content is available (Afrikaans, Malay, Slovak, Swahili, Hindi, Norwegian, Serbian, Catalan, Maltese, Macedonian, Albanian, Slovenian, Welsh, Icelandic), we will now translate relevant English web pages and display the translated titles directly below the English titles in the search results. This feature was available previously in Korean, but only at the bottom of the page. Clicking on the translated titles will take you to pages translated from English into the query language.
  • Snippets with more page content and less header/menu content: This change helps us choose more relevant text to use in snippets. As we improve our understanding of web page structure, we are now more likely to pick text from the actual page content, and less likely to use text that is part of a header or menu.
  • Better page titles in search results by de-duplicating boilerplate anchors: We look at a number of signals when generating a page’s title. One signal is the anchor text in links pointing to the page. We found that boilerplate links with duplicated anchor text are not as relevant, so we are putting less emphasis on these. The result is more relevant titles that are specific to the page’s content.
  • Length-based autocomplete predictions in Russian: This improvement reduces the number of long, sometimes arbitrary query predictions in Russian. We will not make predictions that are very long in comparison either to the partial query or to the other predictions for that partial query. This is already our practice in English.
  • Extending application rich snippets: We recently announced rich snippets for applications. This enables people who are searching for software applications to see details, like cost and user reviews, within their search results. This change extends the coverage of application rich snippets, so they will be available more often.
  • Retiring a signal in Image search: As the web evolves, we often revisit signals that we launched in the past that no longer appear to have a significant impact. In this case, we decided to retire a signal in Image Search related to images that had references from multiple documents on the web.
  • Fresher, more recent results: As we announced just over a week ago, we’ve made a significant improvement to how we rank fresh content. This change impacts roughly 35 percent of total searches (around 6-10% of search results to a noticeable degree) and better determines the appropriate level of freshness for a given query.
  • Refining official page detection: We try hard to give our users the most relevant and authoritative results. With this change, we adjusted how we attempt to determine which pages are official. This will tend to rank official websites even higher in our ranking.
  • Improvements to date-restricted queries: We changed how we handle result freshness for queries where a user has chosen a specific date range. This helps ensure that users get the results that are most relevant for the date range that they specify.
  • Prediction fix for IME queries: This change improves how Autocomplete handles IME queries (queries which contain non-Latin characters). Autocomplete was previously storing the intermediate keystrokes needed to type each character, which would sometimes result in gibberish predictions for Hebrew, Russian and Arabic.

If you’re a site owner, before you go wild tuning your anchor text or thinking about your web presence for Icelandic users, please remember that this is only a sampling of the hundreds of changes we make to our search algorithms in a given year, and even these changes may not work precisely as you’d imagine. We’ve decided to publish these descriptions in part because these specific changes are less susceptible to gaming.

For those of us working in search every day, we think this stuff is incredibly exciting -- but then again, we’re big search geeks. Let us know what you think and we’ll consider publishing more posts like this in the future.

Are you more than a social graph?

If the web is a "Social" something, then this equals Facebook, Xing, Twitter, LinkedIn and Google+.

But social could mean....

"see what your friends are searching, buying, watching, liking, saying or doing"

"buy together and recommend"

"filtered by who you know"

"what's trending"

"where are my friends right now or where will they be"

A social graph is a digital map that says, "This is who I know." It may reflect people whom the user knows in various ways: as family members, work colleagues, peers met at a conference, high school classmates, fellow cycling club members, friends of a friend, etc. Social graphs are mostly created on social networking sites like Facebook and LinkedIn, where users send reciprocal invites to those they know, in order to map out and maintain their social ties.

And an interest graph is a digital map that says, "This is what I like." As Twitter's CEO has remarked, if you see that I follow the San Francisco Giants on Twitter, that doesn't tell you if I know the team's players, but it does tell you a lot about my interest in baseball. Interest graphs are generated by the feeds customers follow (e.g. on Twitter), products they buy (e.g. on Amazon), ratings they create (e.g. on Netflix), searches they run (e.g. on Google), or questions they answer about their tastes (e.g. on services like Hunch).
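
As a rough sketch of the two structures described above: a social graph can be held as reciprocal (undirected) ties, and an interest graph as one-way edges from a person to a thing they like. The names `connect`, `follow` and `shared_interest_strangers`, and the sample people, are hypothetical and not taken from any real service.

```python
from collections import defaultdict

# Social graph: "This is who I know". Ties are reciprocal, so store both directions.
social = defaultdict(set)

def connect(a, b):
    social[a].add(b)
    social[b].add(a)

# Interest graph: "This is what I like". One-way edges from person to topic.
interests = defaultdict(set)

def follow(person, topic):
    interests[person].add(topic)

def shared_interest_strangers(person):
    """People who share an interest with `person` but are not in their social graph."""
    mine = interests.get(person, set())
    known = social.get(person, set())
    return {
        other
        for other, topics in interests.items()
        if other != person and other not in known and topics & mine
    }

connect("alice", "bob")
connect("bob", "carol")
follow("alice", "San Francisco Giants")
follow("carol", "San Francisco Giants")

print(shared_interest_strangers("alice"))  # {'carol'}
```

Following the Giants says nothing about who alice knows, but combining the two graphs surfaces carol, someone alice has never met who shares her interest.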

However, where is the value? It has to be in the mashup, the combination of all social data, so that I can determine who influences you and who you influence: where your power lies and how much you are worth to a brand.......

Google doesn't want your identity - it wants the data that gives you identity

Image001

“It’s official: Google wants to own your online identity” is an article from GigaOm (http://gigaom.com/2011/08/29/its-official-google-wants-to-own-your-online-identity/), and this post uses the same image from Kat B Photography.

So Schmidt told it like it is at Edinburgh: an “identity service” unlocks the ability to do the trade, and everyone goes into meltdown. Why are you shocked that Google+ is about more than competing with Facebook? As covered in numerous posts here previously, (social) signals are a critical part of Big Data, but signals from real, authenticated, trusted people with an identity mean that you undertake a real "trade".

Now let's not get sidelined by Real Name policy issues and the wider political implications; let's just focus on the "trade or barter". You give up data for access to FREE services, but if the data cannot be tied to you, its value is smaller than if they know who you are. If they know who you are, the balance of value is firmly with the holder.

The issue is not about being (or becoming) an Identity Gatekeeper, as that will end in a regulatory quagmire, and in reality you cannot own an identity, just as you cannot demand faith, command trust or request a reputation. Therefore, let's assume a world in which there is an economy where real people with real cash want to spend said real cash on real products and services; then knowing who you are could, kind of, be helpful.

This is not about identity but about how you trade for goods. Imagine a token with your face on it which represents your ability to trade. It is called money.

The value for content is more complex than just context...

Image001

This is a follow-up to a discussion with my friend and fellow professional Nicky Hickman about our favourite topics: identity and where value originates. We were doing the rounds on the value of social media data, signals, pulses and waves, and I got thinking about this slightly dated 2005 Mobile Web 2.0 diagram. Ajit and I wrote about the changing nature of who was creating content and how there was a shift in the balance of power from professional to consumer, something the editor of our book did not agree with at the time, but that is another story! The chart shows how events accumulate different value depending on time and how they are consumed.

Why go back and rethink.... One of the four value points was "new" but this was about new content (value based on consumption) and not about social content (the continual stream of personal data) which is about signals that create spikes, pulses, waves and trends.  There is a realisation now that your social data (data about you, your location, your movements, your family, your preferences, your recommendations etc) has more value than your content but also this data (signals, spikes, pulses, waves and trends) are what make you unique and together form an identity. 

The question however is who is the best signal generator? And who is the best provider of analysis of your data to produce value?

Something I did not realise Google was doing - clever or creepy?

Image002

According to Samy Kamkar on this Blog post - and having tried it, it is spot on.

android map exposes the data that Google has been collecting from virtually all Android devices and street view cars, using them essentially as global wardriving machines. 

When the phone detects any wireless network, encrypted or otherwise, it sends the BSSID (MAC address) of the router along with signal strength, and most importantly, GPS coordinates up to the mothership. This page allows you to ping that database and find exactly where any wi-fi router in the world is located. 

You can enter any router BSSID/MAC address to locate the exact physical location below, or try his demonstration router by hitting "Probe"
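
To make the mechanics concrete, here is a minimal sketch of how reports like these could be turned into a router location. It is an assumption for illustration only: neither Google nor Samy Kamkar describes the actual method, and the function name, the sample BSSID and the signal-weighted averaging are all hypothetical.

```python
from collections import defaultdict

def estimate_router_positions(observations):
    """Estimate one (lat, lon) per router from phone reports.

    observations: iterable of (bssid, rssi_dbm, lat, lon) tuples.
    Stronger signals are assumed to be closer, so they get more weight.
    """
    # Per router: [total weight, weighted latitude sum, weighted longitude sum]
    sums = defaultdict(lambda: [0.0, 0.0, 0.0])
    for bssid, rssi_dbm, lat, lon in observations:
        weight = 10 ** (rssi_dbm / 10)  # convert dBm to linear power (milliwatts)
        acc = sums[bssid.lower()]
        acc[0] += weight
        acc[1] += weight * lat
        acc[2] += weight * lon
    return {b: (wlat / w, wlon / w) for b, (w, wlat, wlon) in sums.items()}

# Two phones report the same (made-up) router at equal signal strength.
reports = [
    ("00:11:22:33:44:55", -50, 51.0, 0.0),
    ("00:11:22:33:44:55", -50, 51.2, 0.2),
]
positions = estimate_router_positions(reports)
print(positions["00:11:22:33:44:55"])
```

With equal signal strengths the estimate is simply the midpoint of the two reports; the more phones that pass by, the tighter the fix becomes.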

------

I tried it personally and it is 100% spot on, as per my image. I then looked up the IP address, and this told me who owns the IP pipe and most likely the company I am sitting at. I could do the same with a search at Companies House and the address, or you could just phone me up and ask me.

Am I worried that I can be tracked? No. Am I worried that someone could exploit this raw data? Maybe. Am I worried that the data and its subsequent analysis is hidden and I cannot get access to it? Probably. Am I worried that I cannot own my own data, signals and value? Yes.

The social genome: Could the Real time web do for Retail as advertising did for Search?

The original article is by Ajit Jaokar

Could the Real time web do for Retail as advertising did for Search?

Ajit's blog arises from two recent conversations....

- Earlier this week, I was in Brussels and discussed the future of the Web in a number of conversations with MEPs, based on the significance of the Real Time Web, and

- I blogged about the significance of the Real time Web in conversation with @tonia_ries, organizer of The Real Time Report conference in New York, which I am attending in June.

Tonia came up with a succinct equation: value (for content) = time + place + shared interest

Coming from a background of mobile and social media, I have seen Telecom Operators drool for a long time at the idea of the proverbial ‘starbucks model’. While Starbucks never launched any such service as far as I know, the model went like this: When a customer passed near a Starbucks, they would get an SMS offering them 10% off the price of coffee. Telecoms, with its relatively closed mindset, could never launch such a service, but assuming you had a permission-based relationship with the retailer, the model is viable.

It needs:

a) Customers to trust you

b) Access to real time data and historical data

c) Awareness of context

d) An open ecosystem (else you have small silos of data and customers which make it unviable)

e) Real time interactions
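
The core of the model above can be sketched in a few lines: check permission first, then check proximity. This is a sketch under stated assumptions, not how any operator or retailer implemented it; the field names (`opted_in`, `lat`, `lon`), the 200 m radius and the coordinates are hypothetical.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

def should_send_offer(customer, store, radius_km=0.2):
    """Send the offer only to customers who opted in AND are near the store."""
    if not customer["opted_in"]:
        return False  # no permission-based relationship, no message
    distance = haversine_km(customer["lat"], customer["lon"], store["lat"], store["lon"])
    return distance <= radius_km

store = {"lat": 51.5007, "lon": -0.1246}
nearby_fan = {"opted_in": True, "lat": 51.5007, "lon": -0.1246}
nearby_stranger = {"opted_in": False, "lat": 51.5007, "lon": -0.1246}
far_fan = {"opted_in": True, "lat": 52.0, "lon": -0.1246}

print(should_send_offer(nearby_fan, store))       # True
print(should_send_offer(nearby_stranger, store))  # False
print(should_send_offer(far_fan, store))          # False
```

The code is trivial; the hard parts are the items on the list: trust, access to the location data and an open ecosystem to get it from.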

The prevailing thinking was:

Google could be the store of all this data and that we, as customers, will give up all our data to Google

OR

The Telecom Operator would know who you are and where you are. They would be the providers of this and provide that information in real time via SMS (and be paid by the retailer, of course)

But customers were not that stupid, and maybe neither were the retailers!

Retailers may finally have woken up from their apathy and decided that they need not simply abdicate the relationship they share with the customer to either Telecoms or to Google. The real time web may provide an alternative that lets them play on their existing strengths but still leverage the open ethos of the Web.

It appears that customers are not choosing a single web brand for various services but rather that they are choosing different brands for distinct services, e.g. Twitter for real time web, Facebook for social, Foursquare for check-ins and Google for search. Today, Google is far from dominating at least three incarnations of the Web post Google: the Real time Web, the Social web and the ‘Location Web’ (check-ins), which explains Google’s recent emphasis on winning the social web.

Looking at it from a customer standpoint, How do we define value?

Value could be either

a) The customer pays for something that they find useful (traditional definition of value)

b) The customer gets something useful for free in return for advertisements + relinquishing some control of their data (Google)

c) The customer gets information that is actionable in real time in return for data

The Web provided one form of indirect monetization through the advertising model. But the advertising model does not suit all providers (although it always suits Google). The real time web could provide an alternative for retail as advertising did for search.

There is increasing evidence for this: Wal-Mart may have paid $300M+ for Kosmix. Kosmix appears to be a mixture of three things: TweetBeat, RightHealth and a web service to explore the web by topic. But the premium for Kosmix may be for the underlying ‘social genome’ technology.

In the announcement blog post, the founder, Anand Rajaraman, says it’s the “social genome” technology underlying the company’s products: Conversations in social media revolve around “social elements” such as people, places, topics, products, and events. For example, when I tweet “Loved Angelina Jolie in Salt,” the tweet connects me (a user) to Angelina Jolie (an actress) and SALT (a movie). By analyzing the huge volume of data produced every day on social media, the Social Genome builds rich profiles of users, topics, products, places, and events.
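
A toy sketch of the idea in that example, assuming a tiny hand-built dictionary of known entities and naive substring matching. The real Social Genome is not public, so `KNOWN_ENTITIES`, `link_entities` and the profile structure are hypothetical stand-ins.

```python
from collections import defaultdict

# Hypothetical lookup of "social elements"; a real system would need far more
# than a dictionary (disambiguation, context, scale).
KNOWN_ENTITIES = {
    "angelina jolie": "actress",
    "salt": "movie",
}

def link_entities(user, text, profiles):
    """Connect a user to every known entity mentioned in their message."""
    lowered = text.lower()
    for name, kind in KNOWN_ENTITIES.items():
        if name in lowered:  # naive match: would also fire on "salty"
            profiles[user].add((name, kind))
    return profiles

profiles = defaultdict(set)
link_entities("me", "Loved Angelina Jolie in Salt", profiles)

print(sorted(profiles["me"]))
# [('angelina jolie', 'actress'), ('salt', 'movie')]
```

One tweet yields two edges from the user. Repeat that across every message, every day, and a rich profile builds up without the user ever filling in a form.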

Wal-Mart wants to bring this technology to shoppers, offering them “integrated experiences that leverage the store, the web, and mobile, with social identity being the glue that binds the experience,” Rajaraman says.

If this is accurate, then it is indeed possible that the Real time web could do for retail what advertising did for search.

Note: In this blog, I use the terms ‘Real time web’ and the ‘Real time internet’ loosely and interchangeably. The objective is simply to focus on real time interactions and I use both terms to signify the same thing.

If you are attending the Realtime report NY 2011 event on June 6 at BB King’s in Times Square NY, say hello to Ajit

Last week I wrote “Do I want control over the things that have value or my privacy settings?” The crux of it is that to deliver real time I need to release some control over my privacy, but this does not give me access to the value.

Theory is great, but technology is ahead of experience and law, which in itself is not a bad thing, but it needs to be said.

Do I want control over the things that have value or my privacy settings?

Image002

Signals, spikes, pulses and waves are data that represent your real-time life, and controlling them is game changing. The data that underpins signals, spikes, pulses and waves sits in silos controlled by the likes of social media, TV, mobile and credit card companies who really don’t want to share, even though it is your data and you could benefit.

Privacy as control

There is a battle being fought for the protection of your data under the banner of privacy, and it tends to focus on your actual raw data (think tweets, location, blogs, updates or Facebook status). But do I really care about this short-term, transient, real-time data, or should I care more about what someone can do with the data? Yes, control of the source prevents access, but it does not lead to control of the outcome. And when I consider that my data, on its own, has little value until it is put in context with data from others in my social group, it changes my perception.

Consider the central part of the diagram. With only a few exceptions, we are not able to create a trigger (spike) that will get a lot of people talking and create a trend. A spike needs the community to create reflections (signals) and refinement (pulses) before we come back to stable again (underlying signals), maybe at a different place, but stable. All this raw data is real-time, fast changing and jolly interesting, but how does it affect my slowly built and refined reputation, influence, authority, relevancy, preference, credibility, trust and reach? These are what I really worry about; these create value for me, and what I do want is to control and protect them.

Knowing that real-time social media interactions with spikes, pulses, waves and signals help to refine these valuable personal assets, who is creating this value for me and how do I gain control?

Here’s the issue.

Privacy allows you to control what data is shared. If you share nothing, the only reputation, authority or influence you have is from what others say (not bad if you can get it). However, once your data is out there (assuming it is not trapped in a silo), anyone can do anything with it, and there are many companies trying to create services that determine, amongst other things, influence. The algorithms these companies use will continually pick up your data as part of your interactions with spikes, signals, pulses and waves and use it to refine how you should be ranked. But because the data is freely available and they publish their view of you freely, you have no control over where it goes, up or down.

PageRank

PageRank was ideal when the need was for slow, authority-driven search. If you understood the algorithm you could affect your standing in the results, hence a good reason to keep it hidden to avoid bias. The same concept would appear to apply to digital influence. If you know how the algorithm works and what data it is looking for, you can quickly modify your behaviour and rise up their rankings, leagues, ladders and competitions. In these cases, your influence and reputation are primarily built on what you say about yourself. PageRank was built on authority (links from other pages), and in this real-time web we have moved from the authority of others to what you say about yourself as the prime driver, which could be a little dangerous. But if others are unwilling to share their data about themselves or about you, we may grind to a halt.
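
For readers who have not seen it, here is a minimal sketch of the textbook, simplified form of PageRank (power iteration with a damping factor). It is not Google's production ranking; the three-page `web`, the damping value of 0.85 and the iteration count are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank. links maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal rank everywhere
    for _ in range(iterations):
        # Every page gets a small baseline, then shares passed along links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical web: a and b both link to c, c links back to a, nobody links to b.
web = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(web)
print(sorted(ranks, key=ranks.get, reverse=True))  # ['c', 'a', 'b']
```

Page b can say whatever it likes about itself; with nobody linking to it, it stays at the bottom. That is the "authority of others" that the real-time influence scores are in danger of losing.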

So What

Privacy protects the input but does not give you control of the value

Not sharing (staying private) means we depend more on what some say about themselves which brings about a significant bias

Real time data should be thrown away fast and ranked the lowest in terms of changing important value services

Determining reputation, influence, authority, relevancy, preference, credibility, trust and reach is slow and hard in both the real and digital world.

I would welcome views and input on this….tony

Real time search for spikes, pulses, waves and trends is a game changer

Image001

This is a diagram from Ajit Jaokar and my book on “Mobile Web 2.0” way back in 2006 when social was new and search was stable.

The reason I have dug this up is the fundamental changes happening in search. In old currency terms, search was about indexing the web, making results available and using preferences to improve the relevancy of results. There was a significant delay (i.e. it was not real time) between content becoming available on-line, being indexed and then being found via search. A fundamental requirement for getting onto page 1 of a search was to be referenced by quality and/or authoritative sources. If your desire was for authority, this model worked!

The web has become closer to a real-time experience with the advent of services such as Twitter and Facebook, and these services are affecting how we discover and interact with content.

Search now needs to help you discover spikes, pulses, waves and trends. This is a whole new problem, and it is likely to change who we see as key players, as this real-time web world needs access to new data and collection which some will fight to keep in their controlling silo.

What are the inputs for social signals and digital footprints?

We are on the lookout in the digital world for social signals, spikes, pulses, waves and trends, and how they combine to deliver value propositions such as authority, influence and reputation.

In the physical world we are trying to take a known set of inputs and create algorithms so we can interpret what is happening. However, human interaction is not just about the here and now but has deep dependencies on personal history, which introduces a degree of randomness.

It is no easier to look for these interpretations when we start to look in detail at what digital data could tell us. In the book I took a long, deep look at what you can do with this data and concluded the value was in the analysis, not in the collection.

Whilst there are some obvious differences between physical social signals and digital ones, an important one is about collection of data: in the physical world you have to collect face, distance and audio data before you can start to work out what is going on.

Social Signal (physical)

  • Gesture
  • Distance
  • Posture
  • Height
  • Gaze
  • Vocal behaviour

Social Signals & Digital Footprint inputs

  • Facebook Like or Google +1 responses
  • Number of friends & connections
  • Number of mentions & replies
  • On-line history and IP data
  • Appearance in lists
  • Favourites
  • Blogs, tweets & comments
  • Recommendations from you
  • Location and Routes & routines
  • Attitude to privacy settings
  • 3rd party reviews of your data
  • Email, SMS, BBM and IM data
  • Unfriending & unfollow
  • Personal details
  • TV and viewing history
  • Purchase history
  • Preferences
  • Attention
  • Clicks & key strokes
  • Device history
  • Words & grammar
  • Recommendations about you

What are the definitions for social signals, pulses and waves ?

This is about social media using engineering terms to try and define/ categorise patterns being seen or looked for in your data.

For the purposes of this blog I am currently defining the following:-

Social signal (physical) - think physical behavioural signals you give off when interacting. A seemingly erratic behaviour that is routine, regular, repeatable and actually has a defined pattern irrespective of who you are.

Social signal (digital) - think digital behavioural signals that are a continual feed from digital interactions.

Social spike - think spike from a crowd doing the same thing for a short time and then moving on.

Social pulse - think regular pattern or behaviour from a crowd when stimulated.

Social wave - think growing sentiment of change from a crowd doing something different and moving to a new normal.

Social trend - think underlying slow change in the crowd.
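
As a first, rough attempt to make two of these definitions testable, the sketch below tells a spike from a wave by asking one question: after the burst of activity, did the crowd go back to its old level or settle at a new normal? The function name, the 25% tolerance and the sample counts are my own assumptions, and it assumes a non-zero level of activity before the burst.

```python
from statistics import mean

def classify_burst(series, start, end, tolerance=0.25):
    """Classify the burst in series[start:end] as a 'spike' or a 'wave'.

    series: activity counts per time period (e.g. mentions per hour).
    The levels before and after the burst are compared.
    """
    before = mean(series[:start])
    after = mean(series[end:])
    if abs(after - before) <= tolerance * before:
        return "spike"  # the crowd moved on and returned to the old level
    return "wave"       # the crowd settled at a new normal

# Hypothetical mentions per hour.
moved_on = [10, 11, 9, 10, 80, 95, 70, 10, 11, 10]
new_normal = [10, 11, 9, 10, 40, 60, 55, 50, 52, 51]

print(classify_burst(moved_on, 4, 7))    # spike
print(classify_burst(new_normal, 4, 7))  # wave
```

Pulses and trends would need more history than this: a pulse is a repeat of the same pattern on each stimulus, and a trend is a slow drift in the "before" level itself.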