Tag: Architecture

NOTE: SAS "Inside" of Hadoop

We previously looked at SAS Grid Manager for Hadoop, which brings workload management, accelerated processing, and scheduling to a Hadoop environment. This was introduced with the M3 maintenance release of SAS v9.4. M3 also introduced support for using…

NOTE: HTML 5 is in VA Hub Already!

Aside from comments about my SAS Enterprise Guide vs SAS Studio article, Metacoda’s Michelle Homes (@HomesAtMetacoda) was quick to write a comment about my Flash & SAS Visual Analytics (VA) article and to point out that HTML5 is already an option f…

NOTE: Your Response: EG & Studio

As Mark Twain is oft (incorrectly) quoted as saying: “Reports of my death are much exaggerated”. I didn’t say that Enterprise Guide (EG) was anywhere close to death when I (contentiously) wrote NOTE: What is SAS Studio? RIP Enterprise Guide? but I…

Hadoop is the New Black

It feels like any SAS-related project in 2015 not using Hadoop is simply not ambitious enough. The key question seems to be “how big should our Hadoop cluster be” rather than “do we need a Hadoop cluster”.

Of course, I’m exaggerating: not every project needs to use Hadoop. But there is an element of new thinking required when you consider what data sources are available to your next project and what value they would add to your end goal. Internal and external data sources are easier to acquire, and volume is less and less of an issue (or, stated another way, you can realistically aim to acquire larger and larger data sources if they will add value to your enterprise).

Whilst SAS is busy moving clients from PC to web, there’s a lot of work being done by SAS to move the capabilities of the SAS server inside of Hadoop. And that’s to minimise “data miles” by moving the code to the data rather than vice-versa. It surely won’t be long before we see SAS Grid and LASR running inside of Hadoop. It’s almost like Hadoop has become a new operating system on which all of our server-side capabilities must be available.
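
As a concrete illustration of the “code to the data” idea, here’s a minimal sketch using the SAS/ACCESS Interface to Hadoop. With a Hadoop LIBNAME in place, the engine translates suitable SQL into HiveQL so the heavy lifting happens in the cluster rather than on the SAS server. The server, schema, credentials and table names below are all hypothetical.

    /* A minimal sketch, assuming the SAS/ACCESS Interface to Hadoop is
       licensed; the server, schema, credentials and table names are all
       hypothetical */
    libname hdp hadoop server="hive-node.example.com" port=10000
                       schema=sales user=myuser password="XXXX";

    /* Optional: show the SQL that gets passed down to Hadoop in the log */
    options sastrace=',,,d' sastraceloc=saslog nostsuffix;

    /* The engine translates this into HiveQL, so the aggregation runs
       inside the cluster and only the summary rows travel back to SAS */
    proc sql;
      create table work.daily_totals as
        select txn_date,
               sum(amount) as total_amount
          from hdp.transactions
          group by txn_date;
    quit;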

We tend to think of Hadoop as being a central destination for data but it doesn’t always start its presence in an organisation in that way. Hadoop may enter an organisation for a specific use case, but data attracts data, and so once in the door Hadoop tends to become a centre of gravity. This effect is caused in no small part by the appeal of big data being not just about the data size, but the agility it brings to an organisation.

SAS’s Senior Director of the EMEA and AP Analytical Platform Centre of Excellence, Mark Torr (that’s one heck of a title, Mark!) recently wrote a well-founded article on the four levels of Hadoop adoption maturity based upon his experiences with many SAS customers. His experiences chime with my far more limited observations. Mark lists the four levels as:

  1. Monitoring – enterprises that don’t yet see a use for Hadoop within their organisation, or are focused on other priorities
  2. Investigating – those at this level have no clear, focused use for Hadoop but they are open to the idea that it could bring value and hence they are experimenting to see where and how it can deliver benefit(s)
  3. Implementing – the first one or two Hadoop projects are the riskiest because there’s little or no in-house experience, and maybe even some negative political undercurrents too. As Mark notes, the exit from Investigating into Implementing often marks the point where enterprises choose to move from the Apache distribution to a commercial distribution, such as Hortonworks, Cloudera or MapR, that offers more industrial-strength capabilities
  4. Established – at this level, Hadoop has become a strategic architectural tool and, given its relative immaturity, organisations are working with their vendors to influence development towards full production-strength capabilities

Hadoop is (or will be) a journey for all of us. Many organisations are just starting to kick the tyres. Of those who are using Hadoop, most are in the early stages of this process at level 2, with a few front-runners living at level 3. Those organisations at level 3 are typically big enough to face, and invest in solutions to, the challenges that the vendors haven’t yet stepped up to, such as managing provenance, data discovery and fine-grained security.

Does anybody live the dream fully yet? Arguably, yes: the internal infrastructures developed at Google and Facebook certainly provide their developers with the advantages and agility of the data lake dream. For most of us, we must be content to continue our journey…


Follow me on Twitter: @aratcliffeuk

NOTE: Reverse-Engineering Technical Debt

I wrote a couple of items about technical debt back in November (here and here). Sometimes you don’t choose to create debt for yourself, sometimes it’s inherited. In technical guises, debt can be inherited when teams merge, for instance. In such circums…

NOTE: Enterprise Guide vs DI Studio – What’s the difference?

A favourite interview question of mine is: Compare and contrast SAS 9’s stored process server and workspace server. This question is very good at revealing whether candidates actually understand some of what’s going on behind the scenes of SAS 9. I mentioned this back in 2010, together with some notes on my expectations for an answer.

I was amused to see Michelle Homes post another of my favourite interview questions on the BI Notes blog recently: What’s the difference between SAS Enterprise Guide and SAS DI Studio? This question, and the ensuing conversation, establishes whether the candidate has used either or both of the tools, and it reveals how much the candidate is thinking about their environment and the tools within.

For me, there are two key differences: metadata, and primary use.

Michelle focuses on the former and gives a very good run-down of the use of metadata in Data Integration Studio (and its far more limited use in Enterprise Guide).

With regards to primary use, take a look at the visual nodes available in the two tools. The nodes in DI Studio are focused upon data extraction, transformation and loading (as you would expect), whilst the nodes in Enterprise Guide (EG) are focused upon analysing data. Sure, EG has nodes for sorting, transposing and other data-related activities (including SQL queries), but its data manipulation nodes are not as extensive as DI Studio’s. In addition to sorting and transposing, DI Studio offers nodes that understand data models, e.g. an SCD loader and a surrogate key generator (I described slowly changing dimensions (SCDs) and other elements of star schema data models in a post in 2009). On the other hand, EG has lots of nodes for tabulating, graphing, charting, analysing, and modelling your data.
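
To make that distinction concrete, here’s a minimal sketch, in plain SAS code, of the type 2 SCD logic that DI Studio’s SCD loader and surrogate key generator automate for you: close off changed dimension rows, then append the new versions with fresh surrogate keys. The table and column names (dw.customer_dim, staging.changed_customers, customer_sk and so on) are all hypothetical.

    /* Step 1: close off current dimension rows whose source attributes
       have changed (the "type 2" part) */
    proc sql;
      update dw.customer_dim as d
        set valid_to = today(), current_flag = 0
        where d.current_flag = 1
          and exists (select 1
                        from staging.changed_customers as s
                        where s.customer_id = d.customer_id);
    quit;

    /* Step 2: generate surrogate keys and append the new row versions */
    proc sql noprint;
      select coalesce(max(customer_sk), 0) into :max_sk
        from dw.customer_dim;
    quit;

    data new_versions;
      set staging.changed_customers;
      customer_sk  = &max_sk + _n_;   /* surrogate key generator */
      valid_from   = today();
      valid_to     = '31DEC9999'd;    /* open-ended high date */
      current_flag = 1;
    run;

    proc append base=dw.customer_dim data=new_versions;
    run;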

One final distinction I’d draw is that each of EG’s nodes is based around a single SAS procedure, whilst DI Studio’s nodes are based around an ETL technique or requirement. You can see that DI Studio was produced for a specific purpose, whilst EG was produced as a user-friendly layer on top of the SAS language and thereby offers a more general-purpose solution.
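
By way of illustration, here’s a sketch of the kind of code an EG task generates: the Summary Statistics task, for example, essentially wraps a single PROC MEANS step (shown here against the SASHELP.CLASS sample data).

    /* Roughly the single-procedure step that EG's Summary Statistics
       task writes for you */
    proc means data=sashelp.class mean std min max;
      class sex;
      var height weight;
    run;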

For the most part, I’m stating the obvious above, but the interview candidate’s answer to the question provides a great deal of insight into their approach to their work, their sense of curiosity and awareness, and their technical insight.


Follow me on Twitter: @aratcliffeuk

See an audiovisual recording of my SAS Global Forum 2013 paper Visual Techniques for Problem Solving and Debugging

More on Technical Debt #2/2

Last week I offered some techniques for management of technical debt. In this post I offer some more. Technical debt is a debt that you incur every time you avoid doing the right thing (like refactoring, removing duplication/redundancy), thereby letting…