This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post. |
“We been through every kind of rain there is. Little bitty stingin’ rain, and big ol’ fat rain, rain that flew in sideways, and sometimes rain even seemed to come straight up from underneath.” Was that a quote from the Forrest Gump movie, or something said regarding Hurricane Florence? Could be either one! Hurricane Florence recently came through […]
The post Hurricane Florence: rainfall totals in the Carolinas appeared first on SAS Learning Post.
This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post. |
SAS Press author Kirk Paul Lafler’s favorite tips using PROC SQL.
The post Three Proc SQL Tips from Bestselling Author Kirk Paul Lafler appeared first on SAS Learning Post.
The post Essential SAS tools to bring to your next hackathon appeared first on The SAS Dummy.
This post was kindly contributed by The SAS Dummy - go there to comment and to read the full post. |
To succeed in any data-focused hackathon, you need a robust set of tools and skills – as well as a can-do attitude. Here’s what you can expect from any hackathon:
In my experience, hackathons are often a great melting pot of different tools and technologies. Whatever tech biases you might have in your day job (Windows versus Linux, SAS versus Python, JSON versus CSV) – these melt away when your teammates show up ready to contribute to a common goal using the tools that they each know best.
At the Analytics Experience 2018 Hackathon, attendees have the entire suite of SAS tools available. From Base SAS, to SAS Enterprise Guide, to SAS Studio, to SAS Enterprise Miner and the entire SAS Viya framework — including SAS Visual Analytics, SAS Visual Text Analytics, SAS Data Mining and Machine Learning. As we say here in San Diego, it’s the whole enchilada. As the facilitators were presenting the whirlwind tour of all of these goodies, I could see the attendees salivating. Or maybe that was just me.
When it comes to getting my hands dirty with unknown data, my favorite path begins with SAS Enterprise Guide. If you know me, this won’t surprise you. Here’s why I like it.
Hackathon data almost always comes as CSV or Excel spreadsheets. The Import Data task can ingest CSV, fixed-width text, and Excel spreadsheets of any version. Of course most “hackers” worth their salt can write code to read these file types, but the Import Data task helps you to discover what’s in the file almost instantly. You can review all of the field names and types, tweak them as you like, and click Finish to produce a data set. There’s no faster method of turning raw data into a SAS data set that feeds the next step.
See Tricks for importing text files and Importing Excel files using SAS Enterprise Guide for more details about the ins-and-outs of this task. If you want to ultimately turn this step into repeatable code (a great idea for hackathons), then it’s important to know how this task works.
Note: if your data is coming from a web service or API, then it’s probably in JSON format. There’s no point-and-click task to read that, but a couple of SAS program lines will do the trick.
The Query Builder in SAS Enterprise Guide is a one-stop shop for data management. Use this for quick filtering, data cleansing, simple recoding, and summarizing across groups. Later, when you have multiple data sources, the Query Builder provides simple methods to join these – merge on the fly.
Before heading into your next hackathon, it’s worth exploring and practicing your skills with the Query Builder. It can do so much — but some of the functions are a bit hidden. Limber up before you hack!
See this paper by Jennifer First-Kluge for an in-depth tour of the tool.
If you’ve never seen your data before, you’ll appreciate this one-click method to report on variable types, frequencies, distinct values, and distributions. The Describe->Characterize Data task provides a good start.
Using SAS Studio? There’s a Characterize Data task in there as well. See Marje Fecht’s paper: Easing into Data Exploration, Reporting, and Analytics Using SAS Enterprise Guide for more about this and other tasks.
“Long” data is typically best for reporting, while “wide” data is more suited for analytics and modeling. The process of restructuring data from long to wide (or wide to long) is called Transpose. SAS Enterprise Guide has special tasks called “Split Data” (for making wide tables) and “Stack Data” (for making long data). Each method has some special requirements for a successful transformation, so it’s worth your time to practice with these tasks before you need them.
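To make the long-versus-wide distinction concrete, here is a minimal Python sketch (not SAS, and not how the Split Data or Stack Data tasks are implemented). The sample rows and the function names `long_to_wide` and `wide_to_long` are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical "long" data: one row per (id, variable, value) triple.
long_rows = [
    ("A", "height", 170), ("A", "weight", 65),
    ("B", "height", 182), ("B", "weight", 80),
]

def long_to_wide(rows):
    """Pivot (id, variable, value) rows into one record per id."""
    wide = defaultdict(dict)
    for key, var, value in rows:
        wide[key][var] = value
    return dict(wide)

def wide_to_long(wide):
    """Stack a {id: {variable: value}} mapping back into long rows."""
    return [(key, var, value)
            for key, cols in wide.items()
            for var, value in cols.items()]

wide = long_to_wide(long_rows)
print(wide)                              # one record per id, one field per variable
print(wide_to_long(wide) == long_rows)   # the round trip restores the long rows
```

The wide form has one record per subject, which is usually what a modeling procedure wants; the long form has one record per measurement, which is usually easier to group and report on.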
The program editor in SAS Enterprise Guide is my favorite place to write and modify SAS code. Here are my favorite tricks for staying productive in this environment, including code formatting.
Have another favorite editor? You can use SAS Enterprise Guide to open your code in your default Windows editor too. That’s a great option when you need to do super-fancy text manipulation. (We won’t go into the “best programming editor” debate here, but I’ve got my defaults set up for Notepad++.)
The hackathon “units of sharing” are code (of course) and data. SAS Enterprise Guide provides several simple methods to share data in a way that just about any other tool can consume:
When it comes to sharing code, you can use File->Export All Code to capture all SAS code from your project or process flow. However, I prefer to assemble my own “standalone” code piecemeal, so that I can make sure it’s going to run the same for someone else as it does for me. To accomplish this, I create a new SAS program node and copy the code for each step that I want to share into it…one after another. Then I test by running that code in a new SAS session. Validating your code in this way helps to reduce friction when you’re sharing your work with others.
The obvious benefit of hackathons is that at the end of a short, intense period of work, you have new insights and solutions that you didn’t have before – and might never have arrived at on your own. But the personal benefit comes in the people you meet and the techniques that you learn. I find that I’m able to approach my day job with fresh perspective and ideas – the creativity keeps flowing, and I’m energized to apply what I’ve learned in my business.
The post Linearly spaced vectors in SAS appeared first on The DO Loop.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post. |
The SAS/IML language and the MATLAB language are similar. Both provide a natural syntax for performing high-level computations on vectors and matrices, including basic linear algebra subroutines. Sometimes a SAS programmer will convert an algorithm from MATLAB into SAS/IML. Because the languages are not identical, I am sometimes asked, “what is the SAS/IML function that is equivalent to the XYZ function in MATLAB?”
One function I am often asked about is the linspace function in MATLAB, which generates a row vector of n evenly spaced points in a closed interval.
Although I have written about how to generate evenly spaced points in SAS/IML (and in the DATA step, too!), the name of the SAS/IML function that performs this operation (the DO function) is not very descriptive. Understandably, someone who browses the documentation might pass by the DO function without realizing that it is the function that generates a linearly spaced vector.
This article shows how to construct a SAS/IML function that is equivalent to the MATLAB linspace function.
Syntactically, the main difference between the DO function in SAS/IML and the linspace function in MATLAB is that the third argument to the DO function is a step size (an increment), whereas the third argument to the linspace function is the number of points to generate in an interval. But that’s no problem: to generate n evenly spaced points on the interval [a, b], you can use a step size of (b - a)/(n - 1). Therefore, the following SAS/IML function is a drop-in replacement for the MATLAB linspace function:
proc iml;
/* generate n evenly spaced points (a linearly spaced vector) in the interval [a,b] */
start linspace(a, b, numPts=100);
   n = floor(numPts);                  /* if n is not an integer, truncate */
   if n < 1 then return( {} );         /* return empty matrix */
   else if n=1 then return( b );       /* return upper endpoint */
   return( do(a, b, (b-a)/(n-1)) );    /* return n equally spaced points */
finish;
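For readers who want to check the step-size arithmetic outside of SAS, the same idea can be sketched in Python. This is an illustration under assumptions, not a translation the article provides: the name `num_pts` and the choice to pin the last point to `b` are mine.

```python
def linspace(a, b, num_pts=100):
    """Return num_pts evenly spaced floats on the closed interval [a, b]."""
    n = int(num_pts)              # truncate a non-integer count
    if n < 1:
        return []                 # no points requested
    if n == 1:
        return [float(b)]         # a single point: the upper endpoint
    step = (b - a) / (n - 1)      # n points means n - 1 equal gaps
    pts = [a + i * step for i in range(n)]
    pts[-1] = float(b)            # pin the endpoint so rounding cannot move it
    return pts

print(linspace(0, 1, 5))
print(len(linspace(-3, 3)))       # default number of points
```

Computing each point as `a + i * step` rather than by repeated addition keeps rounding errors from accumulating along the sequence.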
A typical use for the linspace function is to generate points in the domain of a function so that you can quickly visualize the function on an interval. For example, the following statements visualize the function exp( -x^{2} ) on the domain [-3, 3]:
x = linspace(-3, 3);           /* by default, 100 points in [-3,3] */
title "y = exp( -x^2 )";
call series(x, exp(-x##2));    /* graph the function */
This is a good time to remind everyone of the programmer’s maxim (from Kernighan and Plauger, 1974, The Elements of Programming Style) that “10.0 times 0.1 is hardly ever 1.0.” Similarly, “5 times 0.2 is hardly ever 1.0.” The maxim holds because many finite decimal values in base 10 have a binary representation that is infinite and repeating. For example, 0.1 and 0.2 are represented by repeating decimals in base 2. Specifically, 0.2_{10} = 0.00110011…_{2}. Thus, just as 3 * (0.3333333) is not equal to 1 in base 10, so too is 5 * 0.00110011…_{2} not equal to 1 in base 2.
A consequence of this fact is that you should avoid testing floating-point values for equality. For example, if you generate evenly spaced points in the interval [-1, 1] with a step size of 0.2, do not expect that 0.0 is one of the points that are generated, as shown by the following statements:
z = do(-1, 1, 0.2);
/* find all points that are integers */
idx = loc( z = int(z) );                       /* test for equality (bad idea) */
print (idx // z[,idx])[r={'idx', 'z[idx]'}];   /* oops! 0.0 is not there! */
print z;                                       /* show that 0.0 is not one of the points */
When you query for all values for which z = int(z), only the values -1 and +1 are found. If you print out the values in the vector, you’ll see that the middle value is an extremely tiny but nonzero value (-5.55E-17). This is not a bug but is a consequence of the fact that 0.2 is represented as a repeating value in binary.
So how can you find the points in a vector that “should be” integers (in exact arithmetic) but might be slightly different than an integer in floating-point arithmetic?
The standard approach is to choose a small distance (such as 1e-12 or 1e-14) and look for floating-point numbers that are within that distance from an integer. In SAS, you can use the ROUND function or check the absolute value of the difference, as follows:
eps = 1e-12;
w = round(z, eps);                     /* round to the nearest multiple of eps */
idx = loc( int(w) = w );               /* find points that are within eps of an integer */
print idx;

idx = loc( abs(int(z) - z) < eps );    /* find points whose distance to an integer is less than eps */
print (idx // z[,idx])[r={'idx', 'z[idx]'}];
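The contrast between the exact test and the tolerance test can be reproduced in Python, which also uses IEEE 754 double precision. This is a sketch rather than a port of the SAS/IML statements: it assumes the sequence is built by repeated addition, and the helper names are hypothetical.

```python
def arithmetic_sequence(start, stop, step):
    """Build start, start+step, ... by repeated addition, so rounding errors accumulate."""
    n = round((stop - start) / step) + 1
    z, x = [], float(start)
    for _ in range(n):
        z.append(x)
        x += step
    return z

def exact_integers(z):
    """Indices where the value equals its integer part exactly (fragile test)."""
    return [i for i, x in enumerate(z) if x == int(x)]

def near_integers(z, eps=1e-12):
    """Indices where the value is within eps of the nearest integer (robust test)."""
    return [i for i, x in enumerate(z) if abs(x - round(x)) < eps]

z = arithmetic_sequence(-1, 1, 0.2)
print(exact_integers(z))   # the middle point is missing from this list
print(near_integers(z))    # the tolerance test recovers it
print(z[5])                # tiny but nonzero
```

Note that the Python version measures the distance to `round(x)` rather than to the truncated value, so a point that lands slightly below an integer is still detected.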
In summary, this article shows how to define a SAS/IML function that is equivalent to the MATLAB linspace function. It also reminds us that some finite decimal values (such as 0.1 and 0.2) do not have finite binary representations. When these values are used to generate an arithmetic sequence, the resulting vector of values might be different from what you expect. A wise practice is to never test a floating-point value for equality, but instead to test whether a floating-point value is within a small distance from a target value.
This post was kindly contributed by platformadmin.com - go there to comment and to read the full post. |
This tip was prompted by a SAS Communities question which I hear from time to time, essentially “How do I find out which groups a SAS user is a Portal Group Content Administrator for?” It can be answered using the Metacoda Identity Permissions Explorer but involves a few steps so I will outline them here. … Continue reading “Metacoda Plug-ins Tip: User’s Group Content Admin Permissions (Identity Permissions Explorer)”
The post Two interfaces for typing text by using a TV remote control appeared first on The DO Loop.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post. |
Have you ever tried to type a movie title by using a TV remote control? Both Netflix and Amazon Video provide an interface (a virtual keyboard) that enables you to use the four arrow keys of a standard remote control to type letters.
The letters are arranged in a regular grid and to navigate from one letter to another can require you to press the arrow keys many times. Fortunately, the software displays partial matches as you choose each letter, so you rarely need to type the entire title. Nevertheless, I was curious: Which interface requires the fewest number of key presses (on average) to type a movie title by using only the arrow keys?
The following images show the layout of the navigation screen for Netflix and for Amazon Video.
The Netflix grid has 26 letters and 10 numbers arranged in a 7 x 6 grid. The letters are arranged in alphabetical order.
The first row contains large keys for the Space character (shown in a light yellow color) and the Backspace key (dark yellow). Each of those keys occupies three columns of the grid, which means that you can get to the Space key by pressing Up Arrow from the A, B, or C key.
When you first arrive at the Netflix navigation screen, the cursor is positioned on the A key (shown in pink).
The letters in the Amazon Video grid are arranged in a 3 x 11 grid according to the standard QWERTY keyboard layout.
When you first arrive at the navigation screen, the cursor is positioned on the Q key.
The Space character can be typed by using a large key in the last row (shown in a light yellow color) that spans columns 8, 9, and 10. The Space character can be accessed from the M key (press Right Arrow) or from the K, L, or ‘123’ keys on the second row.
The ‘123’ key (shown in green) navigates to a second keyboard that contains numbers and punctuation.
The numbers are arranged in a 1 x 11 grid.
When you arrive at that keyboard, the cursor is on the ‘ABC’ key, which takes you back to the keyboard that contains letters. (Note: The real navigation screen places the ‘123’ key under the 0 key. However, the configuration in the image is equivalent because in each case you must press one arrow key to get to the 0 (zero) key.) For simplicity, this article ignores punctuation in movie titles.
I recently wrote a mathematical discussion about navigating through grids by moving only Up, Down, Left, and Right.
The article shows that nearly square grids are more efficient than short and wide grids, assuming that the letters that you type are chosen at random.
A 7 x 6 grid requires an average of 4.23 key presses per character whereas a 4 x 11 grid requires an average of 4.89 key presses per character. Although the difference might not seem very big, the average length of a movie title is about 15 characters (including spaces). For a 15-character title, the mathematics suggests that using the Netflix interface requires about 10 fewer key presses (on average) than the Amazon Video interface.
If you wonder why I did not include the Hulu interface in this comparison, it is because the Hulu “keyboard” is a 1 x 28 grid that contains all letters and the space and backspace keys. Theory predicts an average of 9.32 key presses per character, which is almost twice as many key presses as for the Netflix interface.
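These theoretical averages can be checked by brute force. The following Python sketch (an illustration, not the author's SAS program) averages the city-block distance over all ordered pairs of cells, assuming every cell is equally likely and that a cell may be paired with itself; the function name is hypothetical.

```python
from itertools import product

def mean_l1_distance(rows, cols):
    """Average city-block distance over all ordered pairs of grid cells."""
    cells = list(product(range(rows), range(cols)))
    total = sum(abs(r1 - r2) + abs(c1 - c2)
                for (r1, c1), (r2, c2) in product(cells, repeat=2))
    return total / len(cells) ** 2

for name, (r, c) in {"7 x 6": (7, 6), "4 x 11": (4, 11), "1 x 28": (1, 28)}.items():
    print(name, round(mean_l1_distance(r, c), 2))

# Expected saving for a 15-character title, Netflix-shaped grid vs Amazon-shaped grid
saving = 15 * (mean_l1_distance(4, 11) - mean_l1_distance(7, 6))
print(round(saving, 1))
```

The saving comes out a little under 10 key presses, in line with the "about 10" quoted above.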
You might wonder how well this theoretical model matches reality. Movie titles are not a jumble of random letters! How do the Netflix and Amazon Video interfaces compare when they are used to type actual movie titles?
To test this question, I downloaded the titles of 1,000 highly-rated movies. I wrote a SAS program that calculates the number of the arrow keys that are needed to type each movie title for each interface. This section summarizes the results.
The expression “press the arrow key” is a bit long, so I will abbreviate it as “keypress” (one word). The “number of times that you need to press the arrow keys to specify a movie title” is similarly condensed to “the number of keypresses.”
For these 1,000 movie titles, the Netflix interface requires an average of 50.9 keypresses per title or 3.32 keypresses per character. The Amazon Video interface requires an average of 61.4 keypresses per title or 4.01 keypresses per character. Thus, on average, the Netflix interface requires 10.56 fewer keypresses per title, which closely agrees with the mathematical prediction that considers only the shape of the keyboard interface. A paired t test indicates that the difference between the means is statistically significant.
The difference between medians is similar: 45 for the Netflix interface and 56 for the Amazon interface.
The following comparative histogram (click to enlarge) shows the distribution of the number of keypresses for each of the 1,000 movie titles for each interface. The upper histogram shows that most titles require between 30 and 80 keypresses in the Amazon interface, with a few requiring more than 140 keypresses. In contrast, the lower histogram indicates that most titles require between 20 and 60 keypresses in the Netflix interface; relatively fewer titles require more than 140 keypresses.
You can also use a scatter plot to compare the number of keypresses that are required for each interface. Each marker in the following scatter plot shows the number of keypresses for a title in the Amazon interface (horizontal axis) and the Netflix interface (vertical axis). Markers that are below and to the right of the diagonal (45-degree) line are titles for which the Netflix interface requires fewer keypresses. Markers that are above and to the left of the diagonal line are titles for which the Amazon interface is more efficient. You can see that most markers are below the diagonal line. In fact, 804 titles require fewer keypresses in the Netflix interface, only 177 favor the Amazon interface, and 19 require the same number of keypresses in both interfaces. Clearly, the Netflix layout of the virtual keyboard is more efficient for specifying movie titles.
The scatter plot and histograms reveal that there are a few movies whose titles require many keypresses. Here is a list of the 10 titles that require the most keypresses when using the Amazon interface:
Most of the titles are long. However, one (4 Months, 3 Weeks and 2 Days) is not overly long but instead requires shifting back and forth between the two keyboards in the Amazon interface. That results in a large number of keypresses in the Amazon interface (178) and a large difference between the keypresses required by each interface. In fact, the absolute difference for that title (75) is the largest difference among the 1,000 titles.
You can also look at the movie titles that require few keypresses. The following table shows titles that require fewer than 10 keypresses in either interface. The titles that require the fewest keypresses in the Netflix interface are M, Moon, PK, and Up. The titles that require the fewest keypresses in the Amazon interface are Saw, M, Creed, and Up. It is interesting that Saw, which has three letters, requires fewer keypresses than M, which has one letter. That is because the S, A, and W letters are all located in the upper left of the QWERTY keyboard whereas the letter M is in the bottom row, far from the Q key. (Recall that the cursor begins on the Q letter in the upper left corner.)
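The Saw-versus-M observation can be checked with a simplified model. The Python sketch below is not the author's SAS program: it handles letters only (no spaces, digits, or keyboard switching), assumes the letter rows are left-aligned as described earlier, and uses hypothetical names throughout.

```python
# Letter rows only; the Space, Backspace, digit, and '123' keys are deliberately omitted.
NETFLIX_ROWS = ["abcdef", "ghijkl", "mnopqr", "stuvwx", "yz"]
AMAZON_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

def key_positions(rows):
    """Map each letter to its (row, column) position in the grid."""
    return {ch: (r, c) for r, row in enumerate(rows) for c, ch in enumerate(row)}

def keypresses(title, rows, start):
    """Count arrow-key presses needed to visit each letter of title, starting at `start`."""
    pos = key_positions(rows)
    r0, c0 = pos[start]
    total = 0
    for ch in title.lower():
        r1, c1 = pos[ch]                      # raises KeyError for non-letters
        total += abs(r1 - r0) + abs(c1 - c0)  # city-block distance between keys
        r0, c0 = r1, c1
    return total

print(keypresses("Saw", AMAZON_ROWS, "q"), keypresses("M", AMAZON_ROWS, "q"))
print(keypresses("M", NETFLIX_ROWS, "a"), keypresses("Up", NETFLIX_ROWS, "a"))
```

Under this model, Saw needs fewer arrow presses than M on the QWERTY layout because S, A, and W sit next to the starting Q key while M is six columns and two rows away.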
In summary, both Netflix and Amazon Video provide an interface that enables customers to select movie titles by using the four arrow keys on a TV remote control.
The Netflix interface is a 7 x 6 grid of letters; the Amazon interface is a 3 x 11 QWERTY keyboard and a separate keyboard for numbers.
In practice, both interfaces display partial matches and you only need to type a few characters. However, it is interesting to statistically compare the interfaces in terms of efficiency.
For a set of 1,000 movie titles, the Netflix interface requires, on average, 10.6 fewer keypresses than the Amazon interface to completely type the titles. This article also lists the movie titles that require the most and the fewest number of key presses.
If you would like to duplicate or extend this analysis, you can download the SAS program that contains the data.
This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post. |
Here in North Carolina (NC), we’re pretty much resigned to the fact that many of the hurricanes in the Atlantic Ocean are going to visit us. NC sticks out farther into the ocean than most of our neighboring states, and that just makes a tempting target for the hurricanes. But […]
The post Hurricane Florence – and other category 4 hurricanes that hit NC appeared first on SAS Learning Post.
This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post. |
This blog post introduces the use of deep learning to train a deep neural network to further improve performance; and hybrid architectures.
The post Speeding Up Your Analytics with Machine Learning (Part 2) appeared first on SAS Learning Post.
The post Distances on rectangular grids appeared first on The DO Loop.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post. |
Given a rectangular grid with unit spacing, what is the expected distance between two random vertices, where distance is measured in the L_{1} metric?
(Here “random” means “uniformly at random.”)
I recently needed this answer for some small grids, such as the one to the right, which is a 7 x 6 grid.
The graph shows that the L_{1} distance between the points (2,6) and (5,2) is 7, the length of the shortest path that connects the two points. The L_{1} metric is sometimes called the “city block” or “taxicab” metric because it measures the distance along the grid instead of “as the bird flies,” which is the Euclidean distance.
The answer to the analogous question for the continuous case (a solid rectangle) is difficult to compute. The main result is that the expected distance is less than half of the diameter of the rectangle. In particular, among all rectangles of a given area, a square is the rectangle that minimizes the expected distance between random points.
Although I don’t know a formula for the expected distance on a discrete regular grid, the grids in my application were fairly small so this article shows how to compute all pairwise distances and explicitly find the average (expected) value. The DISTANCE function in SAS/IML makes the computation simple because it supports the L_{1} metric. It is also simple to perform computer experiments to show that among all grids that have N*M vertices, the grid that is closest to being square minimizes the expected L_{1} distance.
An N x M grid contains NM vertices. Therefore the matrix of pairwise distances is NM x NM. Without loss of generality, assume that the vertices have X coordinates between 1 and N and Y coordinates between 1 and M. Then the following SAS/IML function defines the distance matrix for vertices on an N x M grid. To demonstrate the computation, the distance matrix for a 4 x 4 grid is shown, along with the average distance between the 16 vertices:
proc iml;
start DistMat(rows, cols);
   s = expandgrid(1:rows, 1:cols);    /* create (x, y) ordered pairs */
   return distance(s, "CityBlock");   /* pairwise L1 distance matrix */
finish;

/* test function on a 4 x 4 grid */
D = DistMat(4, 4);
AvgDist = mean(colvec(D));
print D[L="L1 Distance Matrix for 4x4 Grid"];
print AvgDist[L="Avg Distance on 4x4 Grid"];
For an N x M grid, the L_{1} diameter of the grid is the L_{1} distance between opposite corners. That distance is always (N-1)+(M-1), which equates to 6 units for a 4 x 4 grid.
As for the continuous case, the expected L_{1} distance is less than half the diameter. In this case, E(L_{1} distance) = 2.5.
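The 4 x 4 numbers are easy to confirm independently. Here is a Python sketch of the same computation (an illustration with a hypothetical `dist_mat` helper, not a replacement for the SAS/IML DISTANCE function):

```python
from itertools import product

def dist_mat(rows, cols):
    """Pairwise L1 (city-block) distances between all vertices of a rows x cols grid."""
    pts = list(product(range(1, rows + 1), range(1, cols + 1)))
    return [[abs(x1 - x2) + abs(y1 - y2) for (x2, y2) in pts]
            for (x1, y1) in pts]

D = dist_mat(4, 4)
n = len(D)                               # number of vertices
avg_dist = sum(map(sum, D)) / n ** 2     # mean over all entries, diagonal included
diameter = max(map(max, D))              # distance between opposite corners
print(n, avg_dist, diameter)
```

The mean is taken over every entry of the matrix, including the zeros on the diagonal, which mirrors `mean(colvec(D))` in the SAS/IML code.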
As indicated previously, the expected distance between two random vertices on a grid depends on the aspect ratio of the grid. A grid that is nearly square has a smaller expected distance than a short-and-wide grid that contains the same number of vertices. You can illustrate this fact by computing the distance matrices for grids that each contain 36 vertices. The following computation computes the distances for five grids: a 1 x 36 grid, a 2 x 18 grid, a 3 x 12 grid, a 4 x 9 grid, and a 6 x 6 grid.
/* average L1 distance for 36 vertices arranged in several grid configurations */
N=36;
rows = {1, 2, 3, 4, 6};
cols = N / rows;
AvgDist = j(nrow(rows), 1, .);
do i = 1 to nrow(rows);
   D = DistMat(rows[i], cols[i]);
   AvgDist[i] = mean(colvec(D));
end;
/* show average distance as a decimal and as a fraction */
numer = AvgDist*108;
AvgDistFract = char(numer) + " / 108";
print rows cols AvgDist AvgDistFract;
The table shows that short-and-wide grids have an average distance that is much greater than a nearly square grid that contains the same number of vertices.
When the points are arranged in a 6 x 6 grid, the distance matrix naturally decomposes into a block matrix of 6 x 6 symmetric blocks, where each block corresponds to a row.
When the points are arranged in a 3 x 12 grid, the distance matrix decomposes into a block matrix of 12 x 12 blocks.
The following heat maps visualize the patterns in the distance matrices:
D = DistMat(6, 6);
call heatmapcont(D) title="L1 Distance Between Vertices in a 6 x 6 Grid";
D = DistMat(3, 12);
call heatmapcont(D) title="L1 Distance Between Vertices in a 3 x 12 Grid";
You can demonstrate how the average distance depends on the number of rows by choosing the number of vertices to be a highly composite number such as 5,040. A highly composite number has many factors. The following computation computes the average distance between points on 30 grids that each contain 5,040 vertices. A line chart then displays the average distance as a function of the number of rows in the grid:
N = 5040;   /* 5040 is a highly composite number */
rows = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 15, 16, 18, 20,
        21, 24, 28, 30, 35, 36, 40, 42, 45, 48, 56, 60, 63, 70};
cols = N / rows;
AvgDist = j(nrow(rows), 1, .);
do i = 1 to nrow(rows);
   D = DistMat(rows[i], cols[i]);
   AvgDist[i] = mean(colvec(D));
end;

title "Average L1 Distance Between 5040 Objects in a Regular Grid";
call series(rows, AvgDist) grid={x y} other="yaxis offsetmin=0 grid;";
Each point on the graph represents the average distance for an N x M grid where NM = 5,040. The horizontal axis displays the value of N (rows).
The graph shows that nearly square grids (rows greater than 60) have a much lower average distance than very short and wide grids (rows less than 10). The scale of the graph makes it seem like there is very little difference between the average distance in a grid with 40 versus 70 rows, but that is somewhat of an illusion.
A 40 x 126 grid (aspect ratio = 3.15) has an average distance of 55.3; a 70 x 72 grid (aspect ratio = 1.03) has an average distance of 47.3.
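For large grids you can avoid building the full NM x NM matrix. Because the L1 distance is the sum of a row difference and a column difference, and the two coordinates of a uniformly random vertex are independent, the average splits into two one-dimensional averages. The Python sketch below relies on that property; it is an alternative technique with hypothetical function names, not the method used in the SAS/IML code above.

```python
from fractions import Fraction

def mean_abs_diff(n):
    """Exact mean of |i - j| for i, j drawn independently and uniformly from 1..n."""
    # There are 2*(n - d) ordered pairs (i, j) with |i - j| = d, for d = 1..n-1.
    total = sum(2 * d * (n - d) for d in range(1, n))
    return Fraction(total, n * n)

def mean_l1(rows, cols):
    """Exact average L1 distance between two random vertices of a rows x cols grid."""
    return mean_abs_diff(rows) + mean_abs_diff(cols)

for r, c in [(40, 126), (70, 72)]:
    print(r, c, round(float(mean_l1(r, c)), 1))

print(mean_l1(6, 6))      # exact fraction for the 6 x 6 grid
```

Exact fractions also make it easy to compare grid shapes without worrying about floating-point rounding.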
In summary, you can use the DISTANCE function in SAS/IML to explicitly compute the expected L_{1} distance (the “city block” distance) between random points on a regular grid. You can minimize the average pairwise distance by making the grid as square as possible.
City planning provides a real-world application of the L_{1} distance. If you are tasked with designing a city with N buildings along a grid, then the average distance between buildings is smallest when the grid is square.
Of course, in practice, some buildings have a higher probability of being visited than others (such as a school or a grocery). You should position those buildings in the center of town to shorten the average distance that people travel.
This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post. |
I usually create very technical graphs, that just focus on conveying the information in a concise and straightforward manner (no wasted colors, and nothing to distract you from the data). But sometimes, depending on your audience and the purpose of the graph, you might need to create a graph that […]
The post Creating industry-specific infographics (eg, 3d printing) appeared first on SAS Learning Post.