Research Methods Resources

Using MS Excel for statistics


The two faces of MS Excel

Although many statisticians look down on MS Excel, it has many advantages. It has a very large user base (an estimated 440 million users in 2003), and the majority of those users are comfortable with it. As a result there are many books and websites on using MS Excel, so you are very likely to find a solution for any problem you encounter.

The picture shows just a small random selection of books on MS Excel. "Special Edition Using Microsoft Excel 2000" by Patrick Blattner et al., published by Que, is an especially worthwhile investment: it covers almost everything from basic to advanced use. A low-priced reprint from Prentice-Hall of India is for sale on the Indian subcontinent and in bookshops in some African capitals, and the latest edition covers MS Excel 2003. You can find the book on the website of Que Publishing, and the Eastern Economy Edition on the website of Prentice-Hall of India Pvt. Ltd.

If you don't find the solution to your MS Excel problem in a book, you will find it on the Internet. Use your Internet search skills to find an answer to your question. One site where you are likely to find an answer is the site of Microsoft's MVPs ("Most Valuable Professionals"): http://www.mvps.org/links.html#Excel

Another advantage of MS Excel is that you can extend and customize it using macros and VBA scripts. The MVP site is also a good starting point to learn more about these possibilities.
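As a small taste of what a macro can do, here is a minimal sketch of a data-checking macro. The sheet name "Trial1", the range B2:B101 and the plausibility limits 0 to 20 are made-up assumptions; adapt them to your own worksheet.

    Sub FlagSuspectValues()
        ' Highlight cells in a data column that fall outside plausible limits.
        Dim cell As Range
        For Each cell In Worksheets("Trial1").Range("B2:B101")
            If Not IsEmpty(cell.Value) And IsNumeric(cell.Value) Then
                If cell.Value < 0 Or cell.Value > 20 Then
                    cell.Interior.ColorIndex = 3   ' mark the suspect value in red
                End If
            End If
        Next cell
    End Sub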

MS Excel is a fast and easy tool for:

  • descriptive statistics

  • data exploration

  • error checking and data validation

The pivot tables and pivot charts in particular are extremely powerful tools. The main drawback is that you can do almost anything in MS Excel, so you have to use it with discipline. Below we give three rules.
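To give an idea of what a pivot table does, the sketch below builds one in VBA that shows the mean of a "Yield" column for each level of a "Treatment" column. The sheet names "Data" and "Summary", the range A1:D101 and the field names are assumptions for the illustration only; in everyday use you would simply choose Data > PivotTable and PivotChart Report and drag the fields into place.

    Sub MeanYieldByTreatment()
        ' Summarise a hypothetical data sheet: mean Yield per Treatment.
        Dim pc As PivotCache
        Dim pt As PivotTable
        Set pc = ActiveWorkbook.PivotCaches.Add( _
            SourceType:=xlDatabase, _
            SourceData:=Worksheets("Data").Range("A1:D101"))
        Set pt = pc.CreatePivotTable( _
            TableDestination:=Worksheets("Summary").Range("A3"), _
            TableName:="ptYield")
        pt.PivotFields("Treatment").Orientation = xlRowField
        pt.PivotFields("Yield").Orientation = xlDataField
        pt.DataFields(1).Function = xlAverage   ' show the mean rather than the default sum
    End Sub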

 

Disciplined use of MS Excel for research data management

The question is not whether or not to use MS Excel. The question is when and how.

A disciplined use of MS Excel leads to the perfect spreadsheet.

Rule 1: ensure proper data management at folder, file and worksheet level to avoid multiple copies and loss of data.

As mentioned before, entering data on a computer without proper folder, file and back-up management can end in total data loss. You can find more information on proper data management practices in the Handling data section.

Rule 2: ensure proper spreadsheet layout to be able to make real use of MS Excel (or of any computer)

The first example here is typical: a nice-looking spreadsheet that contains a lot of data, but one that is almost impossible to use for making summaries and further calculations.

You can find more information on proper spreadsheet layout in the section on Designing an MS Excel spreadsheet for research data.
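As a simplified illustration of a layout that does work (the variable names and values below are made up): keep all the data in one rectangular block, with the variable names in the first row and one row per observation.

    Plot   Treatment    Date         Yield
    1      Control      14/03/2005   2.4
    2      Mulch        14/03/2005   3.1
    3      Fertiliser   14/03/2005   2.9

A spreadsheet laid out like this can be used directly for pivot tables, filters and import into statistical packages.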

Rule 3: keep graphs simple or you lose the comparative advantage of using extra brain areas

“Data graphics should draw the viewer’s attention to the sense and substance of the data, not to something else. The data graphical form should present the quantitative contents. Occasionally artfulness of design makes a graphic worthy of the Museum of Modern Art, but essentially statistical graphics are instruments to help people reason about quantitative information.” - p. 91

Edward R. Tufte. 1983. The Visual Display of Quantitative Information. Graphics Press, Cheshire, Connecticut. 197 pp.

For more information on the work of Edward Tufte, visit http://www.edwardtufte.com/tufte/ 

For a very long time some statisticians have been interested in the graphical display of data. In recent years neuroscience research has proven their point: humans think and remember most efficiently in a visual way. When you look at a graph or visualise something, many more brain regions are activated than when you look at, for instance, a table.

Comparing numbers in tables is hard work that relies on the rather limited working memory. If you want to compare two numbers that sit in different rows of two different columns, you have to read the value in column 1, remember it and its location, move to column 2, read and remember that value and its location, return to column 1 to compare, and so on.

Graphs are better for perceiving trends, making comparisons and making predictions. Tables are best for looking up specific information. Most of the data exploration phase is about trying to discover patterns and trends and about comparing the order of magnitude of, for instance, treatments, so much of the data exploration work involves making graphs.

However, when you create graphs that are full of color and 3D effects and lack a clear order, you lose all the comparative advantage of using extra brain areas. The following two graphs (copied from figure 3 of Wilkinson) illustrate the point. In the scatterplot on the left you immediately see a descending trend and spot the outlier of 1 partner per year. In the 3D pie chart on the right, you see a lot of green and some red. It takes some effort to see that red equals 0 and green equals 1, and much more effort to compare 2, 3 and 4.

 

A good starting point for designing effective and powerful graphs is Cleveland's hierarchy of graphical features. We can judge position along a common scale more accurately than position along nonaligned scales, which in turn we can judge more accurately than length, and so on.

This figure was copied from Wilkinson's paper "Presentation graphics", which gives an overview of the history of presentation graphics, principles of usage, and applications. It was downloaded from the home page of Prof. Leland Wilkinson and should be cited as: Wilkinson, Leland. Presentation graphics. 13 pp. Accessed online at http://www.spss.com/research/wilkinson/Publications/iesbs.pdf

It refers to: Cleveland, W. S. (1994). The Elements of Graphing Data (Rev. Ed.). Summit, NJ: Hobart Press.

Creating graphs in MS Excel is very simple, and the very powerful pivot chart option allows you to quickly drag and drop variables into a chart. However, the default MS Excel graphs are so full of clutter and chartjunk that they become quite useless for easily discovering patterns and trends. You have to do some work before they become more useful.


An example. 

The aim of this graph is to show probability as a long run frequency. While the relative frequency per month fluctuates, the cumulative relative frequency stabilises in the long run near a value of 0.79.

In this default MS Excel graph, you see that the purple line stabilises but there is a lot of distraction.

The data in the example are taken from McColl p. 16 (John H McColl. 1995. Probability. Modular Mathematics Series. Edward Arnold; London, Sydney, Auckland. 182 pp. ISBN 0 340 61426 9)
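To reproduce this kind of graph you only need two simple formulas. In a hypothetical layout with the number of trials for each month in column B and the number of times the event occurred in column C (data starting in row 2), the relative frequency for a month is =C2/B2 and the cumulative relative frequency is =SUM($C$2:C2)/SUM($B$2:B2), copied down the column. The second series is the line that settles down near 0.79.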

 

You have to use several of the Chart menu options to format the axes, clear the gridlines, format the plot area to get rid of the gray background and of course add meaningful titles.
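If you have to clean up many charts, a short macro can apply the same changes in one go. The sketch below assumes the chart you want to tidy is the active (selected) chart; the titles are only placeholders.

    Sub TidyActiveChart()
        ' Remove the usual clutter from the active chart and add titles.
        With ActiveChart
            .HasTitle = True
            .ChartTitle.Text = "Cumulative relative frequency by month"   ' placeholder title
            .Axes(xlCategory).HasTitle = True
            .Axes(xlCategory).AxisTitle.Text = "Month"                    ' placeholder axis title
            .Axes(xlValue).HasMajorGridlines = False                      ' clear the gridlines
            .PlotArea.Interior.ColorIndex = xlNone                        ' remove the gray background
        End With
    End Sub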

The same graph after cleaning up. It is easier to understand the trend.


More resources on creating good graphs and tables in MS Excel:

Statistical Services Centre. 2001. Guidelines for good statistical graphics in Excel. Statistical Services Centre, The University of Reading. 15 pp.

xlg.pdf

Statistical Services Centre. 2002. Good tables for Excel users. Statistical Services Centre, The University of Reading. 15 pp.

gtx.pdf

Important dangers when using MS Excel for statistical analysis.

So far, so good. As long as you use MS Excel in a disciplined way it is a perfect tool for your research project in terms of:

  • descriptive statistics

  • data exploration

  • error checking

MS Excel also has a built-in Analysis ToolPak add-in (luckily not installed by default), and here is where the trouble starts. Probably under the influence of business and accountancy courses, and sometimes because funds for statistical software are lacking, quite a few lecturers use MS Excel to teach statistics to students who later in life continue to analyse their data with MS Excel.

 

You have to be aware that MS Excel has very limited statistical tools:

  • very limited ANOVA possibilities

  • cannot handle missing data

  • regression through the origin gives wrong results

  • cannot analyse subsets

  • limited output on the analysis

  • very limited help on statistical functions, which is often confusing and sometimes incorrect

  • you often get stuck when you use a worksheet created in one language version of MS Excel in another one. For instance, in the English MS Excel you create a random number with the function =RAND(), while in the French MS Excel you use =ALEA() (because the translation of random is aléatoire). Translating the functions helps researchers in many non-English-speaking countries understand what they are doing, but it is problematic when they collaborate with researchers from English-speaking countries.

  • it works with a maximum of 15 significant digits

  • in older versions the accuracy of VAR and STDEV, which are used internally in many other Excel functions and routines, was a problem

While the accuracy problems of VAR and STDEV have been solved in MS Excel 2003, these functions still give wrong results when a dataset of large numbers contains only a small variation.
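Much of the explanation is numerical. Older versions reportedly computed the variance with the one-pass "calculator" formula (sum of squares minus the squared sum divided by n), which subtracts two almost equal large numbers and so loses nearly all precision, and with only about 15 significant digits available the squares of large values cannot even be stored exactly. The VBA sketch below, using made-up numbers, contrasts that formula with the stable two-pass formula based on deviations from the mean.

    Sub CompareVarianceFormulas()
        ' Small variation (steps of 0.1) around a large value (100 million).
        Dim x As Variant, i As Long, n As Long
        x = Array(100000000.1, 100000000.2, 100000000.3, 100000000.4)
        n = UBound(x) - LBound(x) + 1

        Dim sumX As Double, sumX2 As Double, mean As Double, ss As Double
        For i = LBound(x) To UBound(x)
            sumX = sumX + x(i)
            sumX2 = sumX2 + x(i) * x(i)   ' the squares need ~19 significant digits
        Next i
        mean = sumX / n
        For i = LBound(x) To UBound(x)
            ss = ss + (x(i) - mean) ^ 2   ' deviations from the mean are small and safe
        Next i

        Debug.Print "one-pass variance: "; (sumX2 - sumX * sumX / n) / (n - 1)   ' typically far off
        Debug.Print "two-pass variance: "; ss / (n - 1)                          ' true value is 0.01666...
    End Sub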


Some claim that you should not do any statistical analysis with MS Excel at all. A lot of the criticism, however, is based on older versions and on the default display of the spreadsheet.

A very detailed and comprehensive study on faults and problems in MS Excel 2000 and 2003 can be found on the website of David A. Heiser.

Go to the local copy of "Microsoft Excel 2000 and 2003 Faults, Problems, Workarounds and Fixes", a website of David A. Heiser. 

"Microsoft Excel 2000 and 2003 Faults, Problems, Workarounds and Fixes" was copied (on 23 May 2006) with permission from the website of David A. Heiser. The source URL is http://www.daheiser.info/  You can further copy and use the articles as long as you refer to the original source URL and give credit to the author. The author also requests to email him about any other errors and faults in MS Excel that he didn't mention. You can find his email address on his website.

 

The following copy of the Excel for Statistics page of the Statistical Services Centre of the University of Reading contains many resources, including a macro-writing tutorial and tips and warnings for using Excel for statistics. The source URL is http://www.ssc.rdg.ac.uk/software/excel/home.html

 

So what now? 

One solution is to keep on teaching statistics using MS Excel while being aware of its limitations. The webpage of ASSUME (The Association of Statistics Specialists Using Microsoft Excel) gives links to many useful resources.

 

Or you can use a third party add-in to overcome the limitations of MS Excel. An example is SSC-Stat from the Statistical Services Centre of the University of Reading. The full version of SSC-Stat may be downloaded and used for non-commercial purposes by any individual free of charge. There is no copy-protection, time limit or special data size restriction. It must be licensed for multiple, government and commercial use.

Go to the SSC-Stat page

 

While using Excel to teach statistics might be useful in secondary education or in the early undergraduate years at university, we would encourage lecturers to introduce students to professional statistical software. Several professional statistical packages, such as Instat, GenStat Discovery Edition and R, are available for free. The first two are available on this CD-ROM together with many resources that can be used for teaching.

Students should become fluent in good data management practices using MS Excel (see the disciplined use above) but should be taught how to move beyond the limitations of a spreadsheet package. Introducing students at university to professional statistical software will give them a head start in their professional careers.


 

 
