Research Methods Resources

step 5 in a research project

Handling data

Quick links to some resources on this page

Research data management training course (pdf 1,002 Kb)

Data management (chapter 4.6 of The Green Book) (pdf 328 Kb)

The best backup strategy for your research data (pdf 2,605 Kb)

The Electronic logbook

The two faces of MS Excel

MS Excel problems by David A. Heiser

MS Excel for Statistics from SSC

SSC-Stat add-in

Instat+ software including tutorials

Data management

Data management is an often overlooked aspect of research work and involves more than you might think at first sight. Poor data management can cause many problems during your research project, the least of which is lost time. Rather than giving static definitions, we invite you to explore the importance of data and data management through the following link.

What are data?

 

Many researchers use MS Excel for data entry. We first show a selection of data management problems we encountered.

Common problems using an MS Excel spreadsheet

 

These and many other problems can be solved by entering data in a 'long list' format. This lets you use the full power of MS Excel for data exploration, data validation and error checking.

Data entry and validation in an MS Excel spreadsheet
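As a rough illustration of the idea, the sketch below reshapes a small 'wide' table into a list with one row per observation. It assumes that 'long list' format means one row per observation and one column per variable; the column names and values are hypothetical and only serve the example.

```python
import csv
import io

# Hypothetical 'wide' layout: one row per plot, one column per year.
WIDE = """plot,height_2003,height_2004
A1,1.2,1.9
A2,0.8,
"""

def wide_to_long(text):
    """Return one record per (plot, year) observation; blank cells are skipped."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        for column, value in row.items():
            if column == "plot" or not (value or "").strip():
                continue
            # The year is the part of the column name after the last underscore.
            _, year = column.rsplit("_", 1)
            records.append({"plot": row["plot"],
                            "year": int(year),
                            "height": float(value)})
    return records

long_rows = wide_to_long(WIDE)
for record in long_rows:
    print(record)
```

In this layout every column holds a single variable, which is what makes filtering, validation rules and pivot-style summaries straightforward.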

 

Moving on towards relational databases

Simple datasets are easy to manage with a spreadsheet, but many interrelated datasets are not. In a spreadsheet they require multiple copies of the same data to produce the intended result in an intuitive way, and the separate copies are difficult to keep consistent in an active dataset. The efficient, less error-prone way of using spreadsheets is not intuitive at all: it requires a deep understanding of advanced, spreadsheet-specific functions and a lot of self-discipline in avoiding shortcuts. Data managed in a relational database, on the other hand, are more likely to produce accurate results because integrity rules can be enforced, and they can be queried efficiently even when the volume is large, provided the relationships in the data are well defined and correctly implemented.

You will find more information on this in session 4, session 5 and session 6 of the Research Data Management course. In session 4 you are asked to perform six tasks so you can experience the limitations of spreadsheets. Session 5 introduces you to MS Access as an example of a database management system that overcomes the limitations of spreadsheets. Session 6 introduces you to hierarchical data structures and methods to help preserve the integrity of your data. You will see how a well-designed database makes querying easier.
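To make "integrity rules" concrete, here is a minimal sketch using SQLite through Python's standard sqlite3 module. SQLite stands in for MS Access here only because it needs no installation; the table names, columns and values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this setting is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE farm (farm_id INTEGER PRIMARY KEY, village TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE plot (
        plot_id INTEGER PRIMARY KEY,
        farm_id INTEGER NOT NULL REFERENCES farm(farm_id),
        area_ha REAL CHECK (area_ha > 0)
    )
""")
conn.execute("INSERT INTO farm VALUES (1, 'Village A')")
conn.execute("INSERT INTO plot VALUES (10, 1, 0.5)")

def try_insert(sql):
    """Run an INSERT and report whether the database accepted it."""
    try:
        conn.execute(sql)
        return "accepted"
    except sqlite3.IntegrityError as error:
        return f"rejected: {error}"

# A plot that refers to a farm that does not exist.
orphan_result = try_insert("INSERT INTO plot VALUES (11, 99, 0.3)")
# A plot with an impossible area.
negative_result = try_insert("INSERT INTO plot VALUES (12, 1, -2.0)")
print(orphan_result)
print(negative_result)

# Each fact is stored once and combined when queried.
summary = conn.execute(
    "SELECT village, COUNT(*), SUM(area_ha) "
    "FROM farm JOIN plot USING (farm_id) GROUP BY village"
).fetchall()
print(summary)
```

Both faulty rows are refused by the database itself, so the check does not depend on the discipline of whoever enters the data.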

 

 

Linking or importing files between different software packages

In session 4 of the Research Data Management course, the advantages of using a database became clear. A spreadsheet can be very useful but has some limitations. If you have data in an MS Excel spreadsheet and you want to do more complicated things with them, you can easily import the named range into a database or into analysis software if you followed the above-mentioned rules. A danger of working on the same data with several software packages, however, is that you end up with several copies of the same dataset. If changes are later made to one of the copies, it can be quite confusing to distinguish between the modified and original copies. So, as a general rule, avoid importing data and try instead to link the different software packages.

The following technical note covers in detail how to import and how to link data between different software packages. It is a ‘how to’ that can be used in the compilation, querying and analysis phase of the project life cycle (see the research data management course notes).
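If copies already exist, one way to find out whether they have drifted apart is to compare checksums. The sketch below is only an illustration of that check, not a method taken from the technical note; the file names and contents are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

with tempfile.TemporaryDirectory() as folder:
    master = Path(folder) / "survey.csv"          # hypothetical master file
    copy = Path(folder) / "survey_imported.csv"   # hypothetical imported copy
    master.write_text("plot,height\nA1,1.2\n")
    copy.write_text(master.read_text())
    same_before = sha256_of(master) == sha256_of(copy)

    # Someone corrects a value in the copy only.
    copy.write_text("plot,height\nA1,1.9\n")
    same_after = sha256_of(master) == sha256_of(copy)

print(same_before, same_after)
```

A checksum tells you that two copies differ, not which one is correct, which is why linking to a single master dataset remains the safer habit.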

Buysse, Wim. 2003. The Dupe of Duplication. ICRAF Research Support Unit Technical Note No 2. World Agroforestry Centre (ICRAF), Nairobi, Kenya. 50 pp.

 

Dupe of duplication.pdf  (7,591 KB)

 

A personal backup strategy

There is a large literature on perceptions of risk. Most people overestimate the chance of relatively rare disasters such as earthquakes and plane crashes, and underestimate the chance of common problems. In particular, they underestimate the chance of their computer crashing, a probability that approaches 1. Power fluctuations and dirt can damage the hard disk, and the read/write mechanism can simply wear out. In addition, the disk can become too fragmented and slow down; this and several other causes lead to corrupted files, lost files, an inaccessible hard disk or worse. On top of that there are different categories of human error: loss of a laptop, deleting files, deleting parts of files. Damage can also be done on purpose by unscrupulous persons, jealous colleagues or disgruntled employees: theft of laptops, deleting files or damaging computers on purpose, … And then of course there is always a chance of fire, water damage, … Last but not least there is an ever-growing group of “malware”: computer viruses, Trojan horses, backdoors, spyware, …

The following technical note shows you how to develop a personal backup strategy for your research data. We advise you to have your own personal strategy in addition to the backup strategy of your organisation.

The goal of this note is not to protect you from a complete system failure. That is the rather complex job of a skilled system administrator. The goal of this note is to protect you from loss of data from your current research activities. In a way, the goal of this note is to protect you from yourself, since it involves some discipline. The only way to avoid unrecoverable data loss is to back up regularly in an organised way.
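As a minimal sketch of what "regularly and in an organised way" could look like, the function below copies a data folder into a new timestamped folder and keeps only the most recent copies. It is not the procedure from the technical note; the function name, folder names and the `keep` parameter are assumptions made for illustration.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path

def backup_folder(source, backup_root, keep=5, now=None):
    """Copy `source` into a new timestamped folder under `backup_root`.

    Only the `keep` most recent backups are retained (keep must be >= 1).
    """
    now = now or datetime.now()
    backup_root = Path(backup_root)
    backup_root.mkdir(parents=True, exist_ok=True)
    target = backup_root / f"{Path(source).name}_{now:%Y%m%d_%H%M%S}"
    shutil.copytree(source, target)
    # Timestamped names sort chronologically, so the oldest come first.
    backups = sorted(p for p in backup_root.iterdir() if p.is_dir())
    for old in backups[:-keep]:
        shutil.rmtree(old)
    return target

# Demonstration in a temporary directory with hypothetical data.
with tempfile.TemporaryDirectory() as folder:
    data = Path(folder) / "trial_data"
    data.mkdir()
    (data / "measurements.csv").write_text("plot,height\nA1,1.2\n")
    root = Path(folder) / "backups"
    for day in (1, 2, 3):
        backup_folder(data, root, keep=2, now=datetime(2005, 1, day))
    remaining = sorted(p.name for p in root.iterdir())

print(remaining)
```

In practice `backup_root` would point to a different physical device than the working copy, since a backup on the same hard disk does not survive a disk failure or a stolen laptop.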

 

Buysse, Wim and Maina, Paul. 2005. The best backup strategy for your research data. ICRAF Research Support Unit Technical Note No 1a. ICRAF World Agroforestry Centre Nairobi, Kenya. 49 pp.

 

Best backup strategy for your research data.pdf  (2,605 KB)

A data management strategy for your organisation

Until now we have focused on data management skills and tools for individual researchers. It is, however, also necessary to develop a data management strategy for projects or whole institutes. Such a strategy includes a description of all the steps in data collection, entry, processing and storage. It should indicate why, where, how and by whom each component is executed. It must be relevant to the objectives and constraints of the scientists, project and institute. At the same time it must meet the requirements of ensuring data quality, maintaining the long-term value of the data and allowing efficient processing. The strategy also needs an implementation plan. It will normally make sense to start at the lowest level, improving the management of data from individual studies. A strategy needs commitment from staff at all levels of the organisation. Managers as well as technicians need to understand the importance of, and the benefit gained from, good data management. Resources in terms of time and money must be allocated to these tasks, and it is often the managers who have control over financial budgets and staff workloads.

Session 7 of the Research Data Management course and its supporting documentation give an introduction to the topic.

Advanced data management tools

Click here to learn more about The Electronic Logbook, a data management tool designed to help you organise research data across projects and sites.

Choosing statistical software

 

"Are we there yet?" 

So you have a solid research proposal, have critically reviewed relevant literature, have a set of clear study objectives and hypotheses and have organised your data in a spreadsheet or database and checked for errors. 

Researchers and students not only want to jump immediately to the statistical analysis of their data, they also expect too much from the statistical software.

We first show you what you can, cannot and should not do with MS Excel before giving our view on when to move to and how to use statistical software.

Go to The two faces of MS Excel

 

What is the best statistical software? Which software should I use because I work on biotechnology, not on classical agricultural experiments?

These are two examples of questions we often hear. We answer those questions by asking you a question.

 

These are the ages of three randomly selected male staff members of the ICRAF-ILRI Research Methods Group. They are from the same ethnic and geographical background. Which statistical test do we need to prove which person is the oldest?

Click here for the answer.

A strategy for choosing statistical software

The rest of this page is still under construction.

External links

 

Recommended literature

 

Teaching

Training courses and university lectures on analysing data often jump immediately to the phase of statistical analysis. Data management is often not taught.

Sometimes, a course in statistical analysis is actually training in the use of a specific software package with standard datasets.

We believe it is essential to start by thoroughly mastering data management before moving on to statistical analysis. In real life, data management usually takes much more time than the statistical analysis, and the quality of the statistical analysis depends heavily on descriptive statistics and visual exploration of the data. This first phase usually already gives you an answer. The formal statistical analysis is then needed only to confirm what you already know from exploring your data and to add measures of precision.

It is absolutely necessary to give hands-on training on computers using real datasets so students are prepared to work with real datasets. The small, cleaned-up datasets you find in old textbooks do not exist in reality.

Statistical software is a tool. You use it by clicking on a button or writing a command. You have to know which button or which command. You find this in manuals, books, websites. If you know this for one software package, it becomes easier to use another one. It is more important to understand concepts and principles and to place your statistical analysis within the context of the whole research process.

While it might be helpful to give an introduction on how to use a specific software package, you can only really learn it by using it and trying out new things.

Researchers and students should not rely on training courses to learn a software package but should spend enough time trying it out themselves in their own time. Start with something that is user-friendly and free. On this CD-ROM you find GenStat Discovery Edition and Instat+, but use the software you can afford, that you are comfortable with and that easily does the kind of statistical analysis you need. Focus on understanding concepts and principles.

Our advice to students and young researchers is to invest time in learning R as a second software package, so that you will be ready to move to R when it becomes the standard and will be able to develop your own solutions for analysing complex datasets using complex methods that are otherwise only available in expensive commercial software.

 
