Are you thinking about moving data (e.g., patient files, images, telemedicine survey data) from that clunky old Yugo computer or storage system to a brand new shiny Ferrari system? Are those happy thoughts or are they clouded by dread and visions of digital monsters eating up all your data and laughing in your face? You are not alone.
Data migration can be painful, but there are ways to lessen that pain with careful preparation. First, let's start with a few basic terms. Legacy data is what you have now and want to transfer. Data migration is the process of importing that data into the new system. Data cleansing is getting the old data ready to move – making it compatible with what the new system requires or expects. It's this last part that is often the most difficult.
One of the first steps in this process is to decide what your goals are. Here are some questions you should be thinking about:
- What type of data do you have & what are the uses? (volume vs value)
- Should you bring the old data over? All? Some? What are the decision criteria? This is generally a volume vs value question
- Is the desired data extractable from the existing database? (A sketch of a simple automated extraction follows after this list.)
- Are the desired fields importable?
- Is migration of the legacy data included in the purchase of the new system?
- If not, do you have the expertise in-house, or should you hire out?
- Will the process be automated or manual?
- How much are you willing to spend, in both money and time?
- Is this a short-term or long-term commitment?
- Are there any legal issues?
- Does the data already exist anywhere else (e.g., an EMR)?
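To make the extractability and automation questions concrete, here is a minimal sketch of an automated export from a legacy database. It assumes the legacy data sit in a SQLite file called legacy.db with a patients table; those names, and the output file, are placeholders for your own system.

```python
# Minimal sketch: export one legacy table to CSV so it can be cleansed and
# imported elsewhere. "legacy.db" and the "patients" table are placeholders.
import csv
import sqlite3

conn = sqlite3.connect("legacy.db")
cursor = conn.execute("SELECT * FROM patients")
columns = [col[0] for col in cursor.description]   # column names from the legacy schema

with open("patients_export.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(columns)   # header row
    writer.writerows(cursor)   # one CSV row per legacy record

conn.close()
```

If the legacy system is not a database you can query directly, the same idea still applies: get everything into a plain, well-documented intermediate format first.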
Next, consider what type of data you have: spreadsheet entries, images, text (PDF, DOC, etc.), or data stored in a dedicated program's own format. You should also consider whether the data adhere to some sort of standard in their present form, or will have to in the future (for example, SQL). The problem is that there are lots of data formats, and some databases contain multiple formats (e.g., CSV, Dbase2, DIF, Excel, HTML, SQL, SYLK, Text, WKS, Windows metafiles), increasing the complexity of the transfer. Thus it is key to start out knowing what you actually have before figuring out what can be done with it.
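A quick way to find out what you actually have is a simple inventory. The sketch below counts files by extension under a legacy data folder; the folder name is a placeholder for wherever the old files live.

```python
# Quick inventory: count files by extension under a legacy data folder.
# "legacy_data" is a placeholder path.
from collections import Counter
from pathlib import Path

counts = Counter(p.suffix.lower() or "(no extension)"
                 for p in Path("legacy_data").rglob("*") if p.is_file())

for extension, count in counts.most_common():
    print(f"{extension}: {count} file(s)")
```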
What are your options? Well, the first one certainly has some appeal – avoid it! Simply leave the old data where it is, on the expectation that it will be needed rarely if at all, and let it stay put until that system dies, hoping it is never needed again. Attractive, but not always feasible. The next option is to work incrementally: start by moving the most recent (and thus most likely to be requested) data, then progressively work back in time until everything has been transitioned. This can take a while, but if time is not a pressing issue it has its benefits. Finally, there is the option to use existing tools or services – there are companies that do this! As with any service, you need to shop around: compare prices, ask for guarantees, verify the time frame, find out what happens if data are lost, and so on. If you have the resources, this is often your best and easiest option.
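For the incremental option, the idea is simply to move the newest (most likely to be requested) records first and keep working back in time. The sketch below assumes a visits table with a visit_date column in a SQLite file; the table, column, date range, and the migrate_rows() stand-in are all placeholders for your own schema and import routine.

```python
# Sketch of incremental migration: newest records first, then work back
# one year at a time. Names ("visits", "visit_date") and migrate_rows()
# are placeholders for your own schema and import routine.
import sqlite3

def migrate_rows(rows):
    """Stand-in for whatever actually loads records into the new system."""
    print(f"  migrated {len(rows)} record(s)")

conn = sqlite3.connect("legacy.db")

for year in range(2015, 1999, -1):           # most recent year first
    rows = conn.execute(
        "SELECT * FROM visits WHERE strftime('%Y', visit_date) = ?",
        (str(year),),
    ).fetchall()
    if rows:
        print(f"Year {year}:")
        migrate_rows(rows)

conn.close()
```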
What are some of the more common problems people encounter?
- Single columns/rows used for more than one purpose or data point
- Column values are determined based on data in other columns (i.e., formulas are used)
- There are missing data
- There are missing or extra rows and/or columns
- Data values change (e.g., hours vs minutes in the same column)
- Data format differs (e.g., dates are 10-1-15 vs Oct 1, 2015 vs January 10, 2015)
- There are relationships between data in different columns that are not explicit
- Special characters are used and/or used inconsistently
- Default values differ
- Hidden data fields
All of these typically can be solved in the data cleansing phase, but they need to be identified before they can be dealt with, and this is the difficult and often tedious part. The problem is compounded by the fact that many older databases were set up and used by one or two people who are no longer available or have themselves forgotten all the rules and shortcuts they developed!
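Many of the problems above can be caught with small, repeatable checks during cleansing. As an illustration, the sketch below normalizes inconsistent date formats and flags missing values in the exported CSV; the column names and the list of accepted formats are assumptions you would adapt to your own data.

```python
# Cleansing sketch: normalize inconsistent date formats to ISO 8601 and
# flag missing values. Column names and DATE_FORMATS are placeholders.
import csv
from datetime import datetime

DATE_FORMATS = ("%m-%d-%y", "%b %d, %Y", "%B %d, %Y", "%Y-%m-%d")

def normalize_date(value):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

with open("patients_export.csv", newline="") as infile:
    for line_number, row in enumerate(csv.DictReader(infile), start=2):
        if normalize_date(row.get("visit_date", "")) is None:
            print(f"Row {line_number}: unparseable date {row.get('visit_date')!r}")
        if not row.get("patient_id", "").strip():
            print(f"Row {line_number}: missing patient_id")
```

The same pattern extends to the other issues: encode each rule you discover as a check, so it is applied consistently across the whole data set instead of relying on someone's memory.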
There is an interesting option for researchers called REDCap (Research Electronic Data Capture), available at http://project-redcap.org/. This is a mature, secure, web-based tool for building and managing online surveys and databases for research studies. It is available free of charge but is not open-source: you can join the consortium to get the source code under a valid end-user license agreement through Vanderbilt University. It may be necessary to involve your Tech Transfer or General Counsel Office, but it is a fairly straightforward process. REDCap can be installed in a variety of environments (e.g., Mac, Windows, Linux, Unix) and is compliant with HIPAA, 21 CFR Part 11, FISMA, and other international standards.
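If the destination is a REDCap project, its API can be used to push cleaned records in programmatically (provided your instance has API access enabled). The sketch below is only illustrative: the URL, token, and field names are placeholders, and you should confirm the exact parameters against your own instance's API documentation.

```python
# Hedged sketch of importing cleaned records into a REDCap project via its
# API. URL, token, and field names are placeholders; check your instance's
# API documentation for the parameters it actually expects.
import json
import requests

records = [
    {"record_id": "1001", "visit_date": "2015-10-01"},   # hypothetical fields
]

response = requests.post(
    "https://redcap.example.edu/api/",        # your institution's API URL
    data={
        "token": "YOUR_API_TOKEN",            # project-specific API token
        "content": "record",
        "action": "import",
        "format": "json",
        "type": "flat",
        "data": json.dumps(records),
        "returnContent": "count",
        "returnFormat": "json",
    },
)
print(response.status_code, response.text)    # typically a count of imported records
```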
So don't let the data migration demons haunt you! It can be challenging, but if you utilize available resources, invest the necessary preparation time and resources (yes, financial too), strategize, plan, and think about how you will ensure that all the data you wanted to migrate actually made it over (i.e., validation), half of the challenge is already met.
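On that last point, validation can be as simple as reconciling the legacy export against an export from the new system. A minimal sketch, assuming both sides can produce a CSV keyed by a patient_id field (file and column names are placeholders):

```python
# Validation sketch: confirm every legacy record made it into the new
# system by comparing key fields. File and column names are placeholders.
import csv

def load_ids(path, key="patient_id"):
    with open(path, newline="") as f:
        return {row[key] for row in csv.DictReader(f)}

source_ids = load_ids("patients_export.csv")
target_ids = load_ids("new_system_export.csv")

print(f"Source records: {len(source_ids)}, migrated records: {len(target_ids)}")
missing = source_ids - target_ids
if missing:
    print(f"{len(missing)} record(s) did not make it over, e.g. {sorted(missing)[:5]}")
else:
    print("All source records are present in the new system.")
```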