Clean Messy and Missing Data in Tables - MATLAB & Simulink (2024)

Open Live Script

This example shows how to clean and reorganize a table that has messy and missing data values. First, you can identify missing data by using the Import Tool or by using functions such as the summary and ismissing functions. You can standardize, fill, or remove missing values by using the standardizeMissing, fillmissing, or rmmissing functions. Then you can clean your table further by reorganizing it. You can rearrange table rows and variables by using functions such as the sortrows and movevars functions.

Examine Data in File

Examine the data in the sample comma-separated value (CSV) file, messy.csv, by using the Import Tool. The tool previews the data and enables you to specify how to import the data. To examine messy.csv in the Import Tool, after opening this example in MATLAB®, go to the Home tab, click Import Data in the Variable section, and open messy.csv using the file selection dialog box.

The Import Tool shows that messy.csv has five columns with text and numeric values.

Clean Messy and Missing Data in Tables- MATLAB & Simulink (1)

The file contains many different missing data indicators:

  • Empty text

  • Period (.)

  • NA

  • NaN

  • -99

The Import Tool automatically recognizes (but does not visually highlight) some missing data indicators, such as NaN in numeric columns and empty text in text columns.

The tool highlights other indicators, such as the empty text, period, and NA, that occur in column B. These values are not standard missing values. However, nonnumeric values in a numeric column are likely to represent missing values. When you import the data, you can specify that these values should also be treated as missing values.

When numeric data otherwise consists of positive values, a single negative value, such as -99, can be a flag for missing data. If a number such as -99 represents missing data in your table, then you must specify that it is a missing value when you clean your table.

Import Data as Table

You can import data into the MATLAB workspace from the Import Tool. You can also use the readtable function to read the data from a file and import it as a table.

Import the data in messy.csv using the readtable function. To read text data into table variables that are string arrays, use the TextType name-value argument. To treat specified nonnumeric values in numeric columns as missing values, use the TreatAsMissing name-value argument. For columns A and C with text data, readtable imports any empty text as missing strings, which display as <missing>. For columns B, D, and E with numeric data, readtable imports empty text as NaN values, and also imports . and NA as NaN values when you specify them using TreatAsMissing. However, the values that are -99 remain unchanged because they are numeric.

messyTable = readtable("messy.csv","TextType","string","TreatAsMissing",[".","NA"])
messyTable=21×5 table A B C D E ______ ____ _________ ____ ____ "afe1" NaN "yes" 3 3 "egh3" NaN "no" 7 7 "wth4" -99 "yes" -99 -99 "atn2" 23.7 <missing> NaN 23.7 "arg1" NaN "yes" 5 NaN "jre3" 34.6 "yes" 34.6 34.6 "wen9" 234 "yes" 234 234 "ple2" 2 "no" 2 2 "dbo8" 5 "no" 5 5 "oii4" 5 "yes" 5 5 "wnk3" 245 "yes" 245 245 "abk6" 563 "no" 563 563 "pnj5" 463 "no" 463 463 "wnn3" 6 "no" 6 6 "oks9" 23 "yes" 23 23 "wba3" 14 "yes" 14 14 ⋮

View Summary of Table

To view a summary of the table, use the summary function. The summary shows the data type and other descriptive statistics for each table variable. For example, summary shows the number of missing values in each numeric variable of messyTable.

summary(messyTable)
Variables: A: 21x1 string B: 21x1 double Values: Min -99 Median 22.5 Max 563 NumMissing 3 C: 21x1 string D: 21x1 double Values: Min -99 Median 14 Max 563 NumMissing 2 E: 21x1 double Values: Min -99 Median 21.5 Max 563 NumMissing 1 

Find Rows with Missing Values

To find the rows of messyTable that have at least one missing value, use the ismissing function. If you have nonstandard missing values in your data, such as -99, you can specify it along with the standard missing values.

The output of ismissing is a logical array that identifies the elements of messyTable that have missing values.

missingElements = ismissing(messyTable,{string(missing),NaN,-99})
missingElements = 21x5 logical array 0 1 0 0 0 0 1 0 0 0 0 1 0 1 1 0 0 1 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ⋮

To create a logical vector that identifies rows that have missing values, use the any function.

rowsWithMissingValues = any(missingElements,2)
rowsWithMissingValues = 21x1 logical array 1 1 1 1 1 0 0 0 0 0 ⋮

To index into the table and return only the rows that have missing values, use the logical vector rowsWithMissingValues.

missingValuesTable = messyTable(rowsWithMissingValues,:)
missingValuesTable=6×5 table A B C D E ______ ____ _________ ___ ____ "afe1" NaN "yes" 3 3 "egh3" NaN "no" 7 7 "wth4" -99 "yes" -99 -99 "atn2" 23.7 <missing> NaN 23.7 "arg1" NaN "yes" 5 NaN "gry5" 21 "yes" NaN 21

Fill Missing Values

One strategy for cleaning the missing values in a table is to replace them with more meaningful values. You can replace nonstandard missing values by inserting standard missing values. Then you can fill missing values with adjusted values. For example, you can fill missing values with their nearest neighbors or with the mean value of a table variable.

In this example, -99 is a nonstandard value for indicating a missing value. To replace the instances of -99 with standard missing values, use the standardizeMissing function. NaN is the standard missing value for single- and double-precision floating-point numeric arrays.

messyTable = standardizeMissing(messyTable,-99)
messyTable=21×5 table A B C D E ______ ____ _________ ____ ____ "afe1" NaN "yes" 3 3 "egh3" NaN "no" 7 7 "wth4" NaN "yes" NaN NaN "atn2" 23.7 <missing> NaN 23.7 "arg1" NaN "yes" 5 NaN "jre3" 34.6 "yes" 34.6 34.6 "wen9" 234 "yes" 234 234 "ple2" 2 "no" 2 2 "dbo8" 5 "no" 5 5 "oii4" 5 "yes" 5 5 "wnk3" 245 "yes" 245 245 "abk6" 563 "no" 563 563 "pnj5" 463 "no" 463 463 "wnn3" 6 "no" 6 6 "oks9" 23 "yes" 23 23 "wba3" 14 "yes" 14 14 ⋮

To fill missing values, use the fillmissing function. The function provides many methods that fill missing values. For example, fill missing values with their nearest neighbors that are not missing values.

filledTable = fillmissing(messyTable,"nearest")
filledTable=21×5 table A B C D E ______ ____ _____ ____ ____ "afe1" 23.7 "yes" 3 3 "egh3" 23.7 "no" 7 7 "wth4" 23.7 "yes" 7 23.7 "atn2" 23.7 "yes" 5 23.7 "arg1" 34.6 "yes" 5 34.6 "jre3" 34.6 "yes" 34.6 34.6 "wen9" 234 "yes" 234 234 "ple2" 2 "no" 2 2 "dbo8" 5 "no" 5 5 "oii4" 5 "yes" 5 5 "wnk3" 245 "yes" 245 245 "abk6" 563 "no" 563 563 "pnj5" 463 "no" 463 463 "wnn3" 6 "no" 6 6 "oks9" 23 "yes" 23 23 "wba3" 14 "yes" 14 14 ⋮

Remove Rows with Missing Values

Another strategy for cleaning the missing values in a table is to remove the rows that have them.

To remove rows that have missing values, use the rmmissing function.

remainingTable = rmmissing(messyTable)
remainingTable=15×5 table A B C D E ______ ____ _____ ____ ____ "jre3" 34.6 "yes" 34.6 34.6 "wen9" 234 "yes" 234 234 "ple2" 2 "no" 2 2 "dbo8" 5 "no" 5 5 "oii4" 5 "yes" 5 5 "wnk3" 245 "yes" 245 245 "abk6" 563 "no" 563 563 "pnj5" 463 "no" 463 463 "wnn3" 6 "no" 6 6 "oks9" 23 "yes" 23 23 "wba3" 14 "yes" 14 14 "pkn4" 2 "no" 2 2 "adw3" 22 "no" 22 22 "poj2" 34.6 "yes" 34.6 34.6 "bas8" 23 "no" 23 23

Sort and Rearrange Table Rows and Variables

Once you have cleaned the missing values in a table, you can organize it in other ways. For example, you can sort the rows of a table by the values in one or more variables.

Sort the rows by the values in the first variable, A.

sortedTable = sortrows(remainingTable)
sortedTable=15×5 table A B C D E ______ ____ _____ ____ ____ "abk6" 563 "no" 563 563 "adw3" 22 "no" 22 22 "bas8" 23 "no" 23 23 "dbo8" 5 "no" 5 5 "jre3" 34.6 "yes" 34.6 34.6 "oii4" 5 "yes" 5 5 "oks9" 23 "yes" 23 23 "pkn4" 2 "no" 2 2 "ple2" 2 "no" 2 2 "pnj5" 463 "no" 463 463 "poj2" 34.6 "yes" 34.6 34.6 "wba3" 14 "yes" 14 14 "wen9" 234 "yes" 234 234 "wnk3" 245 "yes" 245 245 "wnn3" 6 "no" 6 6

Sort the rows in descending order by C, and then sort in ascending order by A.

sortedBy2Vars = sortrows(remainingTable,["C","A"],["descend","ascend"])
sortedBy2Vars=15×5 table A B C D E ______ ____ _____ ____ ____ "jre3" 34.6 "yes" 34.6 34.6 "oii4" 5 "yes" 5 5 "oks9" 23 "yes" 23 23 "poj2" 34.6 "yes" 34.6 34.6 "wba3" 14 "yes" 14 14 "wen9" 234 "yes" 234 234 "wnk3" 245 "yes" 245 245 "abk6" 563 "no" 563 563 "adw3" 22 "no" 22 22 "bas8" 23 "no" 23 23 "dbo8" 5 "no" 5 5 "pkn4" 2 "no" 2 2 "ple2" 2 "no" 2 2 "pnj5" 463 "no" 463 463 "wnn3" 6 "no" 6 6

Sorting by C, the rows are grouped first by "yes", followed by "no". Then sorting by A, the rows are listed alphabetically.

To reorder the table so that A and C are next to each other, use the movevars function.

sortedRowsAndMovedVars = movevars(sortedBy2Vars,"C","After","A")
sortedRowsAndMovedVars=15×5 table A C B D E ______ _____ ____ ____ ____ "jre3" "yes" 34.6 34.6 34.6 "oii4" "yes" 5 5 5 "oks9" "yes" 23 23 23 "poj2" "yes" 34.6 34.6 34.6 "wba3" "yes" 14 14 14 "wen9" "yes" 234 234 234 "wnk3" "yes" 245 245 245 "abk6" "no" 563 563 563 "adw3" "no" 22 22 22 "bas8" "no" 23 23 23 "dbo8" "no" 5 5 5 "pkn4" "no" 2 2 2 "ple2" "no" 2 2 2 "pnj5" "no" 463 463 463 "wnn3" "no" 6 6 6

You can also reorder table variables by indexing. Use smooth parentheses and a variable index that specifies the order of the variables in the output table.

sortedRowsAndMovedVars = sortedBy2Vars(:,["A","C","B","D","E"])
sortedRowsAndMovedVars=15×5 table A C B D E ______ _____ ____ ____ ____ "jre3" "yes" 34.6 34.6 34.6 "oii4" "yes" 5 5 5 "oks9" "yes" 23 23 23 "poj2" "yes" 34.6 34.6 34.6 "wba3" "yes" 14 14 14 "wen9" "yes" 234 234 234 "wnk3" "yes" 245 245 245 "abk6" "no" 563 563 563 "adw3" "no" 22 22 22 "bas8" "no" 23 23 23 "dbo8" "no" 5 5 5 "pkn4" "no" 2 2 2 "ple2" "no" 2 2 2 "pnj5" "no" 463 463 463 "wnn3" "no" 6 6 6

See Also

table | ismissing | standardizeMissing | fillmissing | rmmissing | sortrows | movevars | Import Tool | readtable | summary

Related Topics

  • Missing Data in MATLAB
  • Clean Messy Data and Locate Extrema Using Live Editor Tasks
  • Access Data in Tables
  • Add and Delete Table Rows
  • Add, Delete, and Rearrange Table Variables
  • Rename and Describe Table Variables
  • Summarize or Pivot Data in Tables Using Groups
Clean Messy and Missing Data in Tables
- MATLAB & Simulink (2024)
Top Articles
Latest Posts
Article information

Author: Manual Maggio

Last Updated:

Views: 6259

Rating: 4.9 / 5 (49 voted)

Reviews: 80% of readers found this page helpful

Author information

Name: Manual Maggio

Birthday: 1998-01-20

Address: 359 Kelvin Stream, Lake Eldonview, MT 33517-1242

Phone: +577037762465

Job: Product Hospitality Supervisor

Hobby: Gardening, Web surfing, Video gaming, Amateur radio, Flag Football, Reading, Table tennis

Introduction: My name is Manual Maggio, I am a thankful, tender, adventurous, delightful, fantastic, proud, graceful person who loves writing and wants to share my knowledge and understanding with you.