Big Data fraud detection using multiple medicare data sources

Abstract In the United States, advances in technology and medical sciences continue to improve the general well-being of the population. With this continued progress, programs such as Medicare are needed to help manage the high costs associated with quality healthcare. Unfortunately, there are indiv...

Bibliographic Details
Main Authors: Matthew Herland, Taghi M. Khoshgoftaar, Richard A. Bauder
Format: Article
Language: English
Published: SpringerOpen 2018-09-01
Series: Journal of Big Data
Subjects: Big Data, U.S. Medicare, LEIE, Fraud detection
Online Access: http://link.springer.com/article/10.1186/s40537-018-0138-3
id doaj-3a30ebad615e46278f96366d7aa6ee31
record_format Article
spelling doaj-3a30ebad615e46278f96366d7aa6ee31 | 2020-11-25T00:57:30Z | eng | SpringerOpen | Journal of Big Data | 2196-1115 | 2018-09-01 | 51121 | 10.1186/s40537-018-0138-3 | Big Data fraud detection using multiple medicare data sources | Matthew Herland 0 | Taghi M. Khoshgoftaar 1 | Richard A. Bauder 2 | Florida Atlantic University | Florida Atlantic University | Florida Atlantic University | Abstract In the United States, advances in technology and medical sciences continue to improve the general well-being of the population. With this continued progress, programs such as Medicare are needed to help manage the high costs associated with quality healthcare. Unfortunately, there are individuals who commit fraud for nefarious reasons and personal gain, limiting Medicare’s ability to effectively provide for the healthcare needs of the elderly and other qualifying people. To minimize fraudulent activities, the Centers for Medicare and Medicaid Services (CMS) released a number of “Big Data” datasets for different parts of the Medicare program. In this paper, we focus on the detection of Medicare fraud using the following CMS datasets: (1) Medicare Provider Utilization and Payment Data: Physician and Other Supplier (Part B), (2) Medicare Provider Utilization and Payment Data: Part D Prescriber (Part D), and (3) Medicare Provider Utilization and Payment Data: Referring Durable Medical Equipment, Prosthetics, Orthotics and Supplies (DMEPOS). Additionally, we create a fourth dataset which is a combination of the three primary datasets. We discuss data processing for all four datasets and the mapping of real-world provider fraud labels using the List of Excluded Individuals and Entities (LEIE) from the Office of the Inspector General. Our exploratory analysis on Medicare fraud detection involves building and assessing three learners on each dataset. Based on the Area under the Receiver Operating Characteristic (ROC) Curve performance metric, our results show that the Combined dataset with the Logistic Regression (LR) learner yielded the best overall score at 0.816, closely followed by the Part B dataset with LR at 0.805. Overall, the Combined and Part B datasets produced the best fraud detection performance with no statistical difference between these datasets, over all the learners. Therefore, based on our results and the assumption that there is no way to know within which part of Medicare a physician will commit fraud, we suggest using the Combined dataset for detecting fraudulent behavior when a physician has submitted payments through any or all Medicare parts evaluated in our study. | http://link.springer.com/article/10.1186/s40537-018-0138-3 | Big Data | U.S. Medicare | LEIE | Fraud detection
collection DOAJ
language English
format Article
sources DOAJ
author Matthew Herland
Taghi M. Khoshgoftaar
Richard A. Bauder
title Big Data fraud detection using multiple medicare data sources
publisher SpringerOpen
series Journal of Big Data
issn 2196-1115
publishDate 2018-09-01
description Abstract In the United States, advances in technology and medical sciences continue to improve the general well-being of the population. With this continued progress, programs such as Medicare are needed to help manage the high costs associated with quality healthcare. Unfortunately, there are individuals who commit fraud for nefarious reasons and personal gain, limiting Medicare’s ability to effectively provide for the healthcare needs of the elderly and other qualifying people. To minimize fraudulent activities, the Centers for Medicare and Medicaid Services (CMS) released a number of “Big Data” datasets for different parts of the Medicare program. In this paper, we focus on the detection of Medicare fraud using the following CMS datasets: (1) Medicare Provider Utilization and Payment Data: Physician and Other Supplier (Part B), (2) Medicare Provider Utilization and Payment Data: Part D Prescriber (Part D), and (3) Medicare Provider Utilization and Payment Data: Referring Durable Medical Equipment, Prosthetics, Orthotics and Supplies (DMEPOS). Additionally, we create a fourth dataset which is a combination of the three primary datasets. We discuss data processing for all four datasets and the mapping of real-world provider fraud labels using the List of Excluded Individuals and Entities (LEIE) from the Office of the Inspector General. Our exploratory analysis on Medicare fraud detection involves building and assessing three learners on each dataset. Based on the Area under the Receiver Operating Characteristic (ROC) Curve performance metric, our results show that the Combined dataset with the Logistic Regression (LR) learner yielded the best overall score at 0.816, closely followed by the Part B dataset with LR at 0.805. Overall, the Combined and Part B datasets produced the best fraud detection performance with no statistical difference between these datasets, over all the learners. Therefore, based on our results and the assumption that there is no way to know within which part of Medicare a physician will commit fraud, we suggest using the Combined dataset for detecting fraudulent behavior when a physician has submitted payments through any or all Medicare parts evaluated in our study.
topic Big Data
U.S. Medicare
LEIE
Fraud detection
url http://link.springer.com/article/10.1186/s40537-018-0138-3
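
The description in this record reports results as Area under the ROC Curve (AUC). As a rough illustration of what that metric measures, the sketch below computes AUC as the fraction of (fraud, non-fraud) pairs in which the fraud case receives the higher score. It is not the authors' code: the function name `roc_auc` and the toy labels and scores are hypothetical and chosen only for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    labels: sequence of 0/1 values (1 = fraud; hypothetical toy labels)
    scores: sequence of model scores, where higher means "more likely fraud"
    A tie between a positive and a negative counts as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data invented for illustration only (not taken from the paper).
toy_labels = [0, 0, 1, 0, 1, 0]
toy_scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.05]
print(roc_auc(toy_labels, toy_scores))  # 0.875
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives some context for the reported scores of 0.816 and 0.805. The pairwise loop is quadratic in the number of examples, so a rank-based formulation would be the usual choice for datasets of any real size.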