Big Data fraud detection using multiple medicare data sources

Abstract In the United States, advances in technology and medical sciences continue to improve the general well-being of the population. With this continued progress, programs such as Medicare are needed to help manage the high costs associated with quality healthcare. Unfortunately, there are indiv...


Bibliographic Details
Main Authors:	Matthew Herland, Taghi M. Khoshgoftaar, Richard A. Bauder
Format:	Article
Language:	English
Published:	SpringerOpen 2018-09-01
Series:	Journal of Big Data
Subjects:	Big Data; U.S. Medicare; LEIE; Fraud detection
Online Access:	http://link.springer.com/article/10.1186/s40537-018-0138-3

id	doaj-3a30ebad615e46278f96366d7aa6ee31
record_format	Article
spelling	doaj-3a30ebad615e46278f96366d7aa6ee31 | 2020-11-25T00:57:30Z | eng | SpringerOpen | Journal of Big Data | 2196-1115 | 2018-09-01 | 51121 | 10.1186/s40537-018-0138-3 | Big Data fraud detection using multiple medicare data sources | Matthew Herland (0), Taghi M. Khoshgoftaar (1), Richard A. Bauder (2) | (0) Florida Atlantic University, (1) Florida Atlantic University, (2) Florida Atlantic University | http://link.springer.com/article/10.1186/s40537-018-0138-3 | Big Data; U.S. Medicare; LEIE; Fraud detection
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Matthew Herland Taghi M. Khoshgoftaar Richard A. Bauder
spellingShingle	Matthew Herland Taghi M. Khoshgoftaar Richard A. Bauder Big Data fraud detection using multiple medicare data sources Journal of Big Data Big Data U.S. Medicare LEIE Fraud detection
author_facet	Matthew Herland Taghi M. Khoshgoftaar Richard A. Bauder
author_sort	Matthew Herland
title	Big Data fraud detection using multiple medicare data sources
title_short	Big Data fraud detection using multiple medicare data sources
title_full	Big Data fraud detection using multiple medicare data sources
title_fullStr	Big Data fraud detection using multiple medicare data sources
title_full_unstemmed	Big Data fraud detection using multiple medicare data sources
title_sort	big data fraud detection using multiple medicare data sources
publisher	SpringerOpen
series	Journal of Big Data
issn	2196-1115
publishDate	2018-09-01
description	Abstract In the United States, advances in technology and medical sciences continue to improve the general well-being of the population. With this continued progress, programs such as Medicare are needed to help manage the high costs associated with quality healthcare. Unfortunately, there are individuals who commit fraud for nefarious reasons and personal gain, limiting Medicare’s ability to effectively provide for the healthcare needs of the elderly and other qualifying people. To minimize fraudulent activities, the Centers for Medicare and Medicaid Services (CMS) released a number of “Big Data” datasets for different parts of the Medicare program. In this paper, we focus on the detection of Medicare fraud using the following CMS datasets: (1) Medicare Provider Utilization and Payment Data: Physician and Other Supplier (Part B), (2) Medicare Provider Utilization and Payment Data: Part D Prescriber (Part D), and (3) Medicare Provider Utilization and Payment Data: Referring Durable Medical Equipment, Prosthetics, Orthotics and Supplies (DMEPOS). Additionally, we create a fourth dataset which is a combination of the three primary datasets. We discuss data processing for all four datasets and the mapping of real-world provider fraud labels using the List of Excluded Individuals and Entities (LEIE) from the Office of the Inspector General. Our exploratory analysis on Medicare fraud detection involves building and assessing three learners on each dataset. Based on the Area under the Receiver Operating Characteristic (ROC) Curve performance metric, our results show that the Combined dataset with the Logistic Regression (LR) learner yielded the best overall score at 0.816, closely followed by the Part B dataset with LR at 0.805. Overall, the Combined and Part B datasets produced the best fraud detection performance with no statistical difference between these datasets, over all the learners. Therefore, based on our results and the assumption that there is no way to know within which part of Medicare a physician will commit fraud, we suggest using the Combined dataset for detecting fraudulent behavior when a physician has submitted payments through any or all Medicare parts evaluated in our study.
topic	Big Data U.S. Medicare LEIE Fraud detection
url	http://link.springer.com/article/10.1186/s40537-018-0138-3
work_keys_str_mv	AT matthewherland bigdatafrauddetectionusingmultiplemedicaredatasources AT taghimkhoshgoftaar bigdatafrauddetectionusingmultiplemedicaredatasources AT richardabauder bigdatafrauddetectionusingmultiplemedicaredatasources
_version_	1725223870903877632
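
The abstract describes mapping fraud labels from the LEIE onto providers and scoring learners by the area under the ROC curve (AUC). The following is a minimal sketch of those two steps, not the authors' implementation. The provider identifiers, the exclusion set, the model scores, and the function names `label_providers` and `roc_auc` are all hypothetical and chosen only for illustration; the record does not state how the paper keys providers or computes AUC.

```python
from typing import List, Sequence, Set


def label_providers(provider_ids: Sequence[str], excluded_ids: Set[str]) -> List[int]:
    """Return 1 for each provider found on the exclusion list, else 0.

    Assumption: providers and exclusion entries share a common identifier.
    """
    return [1 if pid in excluded_ids else 0 for pid in provider_ids]


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    AUC equals the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative one, with
    ties counted as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: five providers, two of them on the exclusion list.
provider_ids = ["A", "B", "C", "D", "E"]
excluded_ids = {"B", "E"}
# Hypothetical fraud scores from some learner (higher means more suspicious).
scores = [0.10, 0.80, 0.35, 0.40, 0.30]

labels = label_providers(provider_ids, excluded_ids)
auc = roc_auc(labels, scores)
print(f"labels={labels} auc={auc:.3f}")
```

The pairwise form is quadratic in the number of examples and is used here only because it makes the definition explicit; on datasets of the size the abstract describes, a rank-based computation from an established library would be the practical choice.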

