Big Data fraud detection using multiple medicare data sources
Abstract In the United States, advances in technology and medical sciences continue to improve the general well-being of the population. With this continued progress, programs such as Medicare are needed to help manage the high costs associated with quality healthcare. Unfortunately, there are indiv...
Main Authors: | Matthew Herland, Taghi M. Khoshgoftaar, Richard A. Bauder |
---|---|
Format: | Article |
Language: | English |
Published: | SpringerOpen 2018-09-01 |
Series: | Journal of Big Data |
Subjects: | Big Data, U.S. Medicare, LEIE, Fraud detection |
Online Access: | http://link.springer.com/article/10.1186/s40537-018-0138-3 |
id |
doaj-3a30ebad615e46278f96366d7aa6ee31 |
---|---|
record_format |
Article |
spelling |
doaj-3a30ebad615e46278f96366d7aa6ee31 2020-11-25T00:57:30Z eng SpringerOpen Journal of Big Data 2196-1115 2018-09-01 5 1 1 21 10.1186/s40537-018-0138-3 Big Data fraud detection using multiple medicare data sources Matthew Herland (Florida Atlantic University), Taghi M. Khoshgoftaar (Florida Atlantic University), Richard A. Bauder (Florida Atlantic University) Abstract In the United States, advances in technology and medical sciences continue to improve the general well-being of the population. With this continued progress, programs such as Medicare are needed to help manage the high costs associated with quality healthcare. Unfortunately, there are individuals who commit fraud for nefarious reasons and personal gain, limiting Medicare’s ability to effectively provide for the healthcare needs of the elderly and other qualifying people. To minimize fraudulent activities, the Centers for Medicare and Medicaid Services (CMS) released a number of “Big Data” datasets for different parts of the Medicare program. In this paper, we focus on the detection of Medicare fraud using the following CMS datasets: (1) Medicare Provider Utilization and Payment Data: Physician and Other Supplier (Part B), (2) Medicare Provider Utilization and Payment Data: Part D Prescriber (Part D), and (3) Medicare Provider Utilization and Payment Data: Referring Durable Medical Equipment, Prosthetics, Orthotics and Supplies (DMEPOS). Additionally, we create a fourth dataset which is a combination of the three primary datasets. We discuss data processing for all four datasets and the mapping of real-world provider fraud labels using the List of Excluded Individuals and Entities (LEIE) from the Office of the Inspector General. Our exploratory analysis on Medicare fraud detection involves building and assessing three learners on each dataset.
Based on the Area under the Receiver Operating Characteristic (ROC) Curve performance metric, our results show that the Combined dataset with the Logistic Regression (LR) learner yielded the best overall score at 0.816, closely followed by the Part B dataset with LR at 0.805. Overall, the Combined and Part B datasets produced the best fraud detection performance with no statistical difference between these datasets, over all the learners. Therefore, based on our results and the assumption that there is no way to know within which part of Medicare a physician will commit fraud, we suggest using the Combined dataset for detecting fraudulent behavior when a physician has submitted payments through any or all Medicare parts evaluated in our study. http://link.springer.com/article/10.1186/s40537-018-0138-3 Big Data; U.S. Medicare; LEIE; Fraud detection |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Matthew Herland, Taghi M. Khoshgoftaar, Richard A. Bauder |
title |
Big Data fraud detection using multiple medicare data sources |
publisher |
SpringerOpen |
series |
Journal of Big Data |
issn |
2196-1115 |
publishDate |
2018-09-01 |
description |
Abstract In the United States, advances in technology and medical sciences continue to improve the general well-being of the population. With this continued progress, programs such as Medicare are needed to help manage the high costs associated with quality healthcare. Unfortunately, there are individuals who commit fraud for nefarious reasons and personal gain, limiting Medicare’s ability to effectively provide for the healthcare needs of the elderly and other qualifying people. To minimize fraudulent activities, the Centers for Medicare and Medicaid Services (CMS) released a number of “Big Data” datasets for different parts of the Medicare program. In this paper, we focus on the detection of Medicare fraud using the following CMS datasets: (1) Medicare Provider Utilization and Payment Data: Physician and Other Supplier (Part B), (2) Medicare Provider Utilization and Payment Data: Part D Prescriber (Part D), and (3) Medicare Provider Utilization and Payment Data: Referring Durable Medical Equipment, Prosthetics, Orthotics and Supplies (DMEPOS). Additionally, we create a fourth dataset which is a combination of the three primary datasets. We discuss data processing for all four datasets and the mapping of real-world provider fraud labels using the List of Excluded Individuals and Entities (LEIE) from the Office of the Inspector General. Our exploratory analysis on Medicare fraud detection involves building and assessing three learners on each dataset. Based on the Area under the Receiver Operating Characteristic (ROC) Curve performance metric, our results show that the Combined dataset with the Logistic Regression (LR) learner yielded the best overall score at 0.816, closely followed by the Part B dataset with LR at 0.805. Overall, the Combined and Part B datasets produced the best fraud detection performance with no statistical difference between these datasets, over all the learners. 
Therefore, based on our results and the assumption that there is no way to know within which part of Medicare a physician will commit fraud, we suggest using the Combined dataset for detecting fraudulent behavior when a physician has submitted payments through any or all Medicare parts evaluated in our study. |
topic |
Big Data, U.S. Medicare, LEIE, Fraud detection |
url |
http://link.springer.com/article/10.1186/s40537-018-0138-3 |
_version_ |
1725223870903877632 |