Algorithms for Pattern Mining
Relim
- Midterm Presentation -
Thomas Stening, Thorsten Papenbrock

Agenda
- Relim Algorithm
- Performance and Result Analysis
- Future Work

Pattern Mining - Relim | Thomas Stening, Thorsten Papenbrock | 15 May 2012

Relim Algorithm - Use Case

Medical Claim Analysis:
- 113000 transactions (= number of patients)
- 46 different item values (= different claims)
- 610934 items (= number of claims) → ~5.4 items / transaction

Questions:
- Which claims often occur together?
- Are there any claims that correlate with each other?

Relim Algorithm - Use Case

Claim      Description
AMI        Acute myocardial infarction
APPCHOL    Appendicitis
ARTHSPIN   Arthropathies
CANCRA     Cancer A
CANCRB     Cancer B
CANCRM     Ovarian and metastatic cancer
CATAST     Catastrophic conditions
CHF        Congestive heart failure
COPD       Chronic obstructive pulmonary disorder
FLaELEC    Fluid and electrolyte
FXDISLC    Fractures and dislocations
GIBLEED    Gastrointestinal bleeding
GIOBSENT   Gastr. Inflam. bowel disease and obstruction
GYNEC1     Gynecology
GYNECA     Gynecologic cancers
HEART2     Other cardiac conditions
HEART4     Atherosclerosis and peripheral vascular disease
HEMTOL     Non-malignant hematologic
HIPFX      Hip fracture
INFEC4     All other infections
LIVERDZ    Liver disorders
METAB1     Diabetic ketoacidosis and related metabolic
METAB3     Other metabolic
MISCHRT    Miscellaneous cardiac
MISCL1     Miscellaneous 1
MISCL5     Miscellaneous 3
MSC2a3     Miscellaneous 2
NEUMENT    Other neurological
ODaBNCA    Ingestions and benign tumors
PERINTL    Perinatal period
PERVALV    Pericarditis
PNCRDZ     Pancreatic disorders
PNEUM      Pneumonia
PRGNCY     Pregnancy
RENAL1     Acute renal failure
RENAL2     Chronic renal failure
RENAL3     Other renal
RESPR4     Acute respiratory
ROAMI      Chest pain
SEIZURE    Seizure
SEPSIS     Sepsis
SKNAUT     Skin and autoimmune disorders
STROKE     Stroke
TRAUMA     All other trauma
UTI        Urinary tract infections

Relim Algorithm - Use Case

[Figure: bar chart of the frequency of each claim; y-axis: Frequency (0 to 90000); x-axis categories in slide order: MSC2a3, ARTHSPIN, METAB3, RESPR4, NEUMENT, SKNAUT, INFEC4, MISCHRT, GIBLEED, TRAUMA, MISCL5, ODaBNCA, ROAMI, UTI, RENAL3, COPD, GYNEC1, HEART4, HEART2, HEMTOL, FXDISLC, AMI, SEIZURE, CANCRB, NON_DEFINED, GIOBSENT, APPCHOL, PRGNCY, PNEUM, CHF, MISCL1, STROKE, RENAL2, FLaELEC, LIVERDZ, METAB1, GYNECA, PERVALV, CATAST, HIPFX, CANCRA, PERINTL, PNCRDZ, CANCRM, RENAL1, SEPSIS]

Relim Algorithm - Motivation

- Aim: Test and optimize an alternative to today's most common algorithms
- Comparison:

Name        Simplicity         Performance
Apriori     Easier             Slower
Eclat       Harder             Similar (*)
FP-Growth   Harder / Similar   Faster

Relim Algorithm - Data Structure Generation

1. Load transactions (in memory)
2. Count item frequencies (1st iteration)
3. Delete all rare items from the transactions (1st iteration)
4. Sort each transaction according to the item frequencies (1st iteration)

Item frequencies in the example: g 1, f 2, e 3, a 4, c 5, b 7, d 8
(the rare items g and f are deleted; the remaining items are sorted by ascending frequency)

Transaction   After steps 3 and 4
a d f         a d
c d e         e c d
b d           b d
a b c d       a c b d
b c           c b
a b d         a b d
b d e         e b d
b c e g       e c b
c d f         c d
a b d         a b d
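
Steps 2 to 4 of the data structure generation can be sketched in a few lines of Python. This is only an illustration, not the Java implementation used for the measurements: the function name `preprocess`, the minimum support of 3 and the alphabetical tie-break are assumptions made for the example, since the slides only show that g and f are deleted.

```python
from collections import Counter

# Example transactions as shown on the slide (one list per transaction).
TRANSACTIONS = [
    ["a", "d", "f"], ["c", "d", "e"], ["b", "d"], ["a", "b", "c", "d"],
    ["b", "c"], ["a", "b", "d"], ["b", "d", "e"], ["b", "c", "e", "g"],
    ["c", "d", "f"], ["a", "b", "d"],
]


def preprocess(transactions, min_support):
    """Steps 2-4: count items, delete rare items, sort by ascending frequency."""
    # Step 2: count in how many transactions each item occurs.
    freq = Counter(item for t in transactions for item in set(t))
    prepared = []
    for t in transactions:
        # Step 3: keep only items that reach the (assumed) minimum support.
        kept = [item for item in set(t) if freq[item] >= min_support]
        # Step 4: rarest item first; ties are broken alphabetically
        # (an assumption, the slides do not specify a tie-break).
        kept.sort(key=lambda item: (freq[item], item))
        if kept:
            prepared.append(kept)
    return freq, prepared


freq, prepared = preprocess(TRANSACTIONS, min_support=3)
print(sorted(freq.items(), key=lambda pair: pair[1]))
for t in prepared:
    print(" ".join(t))
```

With the assumed minimum support of 3, the printed transactions match the sorted transactions of the slide example (a d, e c d, b d, a c b d, ...).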

Relim Algorithm - Data Structure Generation

5. Create the Relim data structure (2nd iteration)

Each transaction is assigned to its first (rarest) item; the rest of the transaction is stored as a suffix in the list of that item. The counter of an item is the number of transactions that start with it.

Item      e  a  c  b  d
Counter   3  4  2  1  0

Suffix lists:
e: c d | b d | c b
a: d | c b d | b d | b d
c: b | d
b: d
d: (empty)

Relim Algorithm - Recursive Tree Processing

The items are processed from the rarest to the most frequent one, starting with e.

Side recursion, Prefix: e (the suffixes of e form a new, smaller data structure)

Item      e  a  c  b  d
Counter   0  0  2  1  0

Suffix lists:
c: b | d
b: d

Main recursion (e is eliminated; its suffixes are reassigned to the lists of their first items)

Item      e  a  c  b  d
Counter   0  4  4  2  0

Suffix lists:
a: d | c b d | b d | b d
c: b | d | d | b
b: d | d
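
The data structure and the two recursions can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' Java code: the names `add`, `build` and `relim`, the minimum support of 3, and the choice to write the main recursion as a loop are all assumptions made for the example. The input is the already sorted example database from the slides.

```python
# Item order (ascending frequency) and the prepared example transactions.
ITEMS = ["e", "a", "c", "b", "d"]
PREPARED = [t.split() for t in [
    "a d", "e c d", "b d", "a c b d", "c b",
    "a b d", "e b d", "e c b", "c d", "a b d",
]]


def add(heads, transaction):
    """File a transaction under its first item; keep the rest as a suffix."""
    if not transaction:
        return
    head = heads[transaction[0]]
    head[0] += 1  # counter: number of transactions starting with this item
    if len(transaction) > 1:
        head[1].append(transaction[1:])


def build(items, transactions):
    """Step 5: one [counter, suffix list] pair per item."""
    heads = {item: [0, []] for item in items}
    for t in transactions:
        add(heads, t)
    return heads


def relim(items, transactions, min_support, prefix=()):
    """Yield (item set, support) for every frequent item set."""
    heads = build(items, transactions)
    for pos, item in enumerate(items):
        count, suffixes = heads[item]
        if count >= min_support:
            yield prefix + (item,), count
            # Side recursion: mine the suffixes with the extended prefix.
            yield from relim(items[pos + 1:], suffixes, min_support,
                             prefix + (item,))
        # Main recursion (written as a loop here): eliminate the item
        # and reassign its suffixes to the lists of their first items.
        for suffix in suffixes:
            add(heads, suffix)
        heads[item] = [0, []]


print([counter for counter, _ in build(ITEMS, PREPARED).values()])
for itemset, support in relim(ITEMS, PREPARED, min_support=3):
    print(" ".join(itemset), support)
```

The first printed line shows the initial counters 3 4 2 1 0 from the slide. With the assumed minimum support of 3, the sketch reports 11 frequent item sets for the example, among them a b d with support 3.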

Performance and Result Analysis - Setup

System:
- DELL StudioXPS
- 64 bit Windows 7 Enterprise
- Intel Core i5 M520 2.40 GHz
- 4 GB RAM

Implementation:
- Java
- Basic algorithm without algorithmic optimizations

Performance and Result Analysis - Scaling MinSupport
[Figure: Execution Time and Frequent Item Sets plotted over MinSupport; y-axes: ms (0 to 5000) and # (0 to 1000); x-axis ticks: 1 to 49]
Transactions: 113000 | DifferentItems: 46 | TransactionSize: Ø 5.4 | MinSupport: varied (%)

Performance and Result Analysis - Scaling Transactions

[Figure: Execution Time and Frequent Item Sets plotted over the number of transactions; y-axes: ms (0 to 2000) and # (85 to 110); x-axis ticks: 1000 to 100000]
Transactions: varied | DifferentItems: 46 | TransactionSize: Ø 5.4 | MinSupport: 10 %

Performance and Result Analysis - Scaling DifferentItems

[Figure: Execution Time, Frequent Item Sets and Average Transaction Size plotted over the number of different items; y-axes: ms (0 to 1400) and # (0 to 140); x-axis ticks: 1 to 45]
Transactions: 80944 | DifferentItems: varied | TransactionSize: Ø varied | MinSupport: 10 %

Performance and Result Analysis - Example Results

[MISCHRT, METAB3, ARTHSPIN] - 13188 occurrences
- Miscellaneous cardiac → heart disease
- Other metabolic → metabolic problems
- Arthropathies → joint diseases

[RESPR4, METAB3, ARTHSPIN] - 13123 occurrences
- Acute respiratory → breathing difficulties
- Other metabolic → metabolic problems
- Arthropathies → joint diseases

[TRAUMA, NEUMENT] - 11826 occurrences
- All other trauma → trauma
- Other neurological → diseases of the nervous system

Future Work

1. Fuzzy Data Mining:
- Some claims might not have been detected or reported, but they are still very frequent
- Try to find Frequent Item Sets although the data is incomplete
2. Use Case Specialization:
- Finding rules in the data
- Considering the time between claims might deliver better results for the Frequent Claim Sets
3. Parallelization:
- Transforming recursion steps into thread branches might increase performance

Sources

Papers: (links as of 10 May 2012)
- "Keeping Things Simple: Finding Frequent Item Sets by Recursive Elimination", Christian Borgelt, http://www.borgelt.net/relim.html
- "Mining Fuzzy Frequent Item Sets", Xiaomeng Wang, Christian Borgelt, and Rudolf Kruse, http://www.borgelt.net/relix.html

Images: (links as of 10 May 2012)
- http://regrounding.files.wordpress.com/2011/07/doctor-talking-to-patient.jpg
- http://searchtrafficpro.com/wp-content/uploads/2010/02/mining-the-search-query-report.jpg#
- http://www.bogensportwelt.de/bilder/produkte/gross/523208_Fernglas-ZEISS-Conquest-8-x-30-T.jpg
- http://www.wittewarenhandel.de/bilder/156227.jpg