Supplementary website for

“Reconstruction of genetic association networks from microarray data:

A partial least squares approach”

Vasyl Pihur, Somnath Datta, Susmita Datta*

*e-mail: susmita.datta_AT_louisville.edu

 

Abstract

 

Motivation: Gene association networks provide vast amounts of information about essential processes inside the cell. A complete picture of gene-gene interactions would open new horizons for biologists, ranging from pure appreciation to successful manipulation of biological pathways for therapeutic purposes. Therefore, identification of important biological complexes whose members (genes and their protein products) interact with each other is of prime importance. Numerous experimental methods exist but, for the most part, they are costly and labor-intensive. Computational techniques, such as the one proposed in this work, provide a quick “budget” solution that can be used as a screening tool before more expensive techniques are attempted. Here, we introduce a novel computational method based on the Partial Least Squares (PLS) regression technique for reconstruction of genetic networks from microarray data.

 

Results: The proposed PLS-based method is shown to be an effective screening procedure for the detection of gene-gene interactions from microarray data. Both simulated and real microarray experiments show that the PLS-based approach is superior to its competitors in terms of both performance and applicability.

 

_________________________________________________________

 

 

Download

R-Code (distributed as is, without warranty; requires the base distribution of R and the R package locfdr from http://www.r-project.org/)

Additional Results for the Simulated Data

Additional Results for the Real Data

Distribution function of the overall scores for two groups

 

_________________________________________________________

 

Effect of the number of PLS components on the performance of the genetic network procedure

 

 

 

 

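| fdr \ # of components | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 15 | 20 | 25 | 30 | 35 | 40 | 45 | 50 | 55 | 60 | 65 | 70 | 75 | 80 | 85 | 90 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.0001 | 766 | 1293 | 989 | 754 | 579 | 428 | 333 | 261 | 204 | 156 | 51 | 24 | 17 | 16 | 15 | 15 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 |
| 0.01 | 1321 | 1861 | 1509 | 1232 | 1001 | 776 | 623 | 506 | 413 | 327 | 130 | 61 | 43 | 39 | 37 | 36 | 37 | 37 | 37 | 37 | 37 | 37 | 37 | 37 | 37 | 37 |
| 0.1 | 1894 | 2399 | 2084 | 1757 | 1503 | 1251 | 1052 | 877 | 721 | 592 | 267 | 144 | 108 | 98 | 96 | 96 | 97 | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96 |
| 0.2 | 2157 | 2679 | 2377 | 2055 | 1807 | 1531 | 1322 | 1122 | 937 | 779 | 364 | 211 | 162 | 152 | 148 | 150 | 152 | 151 | 151 | 151 | 151 | 151 | 151 | 151 | 151 | 151 |
| 0.3 | 2370 | 2921 | 2632 | 2312 | 2073 | 1780 | 1570 | 1345 | 1138 | 953 | 455 | 280 | 227 | 213 | 205 | 209 | 209 | 209 | 208 | 207 | 208 | 208 | 208 | 208 | 208 | 208 |
| 0.4 | 2569 | 3152 | 2883 | 2560 | 2323 | 2043 | 1820 | 1585 | 1343 | 1138 | 559 | 354 | 295 | 279 | 274 | 277 | 278 | 277 | 277 | 277 | 277 | 277 | 277 | 277 | 277 | 277 |
| 0.5 | 2780 | 3381 | 3125 | 2822 | 2594 | 2311 | 2095 | 1849 | 1588 | 1357 | 672 | 446 | 377 | 359 | 349 | 354 | 354 | 354 | 352 | 352 | 352 | 352 | 352 | 352 | 352 | 352 |
| 0.6 | 3010 | 3611 | 3381 | 3108 | 2905 | 2629 | 2409 | 2152 | 1881 | 1622 | 816 | 557 | 481 | 457 | 444 | 449 | 450 | 450 | 449 | 449 | 449 | 449 | 449 | 449 | 449 | 449 |
| 0.7 | 3287 | 3886 | 3666 | 3440 | 3256 | 2994 | 2799 | 2532 | 2261 | 1969 | 1006 | 701 | 619 | 592 | 571 | 579 | 582 | 582 | 579 | 579 | 579 | 579 | 579 | 579 | 579 | 579 |
| 0.8 | 3658 | 4205 | 4016 | 3837 | 3678 | 3450 | 3288 | 3048 | 2767 | 2464 | 1278 | 916 | 817 | 792 | 765 | 776 | 780 | 781 | 777 | 776 | 777 | 777 | 777 | 777 | 777 | 777 |
| 0.9 | 4219 | 4673 | 4556 | 4418 | 4255 | 4101 | 3968 | 3777 | 3551 | 3193 | 1767 | 1303 | 1174 | 1144 | 1096 | 1112 | 1119 | 1119 | 1113 | 1112 | 1114 | 1113 | 1113 | 1113 | 1113 | 1113 |
| 0.99 | 5072 | 5817 | 5543 | 5382 | 5264 | 5305 | 4967 | 4951 | 4766 | 4431 | 3199 | 2254 | 2115 | 2231 | 2025 | 2068 | 2089 | 2091 | 2078 | 2077 | 2080 | 2079 | 2079 | 2079 | 2079 | 2079 |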

 

The average number of discovered interactions for the simulated data (averaged over 10 datasets), by fdr (false discovery rate) level and number of PLS components. Beyond two components, the number of discovered interactions generally decreases as the number of PLS components increases. At about 30 components, the number of discovered interactions stabilizes for each level of fdr.
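As a rough illustration of how the convergence point quoted above can be read off the table, the following Python sketch takes the fdr = 0.01 and fdr = 0.1 rows and reports the smallest number of components from which every count stays within a relative tolerance of the final (90-component) count. The function name `converged_at` and the 10% tolerance are assumptions made for this illustration only; they are not part of the distributed R code.

```python
# Numbers of PLS components used as table columns: 1..10, then 15..90 in steps of 5.
components = list(range(1, 11)) + list(range(15, 95, 5))

# Two rows copied from the table above (average number of discovered interactions).
rows = {
    0.01: [1321, 1861, 1509, 1232, 1001, 776, 623, 506, 413, 327,
           130, 61, 43, 39, 37, 36, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37],
    0.1: [1894, 2399, 2084, 1757, 1503, 1251, 1052, 877, 721, 592,
          267, 144, 108, 98, 96, 96, 97, 96, 96, 96, 96, 96, 96, 96, 96, 96],
}


def converged_at(components, counts, rel_tol=0.10):
    """Smallest number of components from which all later counts stay
    within rel_tol (an assumed tolerance) of the final count."""
    final = counts[-1]
    stable_from = len(counts) - 1
    # Walk backwards until a count falls outside the tolerance band.
    for i in range(len(counts) - 1, -1, -1):
        if abs(counts[i] - final) <= rel_tol * final:
            stable_from = i
        else:
            break
    return components[stable_from]


for fdr, counts in rows.items():
    print(fdr, converged_at(components, counts))
# 0.01 30
# 0.1 30
```

With a stricter or looser tolerance the reported point shifts, so the result should be read as "about 30" rather than as a sharp cutoff.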

 

_________________________________________________________

 

 

Performance of Scores based on estimated PLS coefficients

(see the last but one paragraph of Section 2 in the paper)

 

1.  The locfdr procedure may break down if one uses the estimated PLS coefficients as measures of association/interaction (not recommended). The table below shows the number of such breakdowns among 100 simulated data sets.

 

| # of components | 3 | 4 | 5 | 6 | 7 | 8 | 10 | 15 |
|---|---|---|---|---|---|---|---|---|
| # of datasets for which locfdr breaks down | 77 | 77 | 70 | 50 | 39 | 35 | 24 | 14 |

 

 

 

2.  The scores proposed in the paper result in better sensitivity/specificity than the corresponding procedure based on estimated PLS coefficients (as in the last but one paragraph of Section 2 in the paper).

 

Since the locfdr procedure may break down when using the PLS coefficients, we can either calculate sensitivity/specificity only for those simulated data sets for which the locfdr control works, or set the thresholds so that the same number of "interactions" is discovered by the two procedures (note that the locfdr control works with the scores introduced in the paper). In either case, the estimated PLS coefficients lead to worse performance than the scores introduced in the paper.
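The second option, matching the number of discoveries, can be sketched as follows in Python. Each procedure keeps its k highest-scoring candidate gene pairs, where k would be the number of interactions found by the locfdr-controlled procedure, and sensitivity and specificity are then computed against the known network. The function names, the four-gene network, and all score values below are made up for illustration; they are not taken from the paper or from the distributed R code.

```python
from itertools import combinations


def top_k_edges(scores, k):
    """Return the k candidate edges with the largest absolute score."""
    ranked = sorted(scores, key=lambda edge: abs(scores[edge]), reverse=True)
    return set(ranked[:k])


def sensitivity_specificity(discovered, true_edges, all_edges):
    """Sensitivity and specificity of a set of discovered edges."""
    tp = len(discovered & true_edges)
    fp = len(discovered - true_edges)
    fn = len(true_edges - discovered)
    tn = len(all_edges) - tp - fp - fn
    return tp / (tp + fn), tn / (tn + fp)


# Toy network on 4 genes: 6 candidate pairs, 2 of which truly interact.
all_edges = set(combinations(range(4), 2))
true_edges = {(0, 1), (2, 3)}

# Hypothetical association measures from two procedures (made-up values).
scores_a = {(0, 1): 0.9, (0, 2): 0.1, (0, 3): 0.2,
            (1, 2): 0.05, (1, 3): 0.3, (2, 3): 0.8}
scores_b = {(0, 1): 0.7, (0, 2): 0.6, (0, 3): 0.1,
            (1, 2): 0.2, (1, 3): 0.65, (2, 3): 0.3}

k = 2  # same number of "interactions" for both procedures
result_a = sensitivity_specificity(top_k_edges(scores_a, k), true_edges, all_edges)
result_b = sensitivity_specificity(top_k_edges(scores_b, k), true_edges, all_edges)
print(result_a)  # (1.0, 1.0)
print(result_b)  # (0.5, 0.75)
```

Fixing k for both procedures removes the dependence on locfdr succeeding, so the comparison reflects only how well each measure ranks the true interactions.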

 

Here are some examples:

 

(In these plots, the term "PLS-like Scores" indicates the network discovery procedure using the scores introduced in the paper, and the term "True PLS scores" indicates the network discovery procedure using the estimated PLS coefficients.)

 

 

 

3.  The scores proposed in the paper result in better performance than the corresponding procedure based on estimated PLS coefficients even for the real data set, as shown by the following table:

 

| Fdr level | Discovered | PLS-like Scores True Positive | PLS Scores True Positive |
|---|---|---|---|
| 0.01 | 29 | 4 | 0 |
| 0.1 | 118 | 17 | 1 |
| 0.13 | 160 | 17 | 1 |
| 0.16 | 210 | 19 | 1 |
| 0.2 | 272 | 22 | 15 |
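To put the two true-positive columns on a common scale, the short Python sketch below divides each by the number of discovered interactions at the same fdr level. It assumes that the "Discovered" count applies to both procedures, which is how the single column reads; the function name `true_positive_fractions` is introduced here for illustration only.

```python
# Rows copied from the table above:
# (fdr level, discovered, true positives with PLS-like scores,
#  true positives with estimated PLS coefficients)
rows = [
    (0.01, 29, 4, 0),
    (0.1, 118, 17, 1),
    (0.13, 160, 17, 1),
    (0.16, 210, 19, 1),
    (0.2, 272, 22, 15),
]


def true_positive_fractions(rows):
    """Fraction of discovered interactions that are true positives,
    per fdr level, for each of the two procedures."""
    return [
        (fdr, tp_scores / discovered, tp_coefs / discovered)
        for fdr, discovered, tp_scores, tp_coefs in rows
    ]


fractions = true_positive_fractions(rows)
for fdr, frac_scores, frac_coefs in fractions:
    print(f"fdr={fdr}: PLS-like scores {frac_scores:.3f}, PLS coefficients {frac_coefs:.3f}")
```

At fdr level 0.01, for example, this gives 4/29 (about 0.138) for the PLS-like scores against 0/29 for the estimated PLS coefficients.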