The sensitivity of SETIEasy should be increased by replacing the droot=4 Minkowski metric in the full hyperspace with a projected distance on the hyperline (as used in QBEST, e.g., http://www.pharm.uky.edu/ASRG/pdfs/atheromas.pdf). The hyperspace problem is one we have seen before in Speakeasy implementations of the SOB algorithm. In SETIEasy v1.33 the C code:
float Statpack1::fsob(float **b_train,float *c_train,
float **b_test,float *c_test){
unsigned long j,n,k;
unsigned long m = b*2;
float r=0.0;
float xsum=0.0,ysum=0.0;
float xysum=0.0;
float xsqsum=0.0,ysqsum=0.0;
float nnnum=0.0,dddenom=0.0;
unsigned long d = cols;
unsigned long cdf_len = b*2;
float ntest = sqrt(b);
float ntn = sqrt(b);
float *ecdf;
float *tcdf;
float droot = 4;
ecdf = new float[cdf_len+1];
tcdf = new float[cdf_len+1];
for(n=0;n<b;++n){
for(k=0;k<d;++k){
ecdf[n] += pow(fabs(b_train[n][k]-c_train[k]), droot)*ntn;
ecdf[n+b] += pow(fabs(b_test[n][k]-c_train[k]), droot)*ntest;
}
ecdf[n] = sqrt(sqrt(ecdf[n]));
ecdf[n+b] = sqrt(sqrt(ecdf[n+b]));
tcdf[n]=ecdf[n];
tcdf[n+b]=ecdf[n];
}
sortarray(ecdf-1, cdf_len);
sortarray(tcdf-1, cdf_len);
unsigned long templen = ((m-(m/10))-((m/10)+1))+1;
for(j=0; j<templen; ++j){
xsum += ecdf[j];
ysum += tcdf[j];
xysum += ecdf[j]*tcdf[j];
xsqsum += ecdf[j]*ecdf[j];
ysqsum += tcdf[j]*tcdf[j];
}
nnnum=(templen*xysum)-(xsum*ysum);
dddenom=sqrt((templen*xsqsum)-(xsum*xsum));
dddenom=dddenom*sqrt((templen*ysqsum)-(ysum*ysum));
r=nnnum/dddenom;
delete [] ecdf;
delete [] tcdf;
return r;
}
is essentially the same thing as the Speakeasy
LISTING OF SUBROUTINE SOB
1 SUBROUTINE SOB(BTRAIN,CTN,TESTSPEC,BNUM,ECDF,TCDF,R)
2 $ This routine takes 2 spectral groups (with = numbers of samples)
3 $ and calculates an ECDF and TCDF for QQ plotting. The routine
4 $ also returns a correlation coefficient between the 2 CDFs.
5 $ Training replicates and center must be provided, along with
6 $ test spectra. BNUM = the number of replicates used.
7 $ Note: you may have to set EPSILON(1e-200) or greater to prevent
8 $ underflow errors in high dimensional hyperspaces
9 REPLICA(TESTSPEC,BNUM,BTEST,CTEST) $ create test set replicates
10 M = NOROWS(BTRAIN)*2 $ number of replicates in CDFs
11 P = 0.1 $ extent of trimming
12 D = NOCOLS(BTRAIN) $ dimensions in hyperspace
13 ST = (SUMROWS((BTRAIN-CTN)**D))**(1/D) $ gives Minkowski distances
14 SX = (SUMROWS((BTEST -CTN)**D))**(1/D) $ gives Minkowski distances
15 TRIM = INTEGERS((M*P)+1,M-(M*P))
16 ECDF = ST,SX
17 TCDF = ST,ST
18 ECDF = RANKED(ECDF)
19 TCDF = RANKED(TCDF)
20 ECDF = ECDF(TRIM)
21 TCDF = TCDF(TRIM)
22 MULTIREGRESSION(TCDF,ECDF:RES,M)
23 R = M
24 END
Replacing SOB with PSOB should greatly increase the sensitivity of SETIEasy. The projected distances are calculated in lines 16-28 and 31-43.
LISTING OF SUBROUTINE PSOB
1 SUBROUTINE PSOB(BTRAIN,CTN,TESTSPEC,BNUM,ECDF,TCDF,R)
2 $ THIS ROUTINE TAKES 2 SPECTRAL GROUPS (WITH = NUMBERS OF SAMPLES)
3 $ AND CALCULATES AN ECDF AND TCDF FOR QQ PLOTTING. THE ROUTINE
4 $ ALSO RETURNS A CORRELATION COEFFICIENT BETWEEN THE 2 CDFS.
5 $ DISTANCES ARE ALL PROJECTED ON THE HYPERLINE CONNECTING THE
6 $ CENTERS OF THE TRAINING SET AND TEST SET.
7 $ TRAINING REPLICATES AND CENTER MUST BE PROVIDED, ALONG WITH
8 $ TEST SPECTRA. BNUM = THE NUMBER OF REPLICATES USED.
9 $ NOTE: YOU MAY HAVE TO SET EPSILON(1E-200) OR GREATER TO PREVENT
10 $ UNDERFLOW ERRORS IN HIGH DIMENSIONAL HYPERSPACES
11 REPLICA(TESTSPEC,BNUM,BTEST,CTEST) $ CREATE TEST SET REPLICATES
12 M = NOROWS(BTRAIN)*2 $ NUMBER OF REPLICATES IN CDFS
13 P = 0.1 $ EXTENT OF TRIMMING
14 D = 2 $ DIMENSIONS IN HYPERSPACE
15 RADFRAC = 1.0 $ RADIUS OF HYPERCYLINDER AS FRACTION
OF PTS
16 B = NOROWS(BTRAIN)
17 S02=SQRT(SUMSQ(CTEST-CTN))
18 S0R=SQRT(SUMSQROWS(BTRAIN-CTN))
19 S2R=SQRT(SUMSQROWS(BTRAIN-CTEST))
20 SUB=(S02+S0R+S2R)/2
21 AREA=SQRT(SUB*(SUB-S02)*(SUB-S0R)*(SUB-S2R))
22 RADIAL=(2*AREA)/S02
23 PROJECT=SQRT(S0R**2-RADIAL**2)
24 WHERE ((S02**2+S0R**2).LT.(S2R**2)) PROJECT=-PROJECT
25 QDIST=PROJECT
26 QR = RANKED(RADIAL) ; RADII = QR(RADFRAC*B) $ SET RADIUS
27 WHERE (RADIAL.GT.RADII) QDIST=0 $ ZEROS PTS. OUTSIDE CYLINDER
28 QDIST=QDIST(LOCS(QDIST)) $ THROW AWAY ZEROED POINTS
29 ST = QDIST
30 $
31 B = NOROWS(BTEST)
32 S02=SQRT(SUMSQ(CTEST-CTN))
33 S0R=SQRT(SUMSQROWS(BTEST-CTN))
34 S2R=SQRT(SUMSQROWS(BTEST-CTEST))
35 SUB=(S02+S0R+S2R)/2
36 AREA=SQRT(SUB*(SUB-S02)*(SUB-S0R)*(SUB-S2R))
37 RADIAL=(2*AREA)/S02
38 PROJECT=SQRT(S0R**2-RADIAL**2)
39 WHERE ((S02**2+S0R**2).LT.(S2R**2)) PROJECT=-PROJECT
40 QDIST=PROJECT
41 QR = RANKED(RADIAL) ; RADII = QR(RADFRAC*B) $ SET RADIUS
42 WHERE (RADIAL.GT.RADII) QDIST=0 $ ZEROS PTS. OUTSIDE CYLINDER
43 QDIST=QDIST(LOCS(QDIST)) $ THROW AWAY ZEROED POINTS
44 SX = QDIST
45 TRIM = INTEGERS((M*P)+1,M-(M*P))
46 ECDF = ST,SX
47 TCDF = ST,ST
48 ECDF = RANKED(ECDF)
49 TCDF = RANKED(TCDF)
50 ECDF = ECDF(TRIM)
51 TCDF = TCDF(TRIM)
52 MULTIREGRESSION(TCDF,ECDF:RES,M)
53 R = M
54 END