I am still new to Java, and I am currently working on a program that takes two strings as arguments and returns the number of mismatched pairs. I am working with the letters A, T, G, and C because, in DNA base pairing, A always pairs with T and G always pairs with C. I can't quite figure out how to iterate over the strings and check that the first character in string one (T, for example) lines up with its intended partner (A) in string two; if it doesn't, it is a mismatched pair and should be added to a counter that is totaled at the end. I believe I can use something called charAt(), but I am unsure of how that works.
I also need to figure out how to take the absolute value of counter before it is added to finalCounter. The main reason is that I only care about the length difference between the two strings, and I don't want to have to make sure that the shorter length is subtracted from the longer one.
Any help would be greatly appreciated!
```
public class CountMismatches {
    public static void main(String[] args) {
        {
            String seq1 = "TTCGATGGAGCTGTA";
            String seq2 = "TAGCTAGCTCGGCATGA";
            System.out.println(count_mismatches(seq1, seq2));
            //*expected to print out 5 because there are 3 mismatched pairs and 2 that do not have a pair*
        }
    }
    public static int count_mismatches(String seq1, String seq2) {
        int mismatchCount = 0;
        int counter = seq1.length() - seq2.length();
        int finalCounter = mismatchCount + counter;
        for(int i = 0; i < seq1.length(); i++) if (seq1.charAt(i) == seq2.charAt(i)) {
            break; //checks to see if the length of seq1 and seq2 are the same
        }
        for(int i = 0; i < seq1.length(); i++) if (seq1.charAt(i) != seq2.charAt(i)) {
            return counter; //figure out how to do absolute value for negative numbers
        }
        return finalCounter;
    }
}
```
CodePudding user response:
Since you only want to count the positions where the pairing fails, iterate up to the length of the shorter string and count the positions where the two characters are not a valid pair. At the end, add the absolute difference between the lengths of seq1 and seq2 and return that total to main. For the logic, four if conditions suffice: check whether the character is A, T, G, or C and whether its partner is present at the same index in the other string.
public class CountMismatches {
    public static void main(String[] args) {
        String seq1 = "TTCGATGGAGCTGTA";
        String seq2 = "TAGCTAGCTCGGCATGA";
        System.out.println(count_mismatches(seq1, seq2));
    }
    public static int count_mismatches(String seq1, String seq2) {
        int finalCounter = 0;
        // Compare only the positions that exist in both strings.
        for (int i = 0; i < Math.min(seq1.length(), seq2.length()); i++) {
            char c1 = seq1.charAt(i);
            char c2 = seq2.charAt(i);
            if (c1 == 'A') {
                if (c2 == 'T')
                    continue;
                else
                    finalCounter++;
            } else if (c1 == 'T') {
                if (c2 == 'A')
                    continue;
                else
                    finalCounter++;
            } else if (c1 == 'G') {
                if (c2 == 'C')
                    continue;
                else
                    finalCounter++;
            } else if (c1 == 'C') {
                if (c2 == 'G')
                    continue;
                else
                    finalCounter++;
            }
        }
        // Every unpaired trailing character counts as one mismatch.
        return finalCounter + Math.abs(seq1.length() - seq2.length());
    }
}
and the output is as follows :
5
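On the two specific points in the question, charAt() and absolute values, here is a minimal sketch. The class name CharAtAbsDemo and the helper lengthGap are made up for illustration; only String.charAt, String.length, and Math.abs come from the standard library.

```java
class CharAtAbsDemo {
    // Hypothetical helper: absolute difference between the two lengths,
    // so the order of the arguments does not matter.
    static int lengthGap(String a, String b) {
        return Math.abs(a.length() - b.length());
    }

    public static void main(String[] args) {
        String seq1 = "TTCGATGGAGCTGTA";
        String seq2 = "TAGCTAGCTCGGCATGA";

        // charAt(i) returns the char at zero-based index i.
        char first = seq1.charAt(0); // 'T'
        char third = seq1.charAt(2); // 'C'
        System.out.println(first + " " + third);

        // 15 - 17 is -2, but Math.abs turns it into 2 either way round.
        System.out.println(lengthGap(seq1, seq2));
        System.out.println(lengthGap(seq2, seq1));
    }
}
```

Note that charAt throws an exception if the index is negative or not less than length(), which is why the loop above stops at the shorter length.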
CodePudding user response:
Make these refactorings:
- To make the comparisons easy to code and understand, create a Map whose entries are each pair (both directions)
- Iterate over the Strings up to the length of the shortest one, adding up the number of matching pairs as you go
- The result is the length of the longest String minus the number of pairs
Like this:
// requires: import java.util.Map;
public static int count_mismatches(String seq1, String seq2) {
    Map<Character, Character> pairs = Map.of('A', 'T', 'T', 'A', 'G', 'C', 'C', 'G');
    int count = 0; // number of valid pairs found
    for (int i = 0; i < Math.min(seq1.length(), seq2.length()); i++) {
        if (pairs.get(seq1.charAt(i)) == seq2.charAt(i)) {
            count++;
        }
    }
    return Math.max(seq1.length(), seq2.length()) - count;
}
See live demo, which returns 5 for your sample input.
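One caveat with the Map lookup: get() returns null for a letter that is not a key, and comparing that null with a char unboxes it and throws a NullPointerException. If the input might contain other letters, a null check avoids that. This is only a sketch of one way to do it; the class name PairMapSketch is an assumption, and treating unknown letters as mismatches is my choice, not something the question specifies.

```java
import java.util.Map;

class PairMapSketch {
    private static final Map<Character, Character> PAIRS =
            Map.of('A', 'T', 'T', 'A', 'G', 'C', 'C', 'G');

    static int countMismatches(String seq1, String seq2) {
        int shorter = Math.min(seq1.length(), seq2.length());
        int matches = 0;
        for (int i = 0; i < shorter; i++) {
            // null means the letter has no partner in the map; checking for it
            // first avoids unboxing a null Character in the comparison.
            Character expected = PAIRS.get(seq1.charAt(i));
            if (expected != null && expected == seq2.charAt(i)) {
                matches++;
            }
        }
        // Everything that is not a valid pair, including the unpaired tail,
        // counts as a mismatch.
        return Math.max(seq1.length(), seq2.length()) - matches;
    }

    public static void main(String[] args) {
        System.out.println(countMismatches("TTCGATGGAGCTGTA", "TAGCTAGCTCGGCATGA"));
    }
}
```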
CodePudding user response:
Good Evening,
Something seems off here. This snippet of code:
for(int i = 0; i < seq1.length(); i++)
    if (seq1.charAt(i) == seq2.charAt(i)) {
        break; //checks to see if the length of seq1 and seq2 are the same
    }
does not do what you think it does. The loop walks through every index of seq1 (via i < seq1.length()) and checks whether the character at that index equals the character at the same index in seq2; at the first equal pair, break exits the loop. It never compares the lengths of the two strings.
This means that a correction is in order:
int countMismatches = 0;
for(int i = 0; i < seq1.length(); i++){
    switch(seq1.charAt(i)){
        case 'A':
            // A must be paired with T; anything else is a mismatch
            if(seq2.charAt(i) != 'T') countMismatches++;
            break;
    }
}
Repeat this process for the other letters and, voilà, you should be able to count your mismatches this way.
Do be careful with sequences of different lengths: as soon as the index steps past the end of the shorter string, charAt() throws an IndexOutOfBoundsException, indicating that you tried to read a character that does not exist.
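Putting the two remarks together, a complete version might look roughly like this. It is a sketch under a few assumptions of mine: the loop is bounded by the shorter string to avoid the exception, the leftover characters are added as mismatches (as the question asks), and any letter other than A, T, G, or C counts as a mismatch. The class name SwitchSketch is made up.

```java
class SwitchSketch {
    static int countMismatches(String seq1, String seq2) {
        // Stop at the shorter string so charAt never goes out of bounds.
        int shorter = Math.min(seq1.length(), seq2.length());
        int countMismatches = 0;
        for (int i = 0; i < shorter; i++) {
            char other = seq2.charAt(i);
            boolean paired;
            switch (seq1.charAt(i)) {
                case 'A': paired = other == 'T'; break;
                case 'T': paired = other == 'A'; break;
                case 'G': paired = other == 'C'; break;
                case 'C': paired = other == 'G'; break;
                default:  paired = false; break; // unknown letter (assumption: mismatch)
            }
            if (!paired) countMismatches++;
        }
        // Characters without a partner in the other string are mismatches too.
        return countMismatches + Math.abs(seq1.length() - seq2.length());
    }

    public static void main(String[] args) {
        System.out.println(countMismatches("TTCGATGGAGCTGTA", "TAGCTAGCTCGGCATGA"));
    }
}
```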
CodePudding user response:
First, find out which string is the shorter one, and record the difference between the two lengths while you are at it. Then use the shorter length as the terminating condition of your for loop. Inside the loop, you can use booleans to check whether a valid pair is present before incrementing the counter with an if statement.
The absolute value of any number can be obtained by calling the static method abs() of the Math class. Finally, add the mismatch count to the absolute value of the length difference to obtain the result.
Here is my solution.
public class App {
    public static void main(String[] args) throws Exception {
        String seq1 = "TTCGATGGAGCTGTA";
        String seq2 = "TAGCTAGCTCGGCATGA";
        System.out.println(compareStrings(seq1, seq2));
    }
    public static int compareStrings(String stringOne, String stringTwo) {
        Character A = 'A', T = 'T', G = 'G', C = 'C';
        int mismatchCount = 0;
        int lowestStringLength = 0;
        int length_one = stringOne.length();
        int length_two = stringTwo.length();
        int length_difference = 0;
        if (length_one < length_two) { // string one is shorter
            lowestStringLength = length_one;
            length_difference = length_one - length_two;
        } else if (length_one > length_two) { // string two is shorter
            lowestStringLength = length_two;
            length_difference = length_two - length_one;
        } else { // lengths must be equal, use either
            lowestStringLength = length_one;
            length_difference = 0; // there is no difference because they are equal
        }
        for (int i = 0; i < lowestStringLength; i++) {
            // A matches with T
            // G matches with C
            // evaluate which of the valid pairs, if any, is present at index i
            boolean A_T_PRESENT = stringOne.charAt(i) == A && stringTwo.charAt(i) == T;
            boolean G_C_PRESENT = stringOne.charAt(i) == G && stringTwo.charAt(i) == C;
            boolean T_A_PRESENT = stringOne.charAt(i) == T && stringTwo.charAt(i) == A;
            boolean C_G_PRESENT = stringOne.charAt(i) == C && stringTwo.charAt(i) == G;
            boolean TWO_EQUAL = stringOne.charAt(i) == stringTwo.charAt(i);
            // identical characters never form a valid pair, so count a mismatch
            if (TWO_EQUAL) {
                mismatchCount++;
                continue;
            }
            // all pair booleans evaluated to false, which means the characters are
            // not a proper match. Increment mismatchCount
            else if (!A_T_PRESENT && !G_C_PRESENT && !T_A_PRESENT && !C_G_PRESENT) {
                mismatchCount++;
                continue;
            } else {
                continue;
            }
        }
        // calculate the sum of the mismatches plus the abs of the length difference
        length_difference = Math.abs(length_difference);
        return mismatchCount + length_difference;
    }
}
CodePudding user response:
Avoid char
The char type is legacy, essentially broken. As a 16-bit value, char is physically incapable of representing most characters. The char type in your particular case would work. But using char is a bad habit generally, as such code may break when encountering any of about 75,000 characters defined in Unicode.
Code point
Use code point integer numbers instead. A code point is the number assigned to each of the over 140,000 characters defined by the Unicode Consortium.
Here we get an IntStream, a series of int values, one for each character in the input string. Then we collect these integer numbers into an array of int values.
int[] codePoints1 = seq1.codePoints().toArray() ;
int[] codePoints2 = seq2.codePoints().toArray() ;
You said the input strings may be of unequal length. So our two arrays may be jagged, of different lengths. Figure out the size of the shorter array.
int smallerSize = Math.min( codePoints1.length , codePoints2.length ) ;
Keep track of the index number of mismatched rows.
List<Integer> mismatchIndices = new ArrayList <>();
Loop the arrays based on that smaller size.
for( int i = 0 ; i < smallerSize ; i++ )
{
    if ( isBasePairValid( codePoints1[ i ] , codePoints2[ i ] ) )
    {
        …
    } else
    {
        mismatchIndices.add( i ) ;
    }
}
Write an isBasePairValid method
Write the isBasePairValid method, taking two arguments, the code points of the two nucleobase letters.
static int A = "A".codePointAt( 0 ) ; // Annoying zero-based index counting. So first character is number zero.
static int C = "C".codePointAt( 0 ) ;
static int G = "G".codePointAt( 0 ) ;
static int T = "T".codePointAt( 0 ) ;
static boolean isBasePairValid( int first , int second )
{
    if( first == A ) return ( second == T ) ;
    else if( first == T ) return ( second == A ) ;
    else if( first == C ) return ( second == G ) ;
    else if( first == G ) return ( second == C ) ;
    else { throw new IllegalStateException( … ) ; }
}
Count the mismatches.
int countMismatches = mismatchIndices.size() ;
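Assembled into one class, the pieces above could look something like the following. Treat it as a sketch: the class name CodePointSketch, the method name mismatchIndices, and the exception message are placeholders I made up, and the loop body simply records the index when the pair is invalid.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointSketch {
    static final int A = "A".codePointAt(0);
    static final int C = "C".codePointAt(0);
    static final int G = "G".codePointAt(0);
    static final int T = "T".codePointAt(0);

    static boolean isBasePairValid(int first, int second) {
        if (first == A) return second == T;
        else if (first == T) return second == A;
        else if (first == C) return second == G;
        else if (first == G) return second == C;
        // Placeholder message; the input is assumed to contain only A, C, G, T.
        else throw new IllegalStateException("Unexpected code point: " + first);
    }

    // Returns the zero-based positions where the two sequences do not pair up,
    // looking only at positions present in both.
    static List<Integer> mismatchIndices(String seq1, String seq2) {
        int[] codePoints1 = seq1.codePoints().toArray();
        int[] codePoints2 = seq2.codePoints().toArray();
        int smallerSize = Math.min(codePoints1.length, codePoints2.length);
        List<Integer> mismatchIndices = new ArrayList<>();
        for (int i = 0; i < smallerSize; i++) {
            if (!isBasePairValid(codePoints1[i], codePoints2[i])) {
                mismatchIndices.add(i);
            }
        }
        return mismatchIndices;
    }

    public static void main(String[] args) {
        System.out.println(mismatchIndices("TTCGATGGAGCTGTA", "TAGCTAGCTCGGCATGA"));
    }
}
```

For the sample input this reports positions 0, 6, and 11. The list covers only the overlapping part of the two sequences, so to reach the total of 5 expected in the question you would still add the absolute difference of the two array lengths.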
CodePudding user response:
The numerical sum of chars T & A and G & C is fixed and unique for legal nucleobase pairs. So you just need to ensure that the corresponding bases have one of those sums.
String seq1 = "TTCGATGGAGCTGTA";
String seq2 = "TAGCTAGCTCGGCATGA";
System.out.println(count_mismatches(seq1, seq2));
prints
5
- find the shorter of the two lengths to iterate over
- establish fixed sums for comparison
- iterate and compare to expected pairing and update count appropriately
public static int count_mismatches(String seq1, String seq2) {
    int len1 = seq1.length();
    int len2 = seq2.length();
    // iterate only over the shorter of the two lengths
    int len = len1;
    if (len1 > len2) {
        len = len2;
    }
    int sumTA = 'T' + 'A';
    int sumGC = 'G' + 'C';
    // start with the unpaired trailing characters
    int misMatchCount = Math.abs(len1-len2);
    for (int i = 0; i < len; i++) {
        int pair = seq1.charAt(i) + seq2.charAt(i);
        if (pair != sumTA && pair != sumGC) {
            misMatchCount++;
        }
    }
    return misMatchCount;
}
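If you want to convince yourself that the two sums really are unambiguous, a brute-force check over all 16 ordered combinations of the four bases is enough. This is just a verification sketch; the class and method names are invented for the purpose.

```java
class SumUniquenessCheck {
    // Returns true when a pair of bases has one of the two "legal" sums
    // exactly when it is a legal pairing (A-T, T-A, G-C, C-G).
    static boolean sumsAreUnambiguous() {
        char[] bases = {'A', 'C', 'G', 'T'};
        int sumTA = 'T' + 'A';
        int sumGC = 'G' + 'C';
        for (char x : bases) {
            for (char y : bases) {
                int sum = x + y;
                boolean legal = (x == 'A' && y == 'T') || (x == 'T' && y == 'A')
                        || (x == 'G' && y == 'C') || (x == 'C' && y == 'G');
                boolean looksLegal = sum == sumTA || sum == sumGC;
                if (legal != looksLegal) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(sumsAreUnambiguous());
    }
}
```

The check only covers the letters A, C, G, and T. With arbitrary characters the sums can collide (for example 'B' + 'S' equals 'T' + 'A'), so the trick relies on the input being restricted to the four bases.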
